Owen O'Malley created SPARK-28208:
-------------------------------------

             Summary: When upgrading to ORC 1.5.6, the reader needs to be closed.
                 Key: SPARK-28208
                 URL: https://issues.apache.org/jira/browse/SPARK-28208
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Owen O'Malley


As part of the ORC 1.5.6 release, we optimized the common pattern of:
{code:java}
// Open the file (reading its footer), then create a row-level reader from it.
Reader reader = OrcFile.createReader(...);
RecordReader rows = reader.rows(...);
{code}

which used to open one file handle in the Reader and a second one in the 
RecordReader. Users saw this as a regression when moving from the old 
Hive-based Spark ORC reader to the new native reader, because it issued twice 
as many file-open requests to the NameNode.

In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in 
the Reader until the Reader is either closed or a RecordReader is created from 
it. This cuts the number of file-open requests to the NameNode in half in 
typical Spark applications. (Hive's ORC code avoided this problem by putting 
the file footer into the input splits, but that approach has other problems.)
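
For reference, a minimal sketch of the read path that benefits from this 
optimization, using the standard ORC 1.5 API; the class name and file path 
are illustrative assumptions:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;

public class OrcScanExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // With ORC 1.5.6 this sequence opens the file once: the handle
    // acquired by createReader() is handed off to the RecordReader.
    Reader reader = OrcFile.createReader(new Path("/tmp/data.orc"),
        OrcFile.readerOptions(conf));
    RecordReader rows = reader.rows();
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    while (rows.nextBatch(batch)) {
      // process batch.cols here...
    }
    rows.close();  // closes the single underlying file handle
  }
}
{code}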

To get the new optimization without leaking file handles, Spark needs to 
close the readers that aren't used to create RecordReaders.
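
A hedged sketch of the case Spark has to handle: a Reader opened only for 
metadata (e.g. the schema) that never produces a RecordReader must now be 
closed explicitly, since it retains the cached file handle. The helper class 
below is hypothetical, not actual Spark code:
{code:java}
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.TypeDescription;

public class OrcSchemaProbe {  // hypothetical helper for illustration
  public static TypeDescription readSchema(Configuration conf, Path file)
      throws IOException {
    Reader reader = OrcFile.createReader(file, OrcFile.readerOptions(conf));
    try {
      // Metadata-only access: no RecordReader is created, so the
      // Reader still owns the file handle it opened.
      return reader.getSchema();
    } finally {
      // Required with ORC 1.5.6; without this, the cached handle leaks.
      reader.close();
    }
  }
}
{code}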


