[ https://issues.apache.org/jira/browse/SPARK-28208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-28208:
------------------------------------

    Assignee: (was: Apache Spark)

> When upgrading to ORC 1.5.6, the reader needs to be closed.
> -----------------------------------------------------------
>
>                 Key: SPARK-28208
>                 URL: https://issues.apache.org/jira/browse/SPARK-28208
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Owen O'Malley
>            Priority: Major
>
> As part of the ORC 1.5.6 release, we optimized the common pattern of:
> {code:java}
> Reader reader = OrcFile.createReader(...);
> RecordReader rows = reader.rows(...);{code}
> which used to open one file handle in the Reader and a second one in the
> RecordReader. Users saw this as a regression when moving from the old
> Spark ORC reader (which went through Hive) to the new native reader, because
> it opened twice as many files on the NameNode.
> In ORC 1.5.6, we changed the ORC library so that it keeps the file handle in
> the Reader until the Reader is either closed or a RecordReader is created
> from it. This has cut the number of file-open requests on the NameNode in
> half in typical Spark applications. (Hive's ORC code avoided this problem by
> putting the file footer into the input splits, but that approach has other
> problems.)
> To get the new optimization without leaking file handles, Spark needs to
> close any Reader that is not used to create a RecordReader.
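A minimal sketch of the pattern the issue asks for (illustrative only, not Spark's actual code; the path and configuration are hypothetical, and it assumes {{Reader}} is {{Closeable}} as of ORC 1.5.6). A Reader opened only to inspect metadata never hands its file handle off to a RecordReader, so the caller must close it explicitly:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.TypeDescription;

Configuration conf = new Configuration();
// Metadata-only use: no RecordReader is ever created from this Reader,
// so without an explicit close() the file handle would leak.
try (Reader reader = OrcFile.createReader(new Path("/tmp/data.orc"),  // hypothetical path
                                          OrcFile.readerOptions(conf))) {
  TypeDescription schema = reader.getSchema();
}  // close() releases the file handle that ORC 1.5.6 keeps in the Reader
{code}

When {{reader.rows()}} is called, ownership of the handle moves to the RecordReader, so closing the RecordReader is enough in that path; only Readers that never produce a RecordReader need this explicit close.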