[ https://issues.apache.org/jira/browse/SPARK-25126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16581578#comment-16581578 ]
Apache Spark commented on SPARK-25126:
--------------------------------------

User 'raofu' has created a pull request for this issue:
https://github.com/apache/spark/pull/22113

> OrcFileOperator.getFileReader: avoid creating OrcFile.Reader for all orc files
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-25126
>                 URL: https://issues.apache.org/jira/browse/SPARK-25126
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.3.1
>            Reporter: Rao Fu
>            Priority: Major
>
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L73
> When the `basePath` passed to getFileReader is a directory, an OrcFile.Reader is
> created for every file under the directory, even though only the first reader with a
> non-empty schema is returned. This consumes a lot of memory when the directory
> contains many files, because each ORC file's metadata is loaded into memory when
> its Reader is created.
> I tried the following workaround and the OOM issue went away:
> 1) Create a Dataset<Row> from a single ORC file:
> Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);
> 2) When creating the Dataset<Row> from all files under the directory, reuse the
> schema from the previous Dataset so Spark does not open a Reader per file for
> schema inference:
> Dataset<Row> rows =
>     spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);
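For reference, here is a minimal, self-contained Java sketch of the workaround
quoted above. Only the two spark.read() calls come from the report; the
SparkSession setup and the paths oneFile and path are placeholder assumptions
added to make the example runnable, not part of the original issue.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class OrcSchemaWorkaround {
        public static void main(String[] args) {
            // local[*] master is an assumption for a standalone run.
            SparkSession spark = SparkSession.builder()
                .appName("OrcSchemaWorkaround")
                .master("local[*]")
                .getOrCreate();

            // Placeholder paths: one representative ORC file, and the
            // directory that contains all of the ORC files.
            String oneFile = "/data/orc/part-00000.orc";
            String path = "/data/orc";

            // Step 1: infer the schema from a single file, so only one
            // OrcFile.Reader is created during schema inference.
            Dataset<Row> rowsForFirstFile =
                spark.read().format("orc").load(oneFile);

            // Step 2: pass that schema explicitly when loading the whole
            // directory, so Spark skips per-file schema inference.
            Dataset<Row> rows = spark.read()
                .schema(rowsForFirstFile.schema())
                .format("orc")
                .load(path);

            rows.printSchema();
            spark.stop();
        }
    }

This only avoids the OOM during schema inference; all files are still read
when the resulting Dataset is actually consumed.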