[ https://issues.apache.org/jira/browse/SPARK-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15901769#comment-15901769 ]

Owen O'Malley commented on SPARK-15474:
---------------------------------------

OK, Hive's use is fine because it gets the schema from the metastore; the 
schema stored in the file only matters for schema evolution, which isn't 
relevant if there are no rows.

In fact, it gets worse in newer versions of Hive, where OrcOutputFormat will 
write 0-byte files and OrcInputFormat will ignore 0-byte files when reading. 
(The reason the files are needed at all is an interesting bit of Hive 
history, but it isn't relevant here.)

The real fix is for Spark to use the OrcFile.createWriter(...) API to write 
the files rather than Hive's OrcOutputFormat. The OrcFile API lets the caller 
set the schema directly.
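
A minimal sketch of that approach (the output path and schema here are 
hypothetical, and this assumes the org.apache.orc core API rather than the 
Hive fork):
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{OrcFile, TypeDescription}

// Hypothetical output path and schema, for illustration only.
val conf = new Configuration()
val schema = TypeDescription.fromString("struct<id:bigint>")

// The writer takes the schema up front, so the file footer carries it
// even if no row batches are ever added.
val writer = OrcFile.createWriter(
  new Path("/tmp/empty.orc"),
  OrcFile.writerOptions(conf).setSchema(schema))
writer.close()  // zero rows, but still a readable file with a schema
{code}
Closing the writer without adding any rows still produces a valid ORC file 
whose footer records the schema, which is exactly what the empty dataframe 
case needs.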

>  ORC data source fails to write and read back empty dataframe
> -------------------------------------------------------------
>
>                 Key: SPARK-15474
>                 URL: https://issues.apache.org/jira/browse/SPARK-15474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Hyukjin Kwon
>
> Currently the ORC data source fails to write an empty dataframe and read it 
> back.
> The code below:
> {code}
> // `path` is a java.io.File for a temporary directory
> // (created via SQLTestUtils.withTempPath, per the stack trace below)
> val emptyDf = spark.range(10).limit(0)
> emptyDf.write
>   .format("orc")
>   .save(path.getCanonicalPath)
> val copyEmptyDf = spark.read
>   .format("orc")
>   .load(path.getCanonicalPath)
> copyEmptyDf.show()
> {code}
> throws the exception below:
> {code}
> Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC at 
> /private/var/folders/9j/gf_c342d7d150mwrxvkqnc180000gn/T/spark-5b7aa45b-a37d-43e9-975e-a15b36b370da.
>  It must be specified manually;
>       at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>       at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$16.apply(DataSource.scala:352)
>       at scala.Option.getOrElse(Option.scala:121)
>       at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:351)
>       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:130)
>       at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:140)
>       at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:892)
>       at 
> org.apache.spark.sql.sources.HadoopFsRelationTest$$anonfun$32$$anonfun$apply$mcV$sp$47.apply(HadoopFsRelationTest.scala:884)
>       at 
> org.apache.spark.sql.test.SQLTestUtils$class.withTempPath(SQLTestUtils.scala:114)
> {code}
> Note that this is a different case from the one below:
> {code}
> val emptyDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> {code}
> In this case, no writer is initialised or created at all (there are no 
> calls to {{WriterContainer.writeRows()}}).
> For Parquet and JSON, the round trip works, but for ORC it does not, as the 
> contrast below shows.
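> As a minimal contrast (with a hypothetical {{parquetPath}} temporary 
> directory), the same round trip succeeds when the format is Parquet:
> {code}
> emptyDf.write
>   .format("parquet")
>   .save(parquetPath)
> spark.read
>   .format("parquet")
>   .load(parquetPath)
>   .show()   // prints an empty result; the schema comes from the file footer
> {code}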


