[ https://issues.apache.org/jira/browse/SPARK-15393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310001#comment-15310001 ]

Jie Huang commented on SPARK-15393:
-----------------------------------

For example, in Hive, if we point a table at a location that contains no Parquet files, nothing is loaded into the table, even though we have declared the table as stored as Parquet. There is no warning or exception when we run a projection or other queries against that table.
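To make that concrete, here is a minimal sketch of the scenario (the table name, location, and session setup below are illustrative only, and it assumes a Spark build with Hive support):

{code}
# Illustrative PySpark sketch: an external Parquet table whose LOCATION
# holds no Parquet files simply returns zero rows, with no warning or error.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("empty-parquet-location-demo")   # illustrative app name
         .enableHiveSupport()                      # assumes Hive support is available
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS empty_demo (id INT, name STRING)
    STORED AS PARQUET
    LOCATION '/tmp/empty_parquet_dir'
""")

# The directory above is assumed to contain no Parquet files.
# No error and no warning -- the query just returns an empty result.
spark.sql("SELECT name FROM empty_demo WHERE id > 0").show()
{code}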
> Writing empty Dataframes doesn't save any _metadata files
> ---------------------------------------------------------
>
>                 Key: SPARK-15393
>                 URL: https://issues.apache.org/jira/browse/SPARK-15393
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Jurriaan Pruis
>            Priority: Critical
>
> Writing empty dataframes is broken on latest master.
> It omits the metadata and sometimes throws the following exception (when saving as parquet):
> {code}
> 8-May-2016 22:37:14 WARNING: org.apache.parquet.hadoop.ParquetOutputCommitter: could not write summary file for file:/some/test/file
> java.lang.NullPointerException
> 	at org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> 	at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> 	at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> 	at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:220)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:144)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:115)
> 	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:115)
> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> 	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> 	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> 	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> 	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> 	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> 	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> 	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> 	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
> 	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
> 	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
> 	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> 	at py4j.Gateway.invoke(Gateway.java:280)
> 	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> 	at py4j.commands.CallCommand.execute(CallCommand.java:79)
> 	at py4j.GatewayConnection.run(GatewayConnection.java:211)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> It only saves an _SUCCESS file (which is also incorrect behaviour, because it raised an exception).
> This means that loading it again will result in the following error:
> {code}
> Unable to infer schema for ParquetFormat at /some/test/file. It must be specified manually;'
> {code}
> It looks like this problem was introduced in https://github.com/apache/spark/pull/12855 (SPARK-10216).
> After reverting those changes I could save the empty dataframe as parquet and load it again without Spark complaining or throwing any exceptions.
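For reference, here is a minimal PySpark sketch of the round trip described in the report (the path and schema below are illustrative only, not taken from the report):

{code}
# Illustrative sketch: write an empty DataFrame as Parquet, then read it back.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType

spark = SparkSession.builder.appName("empty-dataframe-repro").getOrCreate()

schema = StructType([StructField("id", LongType(), True)])  # illustrative schema
empty_df = spark.createDataFrame([], schema)                 # DataFrame with zero rows

# On the affected builds this leaves only an _SUCCESS file in the output
# directory (and may log the NullPointerException from ParquetOutputCommitter
# quoted above).
empty_df.write.mode("overwrite").parquet("/tmp/empty_df_repro")

# On the affected builds this then fails with:
#   Unable to infer schema for ParquetFormat at /tmp/empty_df_repro.
#   It must be specified manually;
spark.read.parquet("/tmp/empty_df_repro").show()
{code}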