GitHub user dilipbiswal opened a pull request:

    https://github.com/apache/spark/pull/20579

    [SPARK-23372][SQL] Writing empty struct in parquet fails during execution. It should fail earlier in the processing.

    ## What changes were proposed in this pull request?
    Running
    spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
    results in the following exception at execution time:
    ```
    org.apache.parquet.schema.InvalidSchemaException: Cannot write a schema with an empty group: message spark_schema {
    }

        at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:27)
        at org.apache.parquet.schema.TypeUtil$1.visit(TypeUtil.java:37)
        at org.apache.parquet.schema.MessageType.accept(MessageType.java:58)
        at org.apache.parquet.schema.TypeUtil.checkValidWriteSchema(TypeUtil.java:23)
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:225)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:342)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:302)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:151)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:376)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:387)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:278)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:276)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:281)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:206)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:205)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.
    ```
    
    This PR addresses two things:
    1) The case above now fails earlier, during the write preparation phase, instead of during task execution.
    2) Writing an empty data frame in ORC succeeds, but reading it back fails while inferring the schema.
        This PR addresses that issue as well.
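    
    To illustrate the kind of early check described above, here is a minimal sketch, not the code in this patch. It assumes Spark's `org.apache.spark.sql.types` classes are on the classpath. The object `EmptySchemaCheck` and the methods `containsEmptyStruct` and `requireNonEmptySchema` are hypothetical names introduced for illustration, and `IllegalArgumentException` is an assumption standing in for whatever exception the patch actually raises.

    ```scala
    import org.apache.spark.sql.types._

    // Hypothetical helper, not part of Spark: detects schemas that contain
    // a struct with no fields, at any nesting depth.
    object EmptySchemaCheck {

      // True if the type is an empty struct or contains one inside a
      // struct field, an array element, or a map key/value.
      def containsEmptyStruct(dt: DataType): Boolean = dt match {
        case s: StructType =>
          s.fields.isEmpty || s.fields.exists(f => containsEmptyStruct(f.dataType))
        case ArrayType(elementType, _) =>
          containsEmptyStruct(elementType)
        case MapType(keyType, valueType, _) =>
          containsEmptyStruct(keyType) || containsEmptyStruct(valueType)
        case _ =>
          false
      }

      // Intended to run on the driver before any write task is launched.
      // The exception type and message are assumptions for this sketch.
      def requireNonEmptySchema(schema: StructType, format: String): Unit = {
        if (containsEmptyStruct(schema)) {
          throw new IllegalArgumentException(
            s"Cannot write a schema with an empty struct using format '$format'")
        }
      }
    }
    ```

    Under these assumptions, calling `EmptySchemaCheck.requireNonEmptySchema(spark.emptyDataFrame.schema, "parquet")` would throw on the driver, since `spark.emptyDataFrame` has a schema with no fields, rather than surfacing as an `InvalidSchemaException` inside an executor task.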
    
    ## How was this patch tested?
    
    Unit tests added in FileBasedDatasourceSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark spark-23372

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20579.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20579
    
----
commit 9f7a1705960250cf6a828787f0f12a9f28b608c5
Author: Dilip Biswal <dbiswal@...>
Date:   2018-02-11T17:09:07Z

    [SPARK-23372] Writing empty struct in parquet fails during execution. It should fail earlier in the processing

----

