[ https://issues.apache.org/jira/browse/SPARK-35592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359170#comment-17359170 ]

Apache Spark commented on SPARK-35592:
--------------------------------------

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/32818

> Spark creates only _SUCCESS file after empty DataFrame is saved as parquet for partitioned data
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35592
>                 URL: https://issues.apache.org/jira/browse/SPARK-35592
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: VijayBhakuni
>            Priority: Minor
>
> Whenever an empty DataFrame is saved as parquet with partitions, the target directory contains only a _SUCCESS file.
> Assume the DataFrame has three columns:
>  some_column_1, some_column_2, some_partition_column_1
> and the target location for the DataFrame is /user/spark/df_name.
> *Current Result*: /user/spark/df_name/_SUCCESS
> *Expected Result*:
> /user/spark/df_name/some_partition_column_1=_HIVE_DEFAULT_PARTITION_/<some_spark_generated_file_name>.snappy.parquet
> where that parquet file carries the schema of the data.
> This ensures that any job reading the data does not fail with:
>  Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
> *Steps to reproduce (Scala)*:
>  
> {code:scala}
> import org.apache.spark.sql.SaveMode
> import spark.implicits._
>
> // create an empty DataFrame with a schema
> val inputDF = Seq(
>   ("value1", "value2", "partition1"),
>   ("value3", "value4", "partition2"))
>   .toDF("some_column_1", "some_column_2", "some_partition_column_1")
>   .where("1 == 2")
>
> // write the DataFrame into partitions
> inputDF.write
>   .partitionBy("some_partition_column_1")
>   .mode(SaveMode.Overwrite)
>   .parquet("/user/spark/df_name")
>
> // read the DataFrame back
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema
> // for Parquet. It must be specified manually.
> val readDF = spark.read.parquet("/user/spark/df_name")
> {code}
>  
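> Until this is fixed, a reader-side workaround is to supply the schema explicitly so Spark never attempts inference from the (nonexistent) parquet footers. A minimal sketch, assuming the explicit schema below matches what the writer intended:
> {code:scala}
> import org.apache.spark.sql.types._
>
> // Declare the expected schema by hand; with an explicit schema Spark skips
> // footer-based inference, so the empty directory should read back as an
> // empty DataFrame instead of raising AnalysisException.
> val expectedSchema = StructType(Seq(
>   StructField("some_column_1", StringType),
>   StructField("some_column_2", StringType),
>   StructField("some_partition_column_1", StringType)))
>
> val readDF = spark.read.schema(expectedSchema).parquet("/user/spark/df_name")
> {code}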
> A similar issue was reported in the Jira ticket below, but it covered only non-partitioned data:
> https://issues.apache.org/jira/browse/SPARK-23271
> We need a similar implementation for partitioned targets as well.
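> On the writer side, an interim guard is also possible. A sketch, not the proposed fix, assuming Dataset.isEmpty (available since 2.4.0) and relying on SPARK-23271 having made the non-partitioned path emit a schema-bearing file:
> {code:scala}
> // Hypothetical guard around the write from the repro above: an empty result
> // is written without partitionBy, so the non-partitioned code path (fixed by
> // SPARK-23271) still produces a parquet file carrying the schema. The
> // partition column then stays in the data as a regular column.
> if (inputDF.isEmpty) {
>   inputDF.write
>     .mode(SaveMode.Overwrite)
>     .parquet("/user/spark/df_name")
> } else {
>   inputDF.write
>     .partitionBy("some_partition_column_1")
>     .mode(SaveMode.Overwrite)
>     .parquet("/user/spark/df_name")
> }
> {code}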
>  


