[ https://issues.apache.org/jira/browse/SPARK-35592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359170#comment-17359170 ]
Apache Spark commented on SPARK-35592:
--------------------------------------

User 'cfmcgrady' has created a pull request for this issue:
https://github.com/apache/spark/pull/32818

> Spark creates only a _SUCCESS file after an empty DataFrame is saved as Parquet for partitioned data
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35592
>                 URL: https://issues.apache.org/jira/browse/SPARK-35592
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: VijayBhakuni
>            Priority: Minor
>
> Whenever an empty DataFrame is saved as a Parquet file with partitions, the target directory contains only a _SUCCESS file.
> Assume the DataFrame has three columns:
> some_column_1, some_column_2, some_partition_column_1
> and the target location for the DataFrame is /user/spark/df_name.
> *Current result*: /user/spark/df_name/_SUCCESS
> *Expected result*: /user/spark/df_name/some_partition_column_1=__HIVE_DEFAULT_PARTITION__/<some_spark_generated_file_name>.snappy.parquet
> where that Parquet file carries the schema for the data.
> This makes sure that any job reading the data does not fail with:
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
> *Steps to reproduce (Scala)*:
> {code:java}
> import org.apache.spark.sql.SaveMode
> import spark.implicits._  // for .toDF; assumes a SparkSession named `spark` (e.g. spark-shell)
>
> // create an empty DataFrame that still carries a schema
> val inputDF = Seq(
>     ("value1", "value2", "partition1"),
>     ("value3", "value4", "partition2"))
>   .toDF("some_column_1", "some_column_2", "some_partition_column_1")
>   .where("1 == 2")  // always-false predicate, so zero rows remain
>
> // write the DataFrame into partitions
> inputDF.write
>   .partitionBy("some_partition_column_1")
>   .mode(SaveMode.Overwrite)
>   .parquet("/user/spark/df_name")
>
> // read the DataFrame back
> // Exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.
> val readDF = spark.read.parquet("/user/spark/df_name")
> {code}
> A similar issue was filed in the Jira ticket below, but it covered only non-partitioned data:
> https://issues.apache.org/jira/browse/SPARK-23271
> We need a similar implementation for the partitioned target as well.
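>
> Until a fix lands, a reader-side workaround is to declare the schema by hand, which is what the AnalysisException asks for. The sketch below assumes the reading job knows the writer's schema; the `expectedSchema` value is hypothetical and reuses the column names from the reproduction above.
> {code:java}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>
> // Declare the expected schema by hand, since no data file exists to infer it from.
> val expectedSchema = StructType(Seq(
>   StructField("some_column_1", StringType),
>   StructField("some_column_2", StringType),
>   StructField("some_partition_column_1", StringType)))
>
> // .schema(...) bypasses Parquet schema inference, so reading the directory
> // that holds only _SUCCESS returns an empty DataFrame instead of throwing.
> val readDF = spark.read.schema(expectedSchema).parquet("/user/spark/df_name")
> {code}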