[GitHub] spark pull request #20525: SPARK-23271 Parquet output contains only _SUCCESS...
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20525#discussion_r166540368

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala ---
```diff
@@ -32,6 +33,24 @@ class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
     }
   }

+  test("SPARK-23271 empty dataframe when saved in parquet should write a metadata only file") {
+    withTempDir { inputPath =>
+      withTempPath { outputPath =>
+        val anySchema = StructType(StructField("anyName", StringType) :: Nil)
+        val df = spark.read.schema(anySchema).csv(inputPath.toString)
+        df.write.parquet(outputPath.toString)
+        val partFiles = outputPath.listFiles()
+          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+        assert(partFiles.length === 1)
+
+        // Now read the file.
+        val  df1 = spark.read.parquet(outputPath.toString)
```
--- End diff --

Nit: extra space before `df1`

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20525: SPARK-23271 Parquet output contains only _SUCCESS...
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20525#discussion_r166540285

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
```diff
@@ -301,7 +301,6 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     intercept[AnalysisException] {
       spark.range(10).write.format("csv").mode("overwrite").partitionBy("id").save(path)
     }
-    spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
   }
```
--- End diff --

It's not legal to write an empty struct (a schema with zero fields) to Parquet; Herman explains why in [SPARK-20593](https://issues.apache.org/jira/browse/SPARK-20593). Previously we didn't set up a write task for this case, whereas with this fix we do.
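For context, a minimal spark-shell sketch of the case the removed test line exercised (assumes a Spark 2.x session bound to `spark`; the output path is illustrative):

```scala
// Sketch only (hypothetical spark-shell session; `spark` is a SparkSession).
// spark.emptyDataFrame carries a schema with zero fields, which Parquet
// cannot represent, so the save below is expected to fail once a write
// task actually runs; see SPARK-20593 for the details. The exact
// exception raised depends on the Spark version.
spark.emptyDataFrame.write.format("parquet").mode("overwrite").save("/tmp/empty_out")
```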
[GitHub] spark pull request #20525: SPARK-23271 Parquet output contains only _SUCCESS...
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/20525

SPARK-23271 Parquet output contains only _SUCCESS file after writing an empty dataframe

## What changes were proposed in this pull request?

Below are the two cases.

Case 1:
```scala
scala> List.empty[String].toDF().rdd.partitions.length
res18: Int = 1
```
When we write the above dataframe as parquet, we create a parquet file containing just the schema of the dataframe.

Case 2:
```scala
scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
anySchema: org.apache.spark.sql.types.StructType = StructType(StructField(anyName,StringType,false))

scala> spark.read.schema(anySchema).csv("/tmp/empty_folder").rdd.partitions.length
res22: Int = 0
```
In the second case, since the number of partitions is 0, the write task is never invoked (the task contains the logic that creates the empty, metadata-only parquet file).

The fix repartitions the empty RDD to one partition before setting up the write job.

## How was this patch tested?

A new test is added to DataFrameReaderWriterSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark spark-23271

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20525.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20525

commit 2764b1c0aa43a104393da909f388861209220d4f
Author: Dilip Biswal
Date: 2018-02-07T07:45:32Z

    SPARK-23271 Parquet output contains only _SUCCESS file after writing an empty dataframe
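The approach described in the pull request description can be sketched roughly as follows. This is an illustrative simplification, not the actual patch; the name `queryRdd` and the placement of the check are assumptions:

```scala
// Illustrative sketch only, not the real Spark code. Before setting up
// the write job, ensure that even an empty dataframe yields at least one
// partition, so the write task that emits the metadata-only parquet file
// is actually scheduled.
val rddToWrite =
  if (queryRdd.partitions.isEmpty) queryRdd.repartition(1) else queryRdd
// ... proceed to set up the write job over rddToWrite ...
```

With zero partitions, Spark schedules zero tasks, so nothing ever runs the per-task code path that writes the schema-only file; forcing a single (empty) partition guarantees one task runs and writes it.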