[GitHub] spark pull request #20525: SPARK-23271 Parquet output contains only _SUCCESS...
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20525#discussion_r166540368

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatWriterSuite.scala ---
```diff
@@ -32,6 +33,24 @@ class FileFormatWriterSuite extends QueryTest with SharedSQLContext {
     }
   }

+  test("SPARK-23271 empty dataframe when saved in parquet should write a metadata only file") {
+    withTempDir { inputPath =>
+      withTempPath { outputPath =>
+        val anySchema = StructType(StructField("anyName", StringType) :: Nil)
+        val df = spark.read.schema(anySchema).csv(inputPath.toString)
+        df.write.parquet(outputPath.toString)
+        val partFiles = outputPath.listFiles()
+          .filter(f => f.isFile && !f.getName.startsWith(".") && !f.getName.startsWith("_"))
+        assert(partFiles.length === 1)
+
+        // Now read the file.
+        val  df1 = spark.read.parquet(outputPath.toString)
```
--- End diff --

Nit: extra space before `df1`

---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20525: SPARK-23271 Parquet output contains only _SUCCESS...
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/20525#discussion_r166540285

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala ---
```diff
@@ -301,7 +301,6 @@ class DataFrameReaderWriterSuite extends QueryTest with SharedSQLContext with Be
     intercept[AnalysisException] {
       spark.range(10).write.format("csv").mode("overwrite").partitionBy("id").save(path)
     }
-    spark.emptyDataFrame.write.format("parquet").mode("overwrite").save(path)
   }
```
--- End diff --

It's not legal to write an empty struct (a schema with zero fields) to Parquet; Herman explains why in [SPARK-20593](https://issues.apache.org/jira/browse/SPARK-20593). Previously we didn't set up a write task for this case, whereas with this fix we do.
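For context, a minimal spark-shell sketch of the case the removed test line exercised (assumes a Spark 2.x session bound to `spark`; the output path is illustrative):

```scala
// Sketch only (hypothetical spark-shell session; `spark` is a SparkSession).
// spark.emptyDataFrame carries a schema with zero fields, which Parquet
// cannot represent, so the save below is expected to fail once a write
// task actually runs; see SPARK-20593 for the details. The exact
// exception raised depends on the Spark version.
spark.emptyDataFrame.write.format("parquet").mode("overwrite").save("/tmp/empty_out")
```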
[GitHub] spark pull request #20525: SPARK-23271 Parquet output contains only _SUCCESS...
GitHub user dilipbiswal opened a pull request:

https://github.com/apache/spark/pull/20525

SPARK-23271 Parquet output contains only _SUCCESS file after writing an empty dataframe

## What changes were proposed in this pull request?

Below are the two cases.

Case 1:
```scala
scala> List.empty[String].toDF().rdd.partitions.length
res18: Int = 1
```
When we write the above dataframe as parquet, we create a parquet file containing just the schema of the dataframe.

Case 2:
```scala
scala> val anySchema = StructType(StructField("anyName", StringType, nullable = false) :: Nil)
anySchema: org.apache.spark.sql.types.StructType = StructType(StructField(anyName,StringType,false))

scala> spark.read.schema(anySchema).csv("/tmp/empty_folder").rdd.partitions.length
res22: Int = 0
```
In the second case, since the number of partitions is 0, the write task is never invoked (the task contains the logic that creates the empty, metadata-only parquet file).

The fix repartitions the empty RDD to one partition before setting up the write job.

## How was this patch tested?

A new test is added to DataFrameReaderWriterSuite.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dilipbiswal/spark spark-23271

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20525.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20525

commit 2764b1c0aa43a104393da909f388861209220d4f
Author: Dilip Biswal
Date: 2018-02-07T07:45:32Z

    SPARK-23271 Parquet output contains only _SUCCESS file after writing an empty dataframe
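The approach described in the pull request description can be sketched roughly as follows. This is an illustrative simplification, not the actual patch; the name `queryRdd` and the placement of the check are assumptions:

```scala
// Illustrative sketch only, not the real Spark code. Before setting up
// the write job, ensure that even an empty dataframe yields at least one
// partition, so the write task that emits the metadata-only parquet file
// is actually scheduled.
val rddToWrite =
  if (queryRdd.partitions.isEmpty) queryRdd.repartition(1) else queryRdd
// ... proceed to set up the write job over rddToWrite ...
```

With zero partitions, Spark schedules zero tasks, so nothing ever runs the per-task code path that writes the schema-only file; forcing a single (empty) partition guarantees one task runs and writes it.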