Repository: spark
Updated Branches:
  refs/heads/branch-2.0 77c08d224 -> eb9e8fc09


[SPARK-15804][SQL] Include metadata in the toStructType

## What changes were proposed in this pull request?
The helper function `toStructType` in the `AttributeSeq` class does not include the 
metadata when it builds the `StructField`, which causes the problem reported at 
https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK 
when Spark writes a DataFrame with metadata to the Parquet data source.

The code path: when Spark writes a DataFrame to the Parquet data source 
through `InsertIntoHadoopFsRelationCommand`, it builds the `WriteRelation` 
container and calls the helper function `toStructType` to create the 
`StructType` that contains the `StructField`s. The metadata should be included 
there; otherwise the user-provided metadata is lost.
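The fix is a one-argument change. A minimal, Spark-free sketch (the `Metadata`, `Attribute`, `StructField`, and `StructType` classes below are simplified stand-ins for illustration, not the real Catalyst types) shows why omitting the fourth `StructField` argument silently drops user metadata:

```scala
// Simplified stand-ins for the Catalyst types involved (not the real Spark classes).
case class Metadata(entries: Map[String, String] = Map.empty)
case class Attribute(name: String, dataType: String, nullable: Boolean, metadata: Metadata)
case class StructField(name: String, dataType: String, nullable: Boolean,
                       metadata: Metadata = Metadata())
case class StructType(fields: Seq[StructField])

// Before the fix: the metadata argument is omitted, so StructField
// falls back to its empty default and the user's metadata is lost.
def toStructTypeOld(attrs: Seq[Attribute]): StructType =
  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))

// After the fix: metadata is carried through from each Attribute.
def toStructTypeFixed(attrs: Seq[Attribute]): StructType =
  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))

val attrs = Seq(Attribute("b", "string", true, Metadata(Map("key" -> "value"))))
assert(toStructTypeOld(attrs).fields.head.metadata.entries.isEmpty)
assert(toStructTypeFixed(attrs).fields.head.metadata.entries("key") == "value")
```

Because the schema built here is what `WriteRelation` hands to the Parquet writer, any metadata dropped at this point never reaches the file.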

## How was this patch tested?

Added a test case in ParquetQuerySuite.scala.


Author: Kevin Yu <q...@us.ibm.com>

Closes #13555 from kevinyu98/spark-15804.

(cherry picked from commit 99386fe3989f758844de14b2c28eccfdf8163221)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eb9e8fc0
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eb9e8fc0
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eb9e8fc0

Branch: refs/heads/branch-2.0
Commit: eb9e8fc097384dbe0d2cb83ca5b80968e3539c78
Parents: 77c08d2
Author: Kevin Yu <q...@us.ibm.com>
Authored: Thu Jun 9 09:50:09 2016 -0700
Committer: Wenchen Fan <wenc...@databricks.com>
Committed: Thu Jun 9 09:50:19 2016 -0700

----------------------------------------------------------------------
 .../spark/sql/catalyst/expressions/package.scala     |  2 +-
 .../datasources/parquet/ParquetQuerySuite.scala      | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/eb9e8fc0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
----------------------------------------------------------------------
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
index 81f5bb4..a6125c6 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala
@@ -91,7 +91,7 @@ package object expressions  {
   implicit class AttributeSeq(val attrs: Seq[Attribute]) extends Serializable {
     /** Creates a StructType with a schema matching this `Seq[Attribute]`. */
     def toStructType: StructType = {
-      StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable)))
+      StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
     }
 
     // It's possible that `attrs` is a linked list, which can lead to bad O(n^2) loops when

http://git-wip-us.apache.org/repos/asf/spark/blob/eb9e8fc0/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
index 78b97f6..ea57f71 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = dir.getCanonicalPath
+      dfWithmeta.write.parquet(path)
+
+      readParquetFile(path) { df =>
+        assert(df.schema.last.metadata.getString("key") == "value")
+      }
+    }
+  }
 }
 
 object TestingUDT {

