[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13555


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66383008
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}"
--- End diff --

Done, thanks very much.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66381875
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}"
--- End diff --

nit: just `dir.getCanonicalPath`, no need to wrap it inside `""`





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66366855
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
+      dfWithmeta.write.parquet(path)
+
+      readParquetFile(path) { df =>
+        assert(df.schema.json.contains("\"key\":\"value\""))
--- End diff --

@cloud-fan Thanks for your comments; I have changed the test case.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66366487
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
--- End diff --

ok, I will make the change





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66366470
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
+      dfWithmeta.write.parquet(path)
+
+      readParquetFile(path) { df =>
+        assert(df.schema.json.contains("\"key\":\"value\""))
--- End diff --

sure, I will do that





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66346568
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
+      dfWithmeta.write.parquet(path)
+
+      readParquetFile(path) { df =>
+        assert(df.schema.json.contains("\"key\":\"value\""))
--- End diff --

can we check this more explicitly? e.g. 
`df.schema.last.metadata.getString("key") == "value"`
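For reference, a minimal sketch of how that more explicit assertion might sit inside the test, reusing the `readParquetFile` helper and `path` from the diff above:

```scala
      readParquetFile(path) { df =>
        // Read the metadata back from the last field of the round-tripped schema
        // instead of matching the schema's JSON string.
        assert(df.schema.last.metadata.getString("key") == "value")
      }
```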





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66346435
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,21 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select('a, 'b.as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
--- End diff --

I think `dir.getCanonicalPath` is good here.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66265069
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select(Column("a"), Column("b").as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
+      dfWithmeta.write.parquet(path)
+
+      readParquetFile(path) { dfwithmeta2 =>
--- End diff --

ok.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66264754
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select(Column("a"), Column("b").as("b", md))
--- End diff --

I will change it. Thanks.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66264631
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")
--- End diff --

sure, I will do that.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66235235
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select(Column("a"), Column("b").as("b", md))
--- End diff --

Use `df.select('a, 'b.as("b", md))` instead. It's more compact and 
"modern". Our tests should promote the latest API.

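A small illustration of the suggested style; it assumes the suite's SQL implicits (e.g. `import testImplicits._` from `SharedSQLContext`) are in scope to provide the symbol-to-Column conversion:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Inside the suite, testImplicits supplies .toDF and the 'a / 'b Column syntax.
val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
val md = new MetadataBuilder().putString("key", "value").build()
// Column.as(alias, metadata) attaches md to the resulting column "b".
val dfWithmeta = df.select('a, 'b.as("b", md))
```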




[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66235290
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")
+    val md = new MetadataBuilder().putString("key", "value").build()
+    val dfWithmeta = df.select(Column("a"), Column("b").as("b", md))
+
+    withTempPath { dir =>
+      val path = s"${dir.getCanonicalPath}/data"
+      dfWithmeta.write.parquet(path)
+
+      readParquetFile(path) { dfwithmeta2 =>
--- End diff --

Replace `dfwithmeta2` with `df`





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/13555#discussion_r66234892
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala ---
@@ -625,6 +625,22 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext
       }
     }
   }
+
+  test("SPARK-15804: write out the metadata to parquet file") {
+    val data = (1, "abc") ::(2, "helloabcde") :: Nil
+    val df = spark.createDataFrame(data).toDF("a", "b")
--- End diff --

Merge lines 630 and 631 into `Seq((1,"abc"),(2,"hello")).toDF("a", "b")` instead.





[GitHub] spark pull request #13555: [SPARK-15804][SQL]Include metadata in the toStruc...

2016-06-08 Thread kevinyu98
GitHub user kevinyu98 opened a pull request:

https://github.com/apache/spark/pull/13555

[SPARK-15804][SQL] Include metadata in the toStructType

## What changes were proposed in this pull request?
The helper function `toStructType` in the `AttributeSeq` class doesn't include
the metadata when it builds each `StructField`, which causes the problem reported in
https://issues.apache.org/jira/browse/SPARK-15804?jql=project%20%3D%20SPARK
when Spark writes a DataFrame that carries metadata to the Parquet data source.

The code path: when Spark writes the DataFrame to the Parquet data source
through `InsertIntoHadoopFsRelationCommand`, it builds the `WriteRelation`
container and calls the helper function `toStructType` to create the `StructType`
that holds the `StructField`s. The metadata should be included there;
otherwise, the user-provided metadata is lost.
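For context, a hedged sketch of the kind of change this describes (the exact patch is in the PR diff); it assumes the `AttributeSeq` wrapper exposes the underlying attributes as `attrs`:

```scala
// Sketch only: carry each attribute's metadata into the corresponding
// StructField instead of dropping it when building the schema.
def toStructType: StructType = {
  StructType(attrs.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
}
```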


## How was this patch tested?

Added a test case in ParquetQuerySuite.scala.
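For convenience, a sketch of what the added test looks like once the review feedback in this thread is applied (compact `Seq(...).toDF`, symbol-based `select`, plain `dir.getCanonicalPath`, and the explicit metadata assertion); the merged commit is authoritative:

```scala
  test("SPARK-15804: write out the metadata to parquet file") {
    val df = Seq((1, "abc"), (2, "hello")).toDF("a", "b")
    val md = new MetadataBuilder().putString("key", "value").build()
    val dfWithmeta = df.select('a, 'b.as("b", md))

    withTempPath { dir =>
      val path = dir.getCanonicalPath
      dfWithmeta.write.parquet(path)

      // The metadata attached to column "b" should survive the write/read round trip.
      readParquetFile(path) { df =>
        assert(df.schema.last.metadata.getString("key") == "value")
      }
    }
  }
```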





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevinyu98/spark spark-15804

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13555


commit 3b44c5978bd44db986621d3e8511e9165b66926b
Author: Kevin Yu 
Date:   2016-04-20T18:06:30Z

adding testcase

commit 18b4a31c687b264b50aa5f5a74455956911f738a
Author: Kevin Yu 
Date:   2016-04-22T21:48:00Z

Merge remote-tracking branch 'upstream/master'

commit 4f4d1c8f2801b1e662304ab2b33351173e71b427
Author: Kevin Yu 
Date:   2016-04-23T16:50:19Z

Merge remote-tracking branch 'upstream/master'
get latest code from upstream

commit f5f0cbed1eb5754c04c36933b374c3b3d2ae4f4e
Author: Kevin Yu 
Date:   2016-04-23T22:20:53Z

Merge remote-tracking branch 'upstream/master'
adding trim characters support

commit d8b2edbd13ee9a4f057bca7dcb0c0940e8e867b8
Author: Kevin Yu 
Date:   2016-04-25T20:24:33Z

Merge remote-tracking branch 'upstream/master'
get latest code for pr12646

commit 196b6c66b0d55232f427c860c0e7c6876c216a67
Author: Kevin Yu 
Date:   2016-04-25T23:45:57Z

Merge remote-tracking branch 'upstream/master'
merge latest code

commit f37a01e005f3e27ae2be056462d6eb6730933ba5
Author: Kevin Yu 
Date:   2016-04-27T14:15:06Z

Merge remote-tracking branch 'upstream/master'
merge upstream/master

commit bb5b01fd3abeea1b03315eccf26762fcc23f80c0
Author: Kevin Yu 
Date:   2016-04-30T23:49:31Z

Merge remote-tracking branch 'upstream/master'

commit bde5820a181cf84e0879038ad8c4cebac63c1e24
Author: Kevin Yu 
Date:   2016-05-04T03:52:31Z

Merge remote-tracking branch 'upstream/master'

commit 5f7cd96d495f065cd04e8e4cc58461843e45bc8d
Author: Kevin Yu 
Date:   2016-05-10T21:14:50Z

Merge remote-tracking branch 'upstream/master'

commit 893a49af0bfd153ccb59ba50b63a232660e0eada
Author: Kevin Yu 
Date:   2016-05-13T18:20:39Z

Merge remote-tracking branch 'upstream/master'

commit 4bbe1fd4a3ebd50338ccbe07dc5887fe289cd53d
Author: Kevin Yu 
Date:   2016-05-17T21:58:14Z

Merge remote-tracking branch 'upstream/master'

commit b2dd795e23c36cbbd022f07a10c0cf21c85eb421
Author: Kevin Yu 
Date:   2016-05-18T06:37:13Z

Merge remote-tracking branch 'upstream/master'

commit 8c3e5da458dbff397ed60fcb68f2a46d87ab7ba4
Author: Kevin Yu 
Date:   2016-05-18T16:18:16Z

Merge remote-tracking branch 'upstream/master'

commit a0eaa408e847fbdc3ac5b26348588ee0a1e276c7
Author: Kevin Yu 
Date:   2016-05-19T04:28:20Z

Merge remote-tracking branch 'upstream/master'

commit d03c940ed89795fa7fe1d1e9f511363b22cdf19d
Author: Kevin Yu 
Date:   2016-05-19T21:24:33Z

Merge remote-tracking branch 'upstream/master'

commit d728d5e002082e571ac47292226eb8b2614f479f
Author: Kevin Yu 
Date:   2016-05-24T20:32:57Z

Merge remote-tracking branch 'upstream/master'

commit ea104ddfbf7d180ed1bc53dd9a1005010264aa1f
Author: Kevin Yu 
Date:   2016-05-25T22:52:57Z

Merge remote-tracking branch 'upstream/master'

commit 6ab1215b781ad0cccf1752f3a625b4e4e371c38e
Author: Kevin Yu 
Date:   2016-05-27T17:18:46Z

Merge remote-tracking branch 'upstream/master'

commit 0c566533705331697eb1b287b30c8b16111f6fa2
Author: Kevin Yu 
Date:   2016-06-01T06:48:57Z

Merge remote-tracking branch 'upstream/master'

commit d7a187490b31185d0a803cbbdeda67cb26c40056
Author: