[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-29 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r51284828
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
 StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
 
+  def removeMetadata(key: String, dt: DataType): DataType =
+dt match {
+  case StructType(fields) =>
+val newFields = fields.map { f =>
+  val mb = new MetadataBuilder()
--- End diff --

This workaround may be taken out in the future.

However, use of MetadataBuilder occurs in many other places:
http://pastebin.com/nVjNfrgp

I feel adding clear() to MetadataBuilder would help both current and future 
use cases.
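For reference, a sketch of the recursive helper under discussion, showing where a hypothetical `clear()` on `MetadataBuilder` would slot in. `clear()` is not part of the current public API, and the `remove(key)` call on the builder follows the PR's diff; treat both as assumptions, not the merged code.

```scala
import org.apache.spark.sql.types._

// Sketch only. Today a fresh MetadataBuilder is created per field; a
// hypothetical MetadataBuilder.clear() would let one instance be hoisted
// above fields.map and reset on each iteration instead.
def removeMetadata(key: String, dt: DataType): DataType = dt match {
  case StructType(fields) =>
    val newFields = fields.map { f =>
      val mb = new MetadataBuilder()  // with clear(), this could move above the map
      f.copy(
        dataType = removeMetadata(key, f.dataType),  // recurse into nested structs
        metadata = mb.withMetadata(f.metadata).remove(key).build())
    }
    StructType(newFields)
  case _ => dt
}
```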


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r51235044
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
 StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
 
+  def removeMetadata(key: String, dt: DataType): DataType =
+dt match {
+  case StructType(fields) =>
+val newFields = fields.map { f =>
+  val mb = new MetadataBuilder()
--- End diff --

Agreed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-28 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r51232182
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
 StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
 
+  def removeMetadata(key: String, dt: DataType): DataType =
+dt match {
+  case StructType(fields) =>
+val newFields = fields.map { f =>
+  val mb = new MetadataBuilder()
--- End diff --

This PR is mostly a workaround for a parquet-mr bug (PARQUET-389), and I'd 
assume it will be fixed in the near future, at which point we can remove this 
workaround. So it doesn't seem worth modifying `MetadataBuilder`, which is 
part of the public API.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-28 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r51213703
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala ---
@@ -258,7 +258,12 @@ private[sql] class ParquetRelation(
 job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])
 
 ParquetOutputFormat.setWriteSupportClass(job, classOf[CatalystWriteSupport])
-CatalystWriteSupport.setSchema(dataSchema, conf)
+
+// We want to clear this temporary metadata from saving into Parquet file.
+// This metadata is only useful for detecting optional columns when pushdowning filters.
--- End diff --

nit: pushdowning -> pushing down





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-28 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r51213506
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala ---
@@ -359,6 +361,18 @@ object StructType extends AbstractDataType {
   protected[sql] def fromAttributes(attributes: Seq[Attribute]): StructType =
 StructType(attributes.map(a => StructField(a.name, a.dataType, a.nullable, a.metadata)))
 
+  def removeMetadata(key: String, dt: DataType): DataType =
+dt match {
+  case StructType(fields) =>
+val newFields = fields.map { f =>
+  val mb = new MetadataBuilder()
--- End diff --

If we add clear() to MetadataBuilder, the builder can be lifted above the 
fields.map; inside the map operation we would just clear it.

What do you think?





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9940





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-28 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-176495087
  
Thanks! I'm going to merge this to master.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-27 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-175927505
  
ping @liancheng Please see whether the latest updates work for you. Thanks.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-175567984
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/50190/
Test PASSed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-175567983
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-175567496
  
**[Test build #50190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50190/consoleFull)** for PR 9940 at commit [`1a11770`](https://github.com/apache/spark/commit/1a11770a19b8a6ec263ca3c34def5b03ec4483c9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-175524399
  
**[Test build #50190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50190/consoleFull)** for PR 9940 at commit [`1a11770`](https://github.com/apache/spark/commit/1a11770a19b8a6ec263ca3c34def5b03ec4483c9).





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-172972369
  
Overall LGTM except for several minor issues.

Another thing is that we probably want to use a more distinctive name 
(something like `_OPTIONAL_`) to avoid naming conflicts with user-defined 
metadata keys.
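One way to address this concern is to centralize the key in a single constant with an unlikely-to-collide spelling. Note `_OPTIONAL_` is only the name floated in this review, and `StructTypeKeys` is a hypothetical holder, not the merged constant:

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical reserved-key constant; the leading/trailing underscores make
// a clash with user-defined metadata keys unlikely.
object StructTypeKeys {
  val Optional = "_OPTIONAL_"
}

// Mark a field as optional using the reserved key.
val marked = new MetadataBuilder()
  .putBoolean(StructTypeKeys.Optional, true)
  .build()
```

Keeping the string in one place also means internal code never has to repeat the literal, so a later rename touches a single line.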





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50167918
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
 // If the "c = 1" filter gets pushed down, this query will throw an exception which
 // Parquet emits. This is a Parquet issue (PARQUET-389).
+val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
 checkAnswer(
-  sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+  df,
   (1 to 1).map(i => Row(i, i.toString, null)))
+
+// The fields "a" and "c" only exist in one Parquet file.
+df.schema.fields.foreach { f =>
+  if (f.name == "a" || f.name == "c") {
+assert(f.metadata.contains("optional"))
+  }
+}
+
+val pathThree = s"${dir.getCanonicalPath}/table3"
+df.write.parquet(pathThree)
+
+// We will remove the temporary metadata when writing Parquet file.
+sqlContext.read.parquet(pathThree).schema.fields.foreach { f =>
+  assert(!f.metadata.contains("optional"))
+}
+
+val pathFour = s"${dir.getCanonicalPath}/table4"
+val dfStruct = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
+dfStruct.select(struct("a").as("s")).write.parquet(pathFour)
+
+val pathFive = s"${dir.getCanonicalPath}/table5"
+val dfStruct2 = sparkContext.parallelize(Seq((1, 1))).toDF("c", "b")
+dfStruct2.select(struct("c").as("s")).write.parquet(pathFive)
+
+// If the "s.c = 1" filter gets pushed down, this query will throw an exception which
+// Parquet emits.
+val dfStruct3 = sqlContext.read.parquet(pathFour, pathFive).filter("s.c = 1")
+  .selectExpr("s.c", "s.a")
+checkAnswer(
+  dfStruct3,
+  (1 to 1).map(i => Row(i, null)))
+
+// The fields "s.a" and "s.c" only exist in one Parquet file.
+dfStruct3.schema.fields.foreach { f =>
+  if (f.name == "s.a" || f.name == "s.c") {
+assert(f.metadata.contains("optional"))
+  }
+}
+
+val pathSix = s"${dir.getCanonicalPath}/table6"
+dfStruct3.write.parquet(pathSix)
+
+// We will remove the temporary metadata when writing Parquet file.
+sqlContext.read.parquet(pathSix).schema.fields.foreach { f =>
+  assert(!f.metadata.contains("optional"))
+}
--- End diff --

Similar as the other one mentioned above.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50167888
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
 // If the "c = 1" filter gets pushed down, this query will throw an exception which
 // Parquet emits. This is a Parquet issue (PARQUET-389).
+val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
 checkAnswer(
-  sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+  df,
   (1 to 1).map(i => Row(i, i.toString, null)))
+
+// The fields "a" and "c" only exist in one Parquet file.
+df.schema.fields.foreach { f =>
+  if (f.name == "a" || f.name == "c") {
+assert(f.metadata.contains("optional"))
+  }
+}
+
+val pathThree = s"${dir.getCanonicalPath}/table3"
+df.write.parquet(pathThree)
+
+// We will remove the temporary metadata when writing Parquet file.
+sqlContext.read.parquet(pathThree).schema.fields.foreach { f =>
+  assert(!f.metadata.contains("optional"))
+}
+
+val pathFour = s"${dir.getCanonicalPath}/table4"
+val dfStruct = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
+dfStruct.select(struct("a").as("s")).write.parquet(pathFour)
+
+val pathFive = s"${dir.getCanonicalPath}/table5"
+val dfStruct2 = sparkContext.parallelize(Seq((1, 1))).toDF("c", "b")
+dfStruct2.select(struct("c").as("s")).write.parquet(pathFive)
+
+// If the "s.c = 1" filter gets pushed down, this query will throw an exception which
+// Parquet emits.
+val dfStruct3 = sqlContext.read.parquet(pathFour, pathFive).filter("s.c = 1")
+  .selectExpr("s.c", "s.a")
+checkAnswer(
+  dfStruct3,
+  (1 to 1).map(i => Row(i, null)))
+
+// The fields "s.a" and "s.c" only exist in one Parquet file.
+dfStruct3.schema.fields.foreach { f =>
+  if (f.name == "s.a" || f.name == "s.c") {
+assert(f.metadata.contains("optional"))
+  }
+}
--- End diff --

```scala
assert(df.schema("a").metadata.contains("optional"))
assert(df.schema("c").metadata.contains("optional"))
```





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50167793
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
 // If the "c = 1" filter gets pushed down, this query will throw an exception which
 // Parquet emits. This is a Parquet issue (PARQUET-389).
+val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
 checkAnswer(
-  sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+  df,
   (1 to 1).map(i => Row(i, i.toString, null)))
+
+// The fields "a" and "c" only exist in one Parquet file.
+df.schema.fields.foreach { f =>
+  if (f.name == "a" || f.name == "c") {
+assert(f.metadata.contains("optional"))
+  }
+}
+
+val pathThree = s"${dir.getCanonicalPath}/table3"
+df.write.parquet(pathThree)
+
+// We will remove the temporary metadata when writing Parquet file.
+sqlContext.read.parquet(pathThree).schema.fields.foreach { f =>
+  assert(!f.metadata.contains("optional"))
+}
+
+val pathFour = s"${dir.getCanonicalPath}/table4"
+val dfStruct = sparkContext.parallelize(Seq((1, 1))).toDF("a", "b")
+dfStruct.select(struct("a").as("s")).write.parquet(pathFour)
+
+val pathFive = s"${dir.getCanonicalPath}/table5"
+val dfStruct2 = sparkContext.parallelize(Seq((1, 1))).toDF("c", "b")
+dfStruct2.select(struct("c").as("s")).write.parquet(pathFive)
+
+// If the "s.c = 1" filter gets pushed down, this query will throw an exception which
+// Parquet emits.
+val dfStruct3 = sqlContext.read.parquet(pathFour, pathFive).filter("s.c = 1")
+  .selectExpr("s.c", "s.a")
+checkAnswer(
+  dfStruct3,
+  (1 to 1).map(i => Row(i, null)))
--- End diff --

```scala
checkAnswer(dfStruct3, Row(1, null))
```





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50167182
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
 // If the "c = 1" filter gets pushed down, this query will throw an exception which
 // Parquet emits. This is a Parquet issue (PARQUET-389).
+val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
 checkAnswer(
-  sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+  df,
   (1 to 1).map(i => Row(i, i.toString, null)))
+
+// The fields "a" and "c" only exist in one Parquet file.
+df.schema.fields.foreach { f =>
+  if (f.name == "a" || f.name == "c") {
+assert(f.metadata.contains("optional"))
+  }
+}
+
+val pathThree = s"${dir.getCanonicalPath}/table3"
+df.write.parquet(pathThree)
+
+// We will remove the temporary metadata when writing Parquet file.
+sqlContext.read.parquet(pathThree).schema.fields.foreach { f =>
+  assert(!f.metadata.contains("optional"))
+}
--- End diff --

```scala
assert(sqlContext.read.parquet(pathThree).schema.forall(!_.metadata.contains("optional")))
```





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50166920
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
 // If the "c = 1" filter gets pushed down, this query will throw an exception which
 // Parquet emits. This is a Parquet issue (PARQUET-389).
+val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
 checkAnswer(
-  sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+  df,
   (1 to 1).map(i => Row(i, i.toString, null)))
--- End diff --

Since you are updating this test case, this line can be simplified to 
`Row(1, "1", null)`.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50166678
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -330,9 +331,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
 
 // If the "c = 1" filter gets pushed down, this query will throw an exception which
 // Parquet emits. This is a Parquet issue (PARQUET-389).
+val df = sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a")
 checkAnswer(
-  sqlContext.read.parquet(pathOne, pathTwo).filter("c = 1").selectExpr("c", "b", "a"),
+  df,
   (1 to 1).map(i => Row(i, i.toString, null)))
+
+// The fields "a" and "c" only exist in one Parquet file.
+df.schema.fields.foreach { f =>
+  if (f.name == "a" || f.name == "c") {
+assert(f.metadata.contains("optional"))
+  }
+}
--- End diff --

I'd prefer the following version. If `a` or `c` doesn't even exist in 
`df.schema`, the current version still passes.

```scala
assert(df.schema("a").metadata.contains("optional"))
assert(df.schema("c").metadata.contains("optional"))
```





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50165611
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -207,11 +207,19 @@ private[sql] object ParquetFilters {
  */
   }
 
+  private def getFieldMap(dataType: DataType): Array[(String, DataType)] = dataType match {
--- End diff --

It's not quite intuitive why this method is needed. Can we add a comment 
here to explain why fields with "optional" metadata should be filtered? 
Corresponding JIRA links should also be mentioned.
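A sketch of what such a commented version might look like. The filtering logic and the `"optional"` key follow the PR's diff; the exact merged code and comment wording may differ:

```scala
import org.apache.spark.sql.types._

// Builds a name -> DataType map of the fields that filters may be pushed
// down on. Fields carrying the temporary "optional" metadata only exist in
// a subset of the merged Parquet files; pushing a filter on such a field
// down to a file that lacks the column makes parquet-mr throw an exception
// (PARQUET-389). See SPARK-11955; this can be removed once the upstream
// bug is fixed.
def getFieldMap(dataType: DataType): Array[(String, DataType)] =
  dataType match {
    case StructType(fields) =>
      fields.filterNot(_.metadata.contains("optional"))
        .map(f => f.name -> f.dataType)
    case _ => Array.empty[(String, DataType)]
  }
```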





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/9940#discussion_r50164899
  
--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala ---
@@ -122,7 +122,7 @@ class DataTypeSuite extends SparkFunSuite {
 val right = StructType(List())
 val merged = left.merge(right)
 
-assert(merged === left)
+assert(DataType.equalsIgnoreCompatibleNullability(merged, left))
--- End diff --

I guess we resort to `equalsIgnoreCompatibleNullability` because of the extra 
metadata in `merged`? Can we also add an assertion for the added metadata 
instead of working around it with `equalsIgnoreCompatibleNullability`?





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2016-01-19 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-172777996
  
@liancheng Can we revisit this now? Or do we want to wait a bit longer?





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-08 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162843069
  
@liancheng Sure. Thank you!





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-08 Thread liancheng
Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162839835
  
@viirya Thanks for your efforts! Would you mind me revisiting this after the 
1.6 release? I would like to see whether we can have PARQUET-389 fixed in the 
Parquet community ASAP, so that we may not need to work around it in Spark 1.7/2.0.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162336101
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47241/
Test PASSed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162336100
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162335953
  
**[Test build #47241 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47241/consoleFull)** for PR 9940 at commit [`40533a7`](https://github.com/apache/spark/commit/40533a79de07f1159d5f5e2a6327a160f278be0e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `public abstract static class PrefixComputer`





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-06 Thread viirya
Github user viirya commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162329711
  
ping @liancheng 

I have updated this per your suggestions. Now the metadata is removed 
before saving the Parquet file. I also added more test cases.
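The metadata-stripping step described here can be sketched roughly as below. This is a simplified, self-contained Scala analogue: the `DataType`, `StructType`, and `StructField` stand-ins are hypothetical stand-ins for the real classes in `org.apache.spark.sql.types`, and plain `Map[String, String]` stands in for Catalyst's `Metadata`/`MetadataBuilder`.

```scala
// Simplified stand-ins for Catalyst's type hierarchy, just to
// illustrate the recursive metadata-removal idea from the PR.
sealed trait DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructField(name: String, dataType: DataType, metadata: Map[String, String])
case class StructType(fields: Seq[StructField]) extends DataType

// Recursively drop a metadata key from every struct field,
// descending into nested structs and array element types.
def removeMetadata(key: String, dt: DataType): DataType = dt match {
  case StructType(fields) =>
    StructType(fields.map { f =>
      StructField(f.name, removeMetadata(key, f.dataType), f.metadata - key)
    })
  case ArrayType(elem) => ArrayType(removeMetadata(key, elem))
  case other => other
}

val schema = StructType(Seq(
  StructField("a", IntegerType, Map("optional" -> "true")),
  StructField("s",
    StructType(Seq(
      StructField("b", IntegerType, Map("optional" -> "true", "doc" -> "kept")))),
    Map.empty)))

// The "optional" marker is stripped everywhere; other metadata survives.
val cleaned = removeMetadata("optional", schema)
```

Because the walk rebuilds each `StructField` rather than mutating it, unrelated metadata keys (like `doc` above) are preserved, which is the behavior the actual patch needs before the schema is handed to the Parquet writer.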





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162329133
  
**[Test build #47241 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47241/consoleFull)** for PR 9940 at commit [`40533a7`](https://github.com/apache/spark/commit/40533a79de07f1159d5f5e2a6327a160f278be0e).





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162237940
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162237942
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47229/
Test PASSed.





[GitHub] spark pull request: [SPARK-11955][SQL] Mark optional fields in mer...

2015-12-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9940#issuecomment-162237893
  
**[Test build #47229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47229/consoleFull)** for PR 9940 at commit [`db8ffa3`](https://github.com/apache/spark/commit/db8ffa3b73c80ec8f311ece5b1b9e17d5d7257ba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.

