[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/13807


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67968854
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,28 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
+val serializer = encoders.zipWithIndex.map { case (enc, index) =>
+  val originalInputObject = enc.serializer.head.collect { case b: BoundReference => b }.head
--- End diff --

yea, 
https://github.com/apache/spark/pull/13807/files#diff-91c617f2464cea010922328f4cdbbda9R227





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67965175
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,28 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
--- End diff --

this line: 
https://github.com/apache/spark/pull/13463/files#diff-87cabbe4d0c794f02523ecc1764955d0L88
we create the struct directly and don't consider the null case





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67933216
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,25 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
+val serializer = encoders.zipWithIndex.map { case (enc, index) =>
+  val originalInputObject = enc.serializer.head.collect { case b: BoundReference => b }.head
+  val newInputObject = Invoke(
+BoundReference(0, ObjectType(cls), nullable = true),
+s"_${index + 1}",
+originalInputObject.dataType)
+
+  val newSerializer = enc.serializer.map(_.transformUp {
+case b: BoundReference if b == originalInputObject => newInputObject
+  })
+
+  if (enc.flat) {
+newSerializer.head
+  } else {
+val struct = CreateStruct(newSerializer)
+val nullCheck = Or(
+  IsNull(newInputObject),
+  Invoke(Literal.fromObject(None), "equals", BooleanType, newInputObject :: Nil))
+If(nullCheck, Literal.create(null, struct.dataType), struct)
--- End diff --

Also, let's put examples in the comment.





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67932528
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,28 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
--- End diff --

Can you also comment on the line that caused the problem?





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67932028
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,28 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
+val serializer = encoders.zipWithIndex.map { case (enc, index) =>
+  val originalInputObject = enc.serializer.head.collect { case b: BoundReference => b }.head
--- End diff --

Do we have any assumptions here?





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67931868
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,28 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
+val serializer = encoders.zipWithIndex.map { case (enc, index) =>
+  val originalInputObject = enc.serializer.head.collect { case b: BoundReference => b }.head
--- End diff --

Safe to call `head`?





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67912070
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -830,6 +830,13 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
   ds.dropDuplicates("_1", "_2"),
   ("a", 1), ("a", 2), ("b", 1))
   }
+
+  test("SPARK-16097: Encoders.tuple should handle null object correctly") {
+val enc = Encoders.tuple(Encoders.tuple(Encoders.STRING, Encoders.STRING), Encoders.STRING)
+val data = Seq((("a", "b"), "c"), (null, "d"))
+val ds = spark.createDataset(data)(enc)
+checkDataset(ds, (("a", "b"), "c"), (null, "d"))
+  }
--- End diff --

Is this equivalent to the original failing case? It seems not. For an outer
join case, will we drop the null if `ExpressionEncoder` is not used?





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67887470
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -830,6 +830,13 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
   ds.dropDuplicates("_1", "_2"),
   ("a", 1), ("a", 2), ("b", 1))
   }
+
+  test("SPARK-16097: Encoders.tuple should handle null object correctly") {
+val enc = Encoders.tuple(Encoders.tuple(Encoders.STRING, Encoders.STRING), Encoders.STRING)
+val data = Seq((("a", "b"), "c"), (null, "d"))
+val ds = spark.createDataset(data)(enc)
+checkDataset(ds, (("a", "b"), "c"), (null, "d"))
+  }
--- End diff --

Actually that is not a valid test, as `ExpressionEncoder` is private and
should not be used externally. But it does expose this bug, so I added this
test case, which is valid externally.





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67880034
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -830,6 +830,13 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
   ds.dropDuplicates("_1", "_2"),
   ("a", 1), ("a", 2), ("b", 1))
   }
+
+  test("SPARK-16097: Encoders.tuple should handle null object correctly") {
+val enc = Encoders.tuple(Encoders.tuple(Encoders.STRING, Encoders.STRING), Encoders.STRING)
+val data = Seq((("a", "b"), "c"), (null, "d"))
+val ds = spark.createDataset(data)(enc)
+checkDataset(ds, (("a", "b"), "c"), (null, "d"))
+  }
--- End diff --

Shall we add the original outer join test case here?





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread liancheng
Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/13807#discussion_r67879774
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala ---
@@ -110,16 +110,25 @@ object ExpressionEncoder {

val cls = Utils.getContextOrSparkClassLoader.loadClass(s"scala.Tuple${encoders.size}")
 
-val serializer = encoders.map {
-  case e if e.flat => e.serializer.head
-  case other => CreateStruct(other.serializer)
-}.zipWithIndex.map { case (expr, index) =>
-  expr.transformUp {
-case BoundReference(0, t, _) =>
-  Invoke(
-BoundReference(0, ObjectType(cls), nullable = true),
-s"_${index + 1}",
-t)
+val serializer = encoders.zipWithIndex.map { case (enc, index) =>
+  val originalInputObject = enc.serializer.head.collect { case b: BoundReference => b }.head
+  val newInputObject = Invoke(
+BoundReference(0, ObjectType(cls), nullable = true),
+s"_${index + 1}",
+originalInputObject.dataType)
+
+  val newSerializer = enc.serializer.map(_.transformUp {
+case b: BoundReference if b == originalInputObject => newInputObject
+  })
+
+  if (enc.flat) {
+newSerializer.head
+  } else {
+val struct = CreateStruct(newSerializer)
+val nullCheck = Or(
+  IsNull(newInputObject),
+  Invoke(Literal.fromObject(None), "equals", BooleanType, newInputObject :: Nil))
+If(nullCheck, Literal.create(null, struct.dataType), struct)
--- End diff --

This part is quite tricky; let's add a comment here explaining why we need to
substitute the input object and add the extra null check.
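
To make that concrete, here is a plain-Scala sketch of the intended semantics, not the actual generated expressions; the names `serializeStruct`, `inner`, and `build` are illustrative, not Spark API:

```scala
// Hypothetical sketch of the semantics the generated expressions implement:
// substitute the tuple field access for the original input object, then
// guard struct construction against both null and None inner objects.
object NullCheckSketch {
  // `inner` stands for the substituted input object (a field of the outer
  // tuple); `build` stands for the field serializer expressions.
  def serializeStruct[A](inner: A)(build: A => Seq[Any]): Seq[Any] =
    // Mirrors If(IsNull(newInputObject) || None.equals(newInputObject), null, struct)
    if (inner == null || None.equals(inner)) null else build(inner)

  def main(args: Array[String]): Unit = {
    val fields = (t: (String, String)) => Seq(t._1, t._2)
    assert(serializeStruct(("a", "b"))(fields) == Seq("a", "b"))
    assert(serializeStruct(null.asInstanceOf[(String, String)])(fields) == null)
  }
}
```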





[GitHub] spark pull request #13807: [SPARK-16097][SQL] Encoders.tuple should handle n...

2016-06-21 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/13807

[SPARK-16097][SQL] Encoders.tuple should handle null object correctly

## What changes were proposed in this pull request?

Although the top-level input object cannot be null, when we use
`Encoders.tuple` to combine two encoders, their input objects are no longer
top level and can be null. We should handle this case.
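
A minimal plain-Scala illustration of the problem (hypothetical sketch, not Spark code):

```scala
// Hypothetical sketch: the inner object of a combined encoder is a field of
// the outer tuple and can be null, so dereferencing it without a guard
// throws a NullPointerException.
object InnerNullSketch {
  def main(args: Array[String]): Unit = {
    val row: ((String, String), String) = (null, "d")

    // Naive per-field serialization dereferences the null inner tuple:
    val naive =
      try Some((row._1._1, row._1._2))
      catch { case _: NullPointerException => None }
    assert(naive.isEmpty)

    // With a null check, the inner struct simply becomes null instead:
    val guarded = if (row._1 == null) null else (row._1._1, row._1._2)
    assert(guarded == null)
  }
}
```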


## How was this patch tested?

New test in `DatasetSuite`.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark bug

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13807.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13807


commit 45f74f624ba8774c3735896915995c4d81785884
Author: Wenchen Fan 
Date:   2016-06-21T11:31:46Z

Encoders.tuple should handle null object correctly



