[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r72009385
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 
 /**
  * Pushes projects down beneath Sample to enable column pruning with 
sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

Yeah, I will update this. At least one optimizer test uses this rule; that 
test should be changed too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r72008673
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 
 /**
  * Pushes projects down beneath Sample to enable column pruning with 
sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

In the last case in `ColumnPruning`, it will generate a new `Project` under 
`Sample`.





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r72008576
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 
 /**
  * Pushes projects down beneath Sample to enable column pruning with 
sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

I'm not sure which part you mean. I don't see `ColumnPruning` handling `Sample`?





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r72008303
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 
 /**
  * Pushes projects down beneath Sample to enable column pruning with 
sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

ah, looks like `ColumnPruning` already handles it, can we just remove this 
rule?





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r72008210
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 
 /**
  * Pushes projects down beneath Sample to enable column pruning with 
sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

should we merge this rule into `ColumnPruning`?





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r72002405
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on 
downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
+  require(n != 1, "udfOne shouldn't see swid=1!")
+  1
+})
+
+val d = Seq(
+  (0, "string0"),
+  (1, "string1"),
+  (2, "string2"),
+  (3, "string3"),
+  (4, "string4"),
+  (5, "string5"),
+  (6, "string6"),
+  (7, "string7"),
+  (8, "string8"),
+  (9, "string9")
+)
+val df = spark.createDataFrame(d).toDF("swid", "stringData")
+val sampleDF = df.sample(false, 0.7, 50)
+// After sampling, sampleDF doesn't contain swid=1.
+assert(!sampleDF.select("swid").collect.contains(1))
+// udfOne should not encounter swid=1.
+sampleDF.select(udfOne($"swid")).collect
--- End diff --

`Sample` filters out the row with `swid=1`, so its value never reaches the 
result. I simply call `collect` to verify whether the exception is thrown or not.





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71992303
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on 
downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
+  require(n != 1, "udfOne shouldn't see swid=1!")
+  1
+})
+
+val d = Seq(
+  (0, "string0"),
+  (1, "string1"),
+  (2, "string2"),
+  (3, "string3"),
+  (4, "string4"),
+  (5, "string5"),
+  (6, "string6"),
+  (7, "string7"),
+  (8, "string8"),
+  (9, "string9")
+)
+val df = spark.createDataFrame(d).toDF("swid", "stringData")
+val sampleDF = df.sample(false, 0.7, 50)
+// After sampling, sampleDF doesn't contain swid=1.
+assert(!sampleDF.select("swid").collect.contains(1))
+// udfOne should not encounter swid=1.
+sampleDF.select(udfOne($"swid")).collect
--- End diff --

I assume you're calling `collect` to trigger the `assert`, aren't you? If so, 
why not return `true`/`false` from the function and do the `assert` here instead?
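A Spark-free sketch of the shape being suggested here: record what the 
function saw while it ran, then assert on that record afterwards. Every name 
below is illustrative, not taken from the PR:

```scala
import scala.collection.mutable

// Record every value the function is applied to, instead of throwing
// from inside it; the assertion then lives in the test body.
val seen = mutable.Set.empty[Int]
def udfOne(n: Int): Int = { seen += n; 1 }

val sampledRows = Seq(0, 2, 3, 4)  // stand-in for the sampled Dataset
sampledRows.map(udfOne)            // stand-in for select(udfOne($"swid")).collect
assert(!seen.contains(1))          // udfOne never saw swid = 1
```

Whether this reads better than throwing from inside the function is a matter 
of taste; it does keep the test's assertion visible at the call site.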





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71992270
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on 
downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
--- End diff --

Is there a reason why you use `spark.udf.register` instead of `udf` directly?

```scala
val udfOne = udf { n: Int => ... }
```





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71992254
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on 
downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
+  require(n != 1, "udfOne shouldn't see swid=1!")
+  1
+})
+
+val d = Seq(
+  (0, "string0"),
+  (1, "string1"),
+  (2, "string2"),
+  (3, "string3"),
+  (4, "string4"),
+  (5, "string5"),
+  (6, "string6"),
+  (7, "string7"),
+  (8, "string8"),
+  (9, "string9")
+)
+val df = spark.createDataFrame(d).toDF("swid", "stringData")
--- End diff --

`d.toDF(...)` should work too, shouldn't it?





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-24 Thread jaceklaskowski
Github user jaceklaskowski commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71992242
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ---
@@ -150,13 +150,20 @@ class SimpleTestOptimizer extends Optimizer(
 
 /**
  * Pushes projects down beneath Sample to enable column pruning with 
sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
 // Push down projection into sample
-case Project(projectList, Sample(lb, up, replace, seed, child)) =>
+case p @ Project(projectList, Sample(lb, up, replace, seed, child))
+if !hasNewOutput(projectList, p.child.output) =>
   Sample(lb, up, replace, seed, Project(projectList, child))()
   }
+  private def hasNewOutput(
+  projectList: Seq[NamedExpression],
+  childOutput: Seq[Attribute]): Boolean = {
+projectList.exists(p => !childOutput.exists(_.semanticEquals(p)))
--- End diff --

It's hard to understand what the code does -- two `exists` and negation. 
Can you "untangle" it?
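One possible untangling flips the outer negation with De Morgan's law, turning 
the double `exists` into a single `forall`: the project list adds a new 
attribute exactly when not every projected expression already appears in the 
child's output. A self-contained sketch, substituting strings for Catalyst 
expressions and plain membership for `semanticEquals` (both substitutions are 
assumptions of this sketch, not the PR's code):

```scala
// Equivalent to: projectList.exists(p => !childOutput.exists(_ == p))
def hasNewOutput(projectList: Seq[String], childOutput: Seq[String]): Boolean =
  !projectList.forall(p => childOutput.contains(p))

// udfOne(swid) is not among the child's attributes, so it counts as new:
assert(hasNewOutput(Seq("udfOne(swid)"), Seq("swid", "stringData")))
// Plain column pruning adds nothing new:
assert(!hasNewOutput(Seq("swid"), Seq("swid", "stringData")))
```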





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71972546
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on 
downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
+  if (n == 1) {
+throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --

Thanks! I've updated it.





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-23 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/14327#discussion_r71971233
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala 
---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with 
SharedSQLContext {
   3, 17, 27, 58, 62)
   }
 
+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on 
downstream usage") {
+val udfOne = spark.udf.register("udfOne", (n: Int) => {
+  if (n == 1) {
+throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --

Use `require`? Generally `RuntimeException` isn't used directly. Really minor.





[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed do...

2016-07-23 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/14327

[SPARK-16686][SQL] Project shouldn't be pushed down through Sample if it 
has new output

## What changes were proposed in this pull request?

We push down `Project` through `Sample` in the `Optimizer`. However, if the 
projected columns produce new output, they will see the whole data instead of 
the sampled data. This brings an inconsistency between the original plan 
(Sample then Project) and the optimized plan (Project then Sample). In the 
extreme case, such as the one attached in the JIRA, if the projected column is 
a UDF that is not supposed to see the sampled-out data, the result of the UDF 
will be incorrect.

We shouldn't push down `Project` through `Sample` if the `Project` brings new 
output.
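The inconsistency can be sketched without Spark at all. In the toy below, 
"sampling" is just a deterministic filter that drops `swid=1`, and the UDF 
fails when it sees the value sampling would have removed; all names are 
illustrative stand-ins for the JIRA's reproduction:

```scala
// rows stands in for the relation, sample for Sample, udfOne for the UDF.
val rows = Seq(0, 1, 2, 3, 4)
def sample(xs: Seq[Int]): Seq[Int] = xs.filterNot(_ == 1)
def udfOne(n: Int): Int = { require(n != 1, "udfOne shouldn't see swid=1!"); 1 }

// Original plan, Sample then Project: udfOne never sees the dropped row.
val original = sample(rows).map(udfOne)

// Pushed-down plan, Project then Sample: udfOne runs over the whole
// input first and hits swid = 1 before any sampling happens.
val pushedDownFails =
  try { sample(rows.map(udfOne)); false }
  catch { case _: IllegalArgumentException => true }
assert(pushedDownFails)
```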

## How was this patch tested?

Jenkins tests.





You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 fix-sample-pushdown

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14327.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14327


commit 9521a5aca87bead3dcfeabd7abe3468194984ea3
Author: Liang-Chi Hsieh 
Date:   2016-07-23T10:13:07Z

Project shouldn't be pushed down through Sample if it has new output.



