[GitHub] spark pull request #14327: [SPARK-16686][SQL] Project shouldn't be pushed down through Sample if it has new output
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72009385

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 /**
  * Pushes projects down beneath Sample to enable column pruning with sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

Yeah, I will update this. At least one optimizer test uses this rule; that test should be changed too.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008673

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 /**
  * Pushes projects down beneath Sample to enable column pruning with sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

The last case in `ColumnPruning`: it will generate a new `Project` under `Sample`.
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008576

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 /**
  * Pushes projects down beneath Sample to enable column pruning with sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

Not sure which part you mean? I don't see `ColumnPruning` handling `Sample`.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008303

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 /**
  * Pushes projects down beneath Sample to enable column pruning with sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

Ah, it looks like `ColumnPruning` already handles it; can we just remove this rule?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72008210

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -150,11 +150,13 @@ class SimpleTestOptimizer extends Optimizer(
 /**
  * Pushes projects down beneath Sample to enable column pruning with sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
--- End diff --

Should we merge this rule into `ColumnPruning`?
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r72002405

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      require(n != 1, "udfOne shouldn't see swid=1!")
+      1
+    })
+
+    val d = Seq(
+      (0, "string0"),
+      (1, "string1"),
+      (2, "string2"),
+      (3, "string3"),
+      (4, "string4"),
+      (5, "string5"),
+      (6, "string6"),
+      (7, "string7"),
+      (8, "string8"),
+      (9, "string9")
+    )
+    val df = spark.createDataFrame(d).toDF("swid", "stringData")
+    val sampleDF = df.sample(false, 0.7, 50)
+    // After sampling, sampleDF doesn't contain swid=1.
+    assert(!sampleDF.select("swid").collect.contains(1))
+    // udfOne should not encounter swid=1.
+    sampleDF.select(udfOne($"swid")).collect
--- End diff --

`Sample` filters out the row with `swid=1`. So I simply call `collect` to verify whether the exception is thrown or not.
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71992303

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      require(n != 1, "udfOne shouldn't see swid=1!")
+      1
+    })
+
+    val d = Seq(
+      (0, "string0"),
+      (1, "string1"),
+      (2, "string2"),
+      (3, "string3"),
+      (4, "string4"),
+      (5, "string5"),
+      (6, "string6"),
+      (7, "string7"),
+      (8, "string8"),
+      (9, "string9")
+    )
+    val df = spark.createDataFrame(d).toDF("swid", "stringData")
+    val sampleDF = df.sample(false, 0.7, 50)
+    // After sampling, sampleDF doesn't contain swid=1.
+    assert(!sampleDF.select("swid").collect.contains(1))
+    // udfOne should not encounter swid=1.
+    sampleDF.select(udfOne($"swid")).collect
--- End diff --

I assume you're calling `collect` to trigger the assertion, aren't you? If so, why don't you return `true`/`false` to denote it and do `assert` here instead?
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71992270

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
--- End diff --

Is there a reason why you use `spark.udf.register` rather than `udf` directly?

```
val udfOne = udf { n: Int =>
  ...
}
```
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71992254

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,32 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      require(n != 1, "udfOne shouldn't see swid=1!")
+      1
+    })
+
+    val d = Seq(
+      (0, "string0"),
+      (1, "string1"),
+      (2, "string2"),
+      (3, "string3"),
+      (4, "string4"),
+      (5, "string5"),
+      (6, "string6"),
+      (7, "string7"),
+      (8, "string8"),
+      (9, "string9")
+    )
+    val df = spark.createDataFrame(d).toDF("swid", "stringData")
--- End diff --

`d.toDF(...)` should work too, shouldn't it?
Github user jaceklaskowski commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71992242

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -150,13 +150,20 @@ class SimpleTestOptimizer extends Optimizer(
 /**
  * Pushes projects down beneath Sample to enable column pruning with sampling.
+ * This rule is only doable when the projects don't add new attributes.
  */
 object PushProjectThroughSample extends Rule[LogicalPlan] {
   def apply(plan: LogicalPlan): LogicalPlan = plan transform {
     // Push down projection into sample
-    case Project(projectList, Sample(lb, up, replace, seed, child)) =>
+    case p @ Project(projectList, Sample(lb, up, replace, seed, child))
+        if !hasNewOutput(projectList, p.child.output) =>
       Sample(lb, up, replace, seed, Project(projectList, child))()
   }
+
+  private def hasNewOutput(
+      projectList: Seq[NamedExpression],
+      childOutput: Seq[Attribute]): Boolean = {
+    projectList.exists(p => !childOutput.exists(_.semanticEquals(p)))
--- End diff --

It's hard to understand what the code does: two `exists` and a negation. Can you "untangle" it?
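One way the double-`exists`-plus-negation could be untangled is to state the positive property ("every projected expression is already in the child's output") and negate once at the call site. A minimal runnable sketch, where `Attr` and its `semanticEquals` are hypothetical stand-ins for Catalyst's `Attribute`/`NamedExpression`, not Spark's real classes:

```
// `Attr` is an illustrative stand-in, just enough to exercise the check.
case class Attr(exprId: Long, name: String) {
  def semanticEquals(other: Attr): Boolean = exprId == other.exprId
}

object UntangleDemo extends App {
  // Original shape: two nested `exists` plus a negation.
  def hasNewOutput(projectList: Seq[Attr], childOutput: Seq[Attr]): Boolean =
    projectList.exists(p => !childOutput.exists(_.semanticEquals(p)))

  // Untangled: the positive property, negated once where it is used.
  def allInChildOutput(projectList: Seq[Attr], childOutput: Seq[Attr]): Boolean =
    projectList.forall(p => childOutput.exists(_.semanticEquals(p)))

  val child    = Seq(Attr(1, "swid"), Attr(2, "stringData"))
  val sameCols = Seq(Attr(1, "swid"))          // no new attributes
  val newCol   = Seq(Attr(3, "udfOne(swid)"))  // a freshly produced attribute

  assert(!hasNewOutput(sameCols, child))
  assert(hasNewOutput(newCol, child))
  // The two formulations agree on both cases.
  assert(hasNewOutput(sameCols, child) == !allInChildOutput(sameCols, child))
  assert(hasNewOutput(newCol, child) == !allInChildOutput(newCol, child))
}
```

The `forall` version reads as the condition the rule actually needs ("may push down only if nothing new is produced"), which is arguably what the review comment is asking for.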
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71972546

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      if (n == 1) {
+        throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --

Thanks! I've updated it.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/14327#discussion_r71971233

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala ---
@@ -422,6 +422,35 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
       3, 17, 27, 58, 62)
   }

+  test("SPARK-16686: Dataset.sample with seed results shouldn't depend on downstream usage") {
+    val udfOne = spark.udf.register("udfOne", (n: Int) => {
+      if (n == 1) {
+        throw new RuntimeException("udfOne shouldn't see swid=1!")
--- End diff --

Use `require`? Generally, `RuntimeException` isn't used directly. Really minor.
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/14327

[SPARK-16686][SQL] Project shouldn't be pushed down through Sample if it has new output

## What changes were proposed in this pull request?

We push down `Project` through `Sample` in `Optimizer`. However, if the projected columns produce new output, they will see the whole data instead of the sampled data. This brings an inconsistency between the original plan (Sample then Project) and the optimized plan (Project then Sample). In the extreme case, such as the one attached in the JIRA, if the projected column is a UDF that is not supposed to see the sampled-out data, the result of the UDF will be incorrect. We shouldn't push down `Project` through `Sample` if the `Project` brings new output.

## How was this patch tested?

Jenkins tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-sample-pushdown

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14327.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14327

commit 9521a5aca87bead3dcfeabd7abe3468194984ea3
Author: Liang-Chi Hsieh
Date: 2016-07-23T10:13:07Z

    Project shouldn't be pushed down through Sample if it has new output.
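The inconsistency described above can be sketched with plain collections. This is a toy model, not Spark's actual `Sample` operator: `keep` is a hypothetical, deterministic stand-in for seeded sampling, and a side-effect-recording function stands in for the UDF.

```
// Toy model: a Seq stands in for a Dataset, `keep` for seeded sampling.
object PushdownDemo extends App {
  val rows = Seq(0, 1, 2, 3, 4, 5)

  // Deterministic stand-in for Sample with a fixed seed: keep even swids.
  def keep(swid: Int): Boolean = swid % 2 == 0

  var seen = Set.empty[Int] // records which rows the "UDF" observes
  def udfOne(n: Int): Int = { seen += n; 1 }

  // Original plan Project(Sample(child)): sample first, then evaluate the UDF.
  seen = Set.empty
  rows.filter(keep).map(n => (n, udfOne(n)))
  assert(seen == Set(0, 2, 4)) // the UDF only sees sampled rows

  // Pushed-down plan Sample(Project(child)): the UDF runs beneath the sample.
  seen = Set.empty
  rows.map(n => (n, udfOne(n))).filter { case (n, _) => keep(n) }
  assert(seen == Set(0, 1, 2, 3, 4, 5)) // the UDF now sees every row
}
```

Both plans return the same projected rows here, which is exactly why the push-down looks harmless; the difference only shows up in what the new expression observes, as with the `require(n != 1, ...)` UDF in the test added by this PR.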