[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 thanks @cloud-fan and @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22364 @mgaido91 well done! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22364 LGTM, merging to master! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 @cloud-fan @dongjoon-hyun @gatorsmile any luck with this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 I also run on the TPCDS and TPCH benchmark with 10 runs: Rule | Effective After | Effective Before | Total After | Total Before | % Eff | % Total -- | -- | -- | -- | -- | -- | -- org.apache.spark.sql.catalyst.optimizer.ColumnPruning (TPCDS) | 8492237639 | 9927405142 | 37554775945 | 44771729889 | 85.5433773 | 83.8805560 org.apache.spark.sql.catalyst.optimizer.ColumnPruning (TPCH) | 214235083 | 292912646 | 933477165 | 1166285066 | 73.1395813 | 80.0385079 -- | -- | -- | -- | -- | -- | -- Which confirm a ~20% improvement after the change (with wide-column datasets, the improvement is much higher as showed by the benchmark referenced in the PR description). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22364 also, cc: @dongjoon-hyun --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 thanks @maropu for your review! @gatorsmile do you have any comments? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22364 Basically, this change looks good to me. I leave this to other reviewers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22364 oh, yea, thanks! I wrongly mixed up `(AttributeSet -- Seq[Attribute]).nonEmpty` with this case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 @maropu I have run the following benchmark: ``` test("AttributeSet -- benchmark") { val attrSetA = AttributeSet((1 to 100).map { i => AttributeReference(s"c$i", IntegerType)() }) val attrSetB = attrSetA.take(80).toSeq val attrSetC = (1 to 100).map { i => AttributeReference(s"c2_$i", IntegerType)() } val attrSetD = (attrSetA.take(50) ++ attrSetC.take(50)).toSeq val attrSetE = attrSetC.take(50) ++ attrSetA.take(50) val n_iter = 100 val t0 = System.nanoTime() (1 to n_iter) foreach { _ => val r1 = attrSetA -- attrSetB val r2 = attrSetA -- attrSetC val r3 = attrSetA -- attrSetD val r4 = attrSetA -- attrSetE } val t1 = System.nanoTime() (1 to n_iter) foreach { _ => val r1 = attrSetA subsetOf AttributeSet(attrSetB) val r2 = attrSetA subsetOf AttributeSet(attrSetC) val r3 = attrSetA subsetOf AttributeSet(attrSetD) val r4 = attrSetA subsetOf AttributeSet(attrSetE) } val t2 = System.nanoTime() val totalTime1 = t1 - t0 val totalTime2 = t2 - t1 println(s"Average time for --: ${totalTime1 / n_iter} us") println(s"Average time for subsetOf: ${totalTime2 / n_iter} us") } ``` And the output is: ``` Average time for --: 25065 us Average time for subsetOf: 108638 us ``` So for the case you mentioned, using `subsetOf` would instead introduce a performance regression. I have also run all the tests in StarJoinCostBasedReorderSuite for 1000 times and the perf regression was confirmed: ``` Running StarJoinCostBasedReorderSuite's tests 1000 times takes w/o change: 68877186927 us Running StarJoinCostBasedReorderSuite's tests 1000 times takes with change: 70689955856 us ``` The point is that there we have a `Seq[Attribute]` instead of an `AttributeSet` as parameter. Hope this is clear, let me know otherwise. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 @maropu anyway I checked and that is the only other places where this pattern happens. So I am ok including it here. The point is that there the situation is a bit different, ie. it is not an `(AttributeSet -- AttributeSet).nonEmpty` case, but it is a `(AttributeSet -- Seq[Attribute]).nonEmpty`. So I am not sure the performance gain/difference will be significant in this case. I'll try and do some benchmarks and I'll do the change if I see a significant difference. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22364 IIUC this pr targets to improve `AttributeSet` operations, so all the places get the same luck with `ColumnPruning`? If so, I think its ok to fix all the places in this pr. cc: @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 @maropu yes, that can be done as well, but I think the main focus of this PR is the `ColumnPruning` rule, so I think it would be great to do that in a separate PR. What do you think? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/22364 Can we replace the syntax(`(ouputSetA -- outputSetB).nonEmpty`) in other places, too? e.g., https://github.com/apache/spark/blob/9deddbb13edebfefb3fd03f063679ed12e73c575/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala#L294 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 anymore comments @maropu @gatorsmile ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95957/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95957 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95957/testReport)** for PR 22364 at commit [`6aeda5f`](https://github.com/apache/spark/commit/6aeda5f9c44918a3361b59594faf2b345c9dac33). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95957 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95957/testReport)** for PR 22364 at commit [`6aeda5f`](https://github.com/apache/spark/commit/6aeda5f9c44918a3361b59594faf2b345c9dac33). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3023/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95944/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95944 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95944/testReport)** for PR 22364 at commit [`6aeda5f`](https://github.com/apache/spark/commit/6aeda5f9c44918a3361b59594faf2b345c9dac33). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95944 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95944/testReport)** for PR 22364 at commit [`6aeda5f`](https://github.com/apache/spark/commit/6aeda5f9c44918a3361b59594faf2b345c9dac33). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3014/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95869/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95869/testReport)** for PR 22364 at commit [`2afbe9b`](https://github.com/apache/spark/commit/2afbe9ba080b7807db3d7b22e42611f629926fd3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2981/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95869/testReport)** for PR 22364 at commit [`2afbe9b`](https://github.com/apache/spark/commit/2afbe9ba080b7807db3d7b22e42611f629926fd3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22364 cc @gatorsmile @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95828/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95828/testReport)** for PR 22364 at commit [`14edbe6`](https://github.com/apache/spark/commit/14edbe6a2fe8fab7131777302024b47ed19da513). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2944/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22364 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22364: [SPARK-25379][SQL] Improve AttributeSet and ColumnPrunin...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22364 **[Test build #95828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95828/testReport)** for PR 22364 at commit [`14edbe6`](https://github.com/apache/spark/commit/14edbe6a2fe8fab7131777302024b47ed19da513). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org