[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 Yeah! Let me close this for now. We can discuss this further in the Jira ticket then. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20806 great! maybe we can hold this PR for a real SQL tree aggregate in the future, with some proper design and discussion. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @cloud-fan @WeichenXu123 Ok. I've setup a Spark cluster with 5 nodes for the benchmark. The used data: ``` val r = new Random val ds = (0 to 1).map { _ => val a = Array.fill(1)(if (r.nextDouble() > 0.5) 1.0 else 0.0 ) Tuple1(Vectors.dense(a)) }.toDS ``` Two versions of `treeAggregate` perform very close. Thus, directly using `RDD.treeAggregate` can be much simpler. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @WeichenXu123 As the discussion with @cloud-fan at https://github.com/apache/spark/pull/20806#discussion_r174277864, I'd like to see some performance gain it, but I need to run benchmark to see if there is any. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 @viirya ok. but there're already a class in ML use `TypedImperativeAggregator`, see `Summarizer`. And do you benchmark and compare this PR and `df.rdd.treeAggregate`? Seems they're almost the same. Is there some difference which can make remarkable performance improvement ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @WeichenXu123 The `seqOp`/`comOp` can be arbitrary and works on domain objects, I'm not sure if built-in agg functions can satisfy all the use of it. For example, it seems hard to express `IDF.DocumentFrequencyAggregator` in built-in agg functions if any. One possible way is to use `Aggregator` and developers can write their aggregation function when doing treeAggregate. One advantage of `seqOp`/`comOp` is that ML developers don't need to learn how to write `Aggregator`. It may let them exposed to some concepts like `Encoder`. I have concerned that ML developer should know this or not. Anyway, to work with built-in agg functions or `Aggregator`, because it uses SQL aggregation system, we may need to overhaul the current aggregation system to support tree-style aggregation. Although it can benefit more situations not just ML, it needs more thinking and design. You can think of this as a workaround for now. Thus it is only for private use. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 But I haven't benchmark. Maybe it do not worth to do codegen for treeAggregate. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 @viirya Yes. `treeAggregate` should only apply to global aggregate. But in this PR the API have to use `seqOp`/`combOp`. What I expect is that the dataframe version treeAggregate can exploit built-in agg function (suppose in the future we have built-in agg function for vector type) `dataset.groupBy()` if do not given any key column then it will group the whole dataset so it can match the case of treeAggregate, or do you have some better design ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 @WeichenXu123 I feel `groupBy` is more SQL-like aggregation by which we can specify a key to grouping by. At least `rdd.treeAggregate` does not support key-specified aggregation. For typed grouping `groupByKey`, it constructs `KeyValueGroupedDataset` by which we rely on SQL `Aggregate` execution to grouping data. Currently it doesn't support tree-based aggregation. This work doesn't intend to overhaul SQL aggregation to support tree-based aggregation. So the API will looks more like as is. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/20806 The API seems not dataframe style. What I expect is something like: ``` dataset.groupBy().setAggregateLevel(2).agg(Map("age" -> "max", "salary" -> "avg")) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88200/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88200 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88200/testReport)** for PR 20806 at commit [`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88200 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88200/testReport)** for PR 20806 at commit [`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1482/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88194/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88194 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88194/testReport)** for PR 20806 at commit [`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e). * This patch **fails due to an unknown error code, -9**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20806 **[Test build #88194 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88194/testReport)** for PR 20806 at commit [`a254d15`](https://github.com/apache/spark/commit/a254d1501c0119b4881c0443f28c263f0c9dec0e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1478/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20806 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20806: [SPARK-23661][SQL] Implement treeAggregate on Dataset AP...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20806 cc @dbtsai @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org