Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19317
Hi @WeichenXu123, any comments on this?
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19317
Test case:
```scala
import scala.util.Random

test("performance of aggregateByKeyLocally") {
  val random = new Random(1)
  val pairs = sc.parallelize(0 until 1000, 20)
    .map(p => (random.nextInt(10), p)) // assumed key/value shape; the snippet is truncated here in the archive
  // ...
}
```
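A plausible shape for the rest of that benchmark, sketched for readers of the archive: the timing helper is hypothetical, and `aggregateByKeyLocally`'s signature is assumed to mirror `aggregateByKey`, since the original snippet is cut off.
```scala
// Hypothetical timing helper; assumes `pairs: RDD[(Int, Int)]` from the test above.
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

// Baseline: shuffle, then collect the per-key results to the driver.
time("aggregateByKey + collectAsMap")(
  pairs.aggregateByKey(0L)(_ + _, _ + _).collectAsMap())
// Candidate: merge per-partition maps on the driver, no shuffle.
time("aggregateByKeyLocally")(
  pairs.aggregateByKeyLocally(0L)(_ + _, _ + _))
```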
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
It would be better to add more perf tests for the `OpenHashSet` replacement, to avoid a perf regression. And I found that `reduceByKeyLocally` also uses `JHashMap`; I am not sure whether there is some special ...
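For reference, the pattern `reduceByKeyLocally` follows is roughly the sketch below (a simplified reconstruction, not the actual Spark source): each partition folds its pairs into a `JHashMap`, and the driver merges the per-partition maps with `reduce`, with no shuffle involved. The function name here is illustrative.
```scala
import java.util.{HashMap => JHashMap}
import org.apache.spark.rdd.RDD

// Simplified two-level merge: per-partition JHashMaps, merged on the driver.
def reduceByKeyLocallySketch[K, V](rdd: RDD[(K, V)])(func: (V, V) => V): JHashMap[K, V] = {
  def mergePair(map: JHashMap[K, V], pair: (K, V)): JHashMap[K, V] = {
    val (k, v) = pair
    map.put(k, if (map.containsKey(k)) func(map.get(k), v) else v)
    map
  }
  def mergeMaps(m1: JHashMap[K, V], m2: JHashMap[K, V]): JHashMap[K, V] = {
    val it = m2.entrySet().iterator()
    while (it.hasNext) { val e = it.next(); mergePair(m1, (e.getKey, e.getValue)) }
    m1
  }
  rdd.mapPartitions(iter => Iterator(iter.foldLeft(new JHashMap[K, V]())(mergePair)))
    .reduce(mergeMaps)
}
```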
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19317
OK, I'll just keep it. Does this need more tests or further improvements?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
@ConeyLiu Yes, `treeAggregate` introduces an extra shuffle. But it can still improve perf when the total data size the driver collects from the executors is large and there are many partitions.
But I ...
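For context, that trade-off is opt-in via `treeAggregate`'s `depth` parameter; a minimal example (the data and partition counts are made up for illustration):
```scala
// With depth = 2, partial results are combined on executors in an
// intermediate round before the driver merges them, so the driver
// sees far fewer partials than one per partition.
val data = sc.parallelize(0L until 10000000L, 1000)
val total = data.treeAggregate(0L)(
  seqOp = _ + _,   // fold a record into a partition's accumulator
  combOp = _ + _,  // merge two accumulators
  depth = 2        // each extra level costs an extra aggregation round
)
```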
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19317
Won't `treeAggregate` introduce another shuffle?
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
Oh, I get your point. This is different from `RDD.aggregate`: it directly returns a `Map` and avoids the shuffle, which seems useful when `numKeys` is small.
But when I check the final `reduce` step, it ...
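To make the contrast concrete, here is a sketch of the two paths, assuming an `RDD[(Int, Long)]` named `pairs`; `aggregateByKeyLocally`'s signature is assumed to mirror `aggregateByKey`, since the PR's exact API isn't shown in this thread.
```scala
// Shuffle path: aggregateByKey shuffles by key, then collectAsMap()
// pulls the per-key results to the driver (two stages).
val viaShuffle: scala.collection.Map[Int, Long] =
  pairs.aggregateByKey(0L)(_ + _, _ + _).collectAsMap()

// Local path: per-partition maps are merged directly on the driver,
// skipping the shuffle -- attractive when the number of keys is small.
val viaLocal: scala.collection.Map[Int, Long] =
  pairs.aggregateByKeyLocally(0L)(_ + _, _ + _)
```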
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19317
@jiangxb1987, @WeichenXu123, thanks for the review. This change is inspired by the `TODO` list. You can see the following code snippet:
```scala
// TODO: Calling aggregateByKey and ...
```
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
Yes. I guess the perf gain is because this PR uses a local hashmap that can use unlimited memory, while the current Spark aggregation impl automatically spills the local hashmap when it exceeds a threshold.
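A toy illustration of that difference (the threshold is hypothetical and the spill is stubbed; Spark's real path uses `ExternalAppendOnlyMap` under its memory manager):
```scala
import scala.collection.mutable

// This PR's approach: one unbounded in-memory map per partition.
// Fast, but memory use grows with key cardinality and can OOM.
def aggregateUnbounded(iter: Iterator[(Int, Long)]): mutable.HashMap[Int, Long] = {
  val acc = mutable.HashMap.empty[Int, Long]
  iter.foreach { case (k, v) => acc(k) = acc.getOrElse(k, 0L) + v }
  acc
}

// Spark's aggregation, schematically: spill the map once it passes a
// threshold, trading (de)serialization cost for bounded memory.
def aggregateWithSpill(iter: Iterator[(Int, Long)], threshold: Int = 10000): Unit = {
  val acc = mutable.HashMap.empty[Int, Long]
  iter.foreach { case (k, v) =>
    acc(k) = acc.getOrElse(k, 0L) + v
    if (acc.size > threshold) {
      // spillToDisk(acc)  // stub: the real impl sorts and writes a spill file
      acc.clear()
    }
  }
}
```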
Github user VinceShieh commented on the issue:
https://github.com/apache/spark/pull/19317
Nice catch, thanks. The perf gain is indeed narrow.
I believe this impl just tried to align with the impl of `reduceByKeyLocally`.
@ConeyLiu maybe we should revisit the code, along with ...
Github user WeichenXu123 commented on the issue:
https://github.com/apache/spark/pull/19317
And I have to point out that your impl has a high risk of causing OOM. The current impl automatically spills when the local hashmap grows too large, and it can take advantage of Spark's automatic memory management ...
Github user jiangxb1987 commented on the issue:
https://github.com/apache/spark/pull/19317
cc @WeichenXu123 mind taking a look?
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/19317
Can one of the admins verify this patch?
Github user ConeyLiu commented on the issue:
https://github.com/apache/spark/pull/19317
cc @VinceShieh