[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-10-15 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 Hi @WeichenXu123, any comments on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 Test case: ```scala test("performance of aggregateByKeyLocally ") { val random = new Random(1) val pairs = sc.parallelize(0 until 1000, 20) .map(p =>

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 It is better adding more perf test for `OpenHashSet` replacement to avoid perf regression. And I found `reduceByKeyLocally` also use `JHashSet`, I am not sure whether there is some special

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 OK, just keep it. Does this need more test or more improvements ? --- - To unsubscribe, e-mail:

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-24 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 @ConeyLiu Yes tree aggregate introduce extra shuffle. But it is possible to improve perf when driver total collecting data size from executors are large and there're many partitions. But I

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-23 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 Does not `treeAggregate` will introduce another `Shuffle`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 Oh I get your point. This is different from `RDD.aggregate`, it directly return Map and avoid shuffling. it seems useful when numKeys is small. But, I check the final `reduce` step, it

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 @jiangxb1987 ,@WeichenXu123, thanks for your reviewing. This change is inspired by the `TODO List`. You can see the follow code snippet: ```scala // TODO: Calling aggregateByKey and

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 Yes. I guess the perf gain is because, this PR use local hashmap which can use unlimited memory, but current spark aggregation impl, will auto spill local hashmap when exceeding a threshold.

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread VinceShieh
Github user VinceShieh commented on the issue: https://github.com/apache/spark/pull/19317 Nice catch. thanks. the perf gain is truly narrow. I believe this impl just tried to align with the impl of 'reduceByKeyLocally'. @ConeyLiu maybe we should revisit the code, along with

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread WeichenXu123
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19317 And I have to point out that your impl have high risk causing OOM. The current impl will auto spill when local hashmap is too large and can take advantage of spark auto memory management

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-22 Thread jiangxb1987
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/19317 cc @WeichenXu123 mind take a look? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19317 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional

[GitHub] spark issue #19317: [SPARK-22098][CORE] Add new method aggregateByKeyLocally...

2017-09-21 Thread ConeyLiu
Github user ConeyLiu commented on the issue: https://github.com/apache/spark/pull/19317 cc @VinceShieh --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: