[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-12-11 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15713 Although I'd love to find we can get rid of a custom collections class, it seems like we can't do that and that the optimization here doesn't result in a win. Let's close this for now. --- If your

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-12-01 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15713 @a-roberts feel free to continue investigating, but at some point it's also important to ask the question: is this micro-optimization worth it? It only improves 4% for a very specific workload based on

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-12-01 Thread a-roberts
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15713 Performance results against the Spark master branch on a 48 core machine running PageRank with 500k pages follow **Vanilla CompactBuffer, no changes, run time and throughput (bytes per se

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-30 Thread a-roberts
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15713 In response to @rxin's question, for HiBench CompactBuffers are **used only on PageRank** (none of the other 11) and these buffers mainly have between 3 and 40 elements, no more than 60, never wit

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-28 Thread a-roberts
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15713 I'll add a print to keep track of how big the CompactBuffer actually gets when used with, say, SparkSqlPerf (1 scale factor then 2 scale factor) and HiBench. Once this is complete and we

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-21 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15713 @a-roberts I'm interested in your thoughts on that last point which is a fair one. If it's as-good for small cases and better for large cases to just use ArrayBuffer, that's a great win. --- If you

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-07 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15713 So it seems like the benchmark is the worst case for the current implementation -- it's doing a big groupByKey to potentially add a lot of items to the array buffer. Do we have cases that only

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-07 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15713 The benchmark sounds fine. I think it would indeed be interesting to see how a simple ArrayBuffer works, then. --- If your project is set up for it, you can reply to this email and have your reply a

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-03 Thread a-roberts
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15713 The benchmark used is HiBench 5 which works with Spark 1.6, I expect there will be HiBench 6 officially available soon with [Spark 2 support](https://github.com/intel-hadoop/HiBench/pull/337)

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread rxin
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15713 What is this benchmark? Can you show the query and the explain plan, if it is in Spark SQL? The current change doesn't make a lot of sense. If the findings are correct, we should just use Arra

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread andrewcraik
Github user andrewcraik commented on the issue: https://github.com/apache/spark/pull/15713 I do agree we should save the extra object allocation, but again whether we should flatten the array into the class or remove the class and use ArrayBuffer directly is the question - do we want

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15713 @andrewcraik If we are simply delegating to ArrayBuffer, it makes sense to remove the class - it is an additional object created for each key being aggregated. Alternative would be to implement

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread andrewcraik
Github user andrewcraik commented on the issue: https://github.com/apache/spark/pull/15713 Having worked with @a-roberts to help develop this change I can see that the type itself may not be required any more with the change proposed, but keeping it around would allow us to distinguis

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15713 Good point, I'll first see if we can build and pass the unit tests by replacing CompactBuffer with ArrayBuffer[2] and we'd also repeat the profiling (this would verify if groupByKey is still hot f

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15713 I believe the benchmark, in which case it seems like this class could just be removed at this point, and replaced with an ArrayBuffer that sets an appropriately small initial size. Is that possible t

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-02 Thread a-roberts
Github user a-roberts commented on the issue: https://github.com/apache/spark/pull/15713 Thanks for the prompt feedback, we found this opportunity when profiling Spark 1.6.2 with HiBench large and again this showed up as a hot method with the PageRank benchmark, we can gather data to

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-01 Thread mridulm
Github user mridulm commented on the issue: https://github.com/apache/spark/pull/15713 +CC @rxin who can elaborate better. The reason iirc why the first two variables are inline is to do with usual low cardinality of the buffer - it effectively comes for "free" due to padding over

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15713 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature e

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15713 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67898/ Test FAILed. ---

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15713 **[Test build #67898 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67898/consoleFull)** for PR 15713 at commit [`cdf8b8d`](https://github.com/apache/spark/commit/

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-01 Thread srowen
Github user srowen commented on the issue: https://github.com/apache/spark/pull/15713 It leads to a bigger memory footprint because of the extra elements that may be allocated but unused in the array? and the performance win is probably avoiding those branches? I could believe it.

[GitHub] spark issue #15713: [SPARK-18196] [CORE] Optimise CompactBuffer implementati...

2016-11-01 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15713 **[Test build #67898 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67898/consoleFull)** for PR 15713 at commit [`cdf8b8d`](https://github.com/apache/spark/commit/c