Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15713
Although I'd love to find that we can get rid of a custom collections class, it
seems we can't, and the optimization here doesn't result in a win. Let's close
this for now.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15713
@a-roberts feel free to continue investigating, but at some point it's also
important to ask the question: is this micro-optimization worth it? It only
improves a very specific workload by 4%, based on…
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
Performance results against the Spark master branch on a 48 core machine
running PageRank with 500k pages follow
**Vanilla CompactBuffer, no changes, run time and throughput (bytes per
second)**
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
In response to @rxin's question: for HiBench, CompactBuffers are **used only
in PageRank** (none of the other 11 workloads), and these buffers mainly hold
between 3 and 40 elements, no more than 60, never with…
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
I'll add a print to keep track of how big the CompactBuffer actually gets
when used with, say, SparkSqlPerf (scale factor 1, then scale factor 2) and
HiBench.
Once this is complete and we…
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15713
@a-roberts I'm interested in your thoughts on that last point, which is a
fair one. If it's as good for small cases and better for large cases to just
use ArrayBuffer, that's a great win.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15713
So it seems like the benchmark is the worst case for the current
implementation -- it's doing a big groupByKey to potentially add a lot of items
to the array buffer.
Do we have cases that only…
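A self-contained illustration (plain Scala, not Spark) of why rxin calls a big groupByKey the worst case: grouping materializes every value for a key into one in-memory collection, so a single skewed key drives the growth of one buffer. The key names and counts below are made up for the sketch; Spark's groupByKey appends into a per-key CompactBuffer in the same spirit.

```scala
object GroupSkewSketch extends App {
  // One hot key receives 9 of every 10 values, simulating skew.
  val pairs = (1 to 100000).map(i => (if (i % 10 == 0) "cold" else "hot", i))

  // Grouping collects all values per key into one collection each.
  val grouped: Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }

  println(grouped("hot").size)   // 90000 values collected into a single buffer
  println(grouped("cold").size)  // 10000
}
```

With skew like this, the per-key buffer for "hot" sees tens of thousands of appends, which is exactly the growth path a CompactBuffer-vs-ArrayBuffer benchmark stresses.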
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15713
The benchmark sounds fine. I think it would indeed be interesting to see
how a simple ArrayBuffer works, then.
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
The benchmark used is HiBench 5, which works with Spark 1.6; I expect
HiBench 6, with [Spark 2
support](https://github.com/intel-hadoop/HiBench/pull/337), to be officially
available soon.
Github user rxin commented on the issue:
https://github.com/apache/spark/pull/15713
What is this benchmark? Can you show the query and the explain plan, if it
is in Spark SQL?
The current change doesn't make a lot of sense. If the findings are
correct, we should just use ArrayBuffer…
Github user andrewcraik commented on the issue:
https://github.com/apache/spark/pull/15713
I do agree we should save the extra object allocation, but the open question
remains whether we should flatten the array into the class or remove the class
and use ArrayBuffer directly - do we want…
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15713
@andrewcraik If we are simply delegating to ArrayBuffer, it makes sense to
remove the class - it is an additional object created for each key being
aggregated.
An alternative would be to implement…
Github user andrewcraik commented on the issue:
https://github.com/apache/spark/pull/15713
Having worked with @a-roberts to help develop this change, I can see that
the type itself may no longer be required with the proposed change, but
keeping it around would allow us to distinguish…
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
Good point. I'll first see if we can build and pass the unit tests after
replacing CompactBuffer with ArrayBuffer[2], and we'd also repeat the profiling
(this would verify whether groupByKey is still hot)…
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15713
I believe the benchmark, in which case it seems like this class could just
be removed at this point and replaced with an ArrayBuffer that sets an
appropriately small initial size. Is that possible to…
Github user a-roberts commented on the issue:
https://github.com/apache/spark/pull/15713
Thanks for the prompt feedback. We found this opportunity when profiling
Spark 1.6.2 with HiBench large, where this again showed up as a hot method in
the PageRank benchmark; we can gather data to…
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15713
+CC @rxin, who can elaborate better.
The reason, IIRC, why the first two variables are inlined is the usually
low cardinality of the buffer - the inlining effectively comes for "free" due
to padding over…
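An illustrative sketch (not Spark's actual CompactBuffer source) of the design mridulm describes: the first two elements live in fields of the buffer object itself, so buffers holding at most two values never allocate a backing array. The class name `TinyBuffer`, its field names, and the spill-array growth policy are all hypothetical.

```scala
class TinyBuffer[T] {
  private var e0: T = _                    // inlined element 0 (no array needed)
  private var e1: T = _                    // inlined element 1
  private var rest: Array[AnyRef] = null   // spill array for elements 2..n-1
  private var n = 0

  def +=(t: T): this.type = {
    if (n == 0) e0 = t
    else if (n == 1) e1 = t
    else {
      // Allocate the spill array lazily, only on the third element.
      if (rest == null) rest = new Array[AnyRef](8)
      else if (n - 2 == rest.length)
        rest = java.util.Arrays.copyOf(rest, rest.length * 2)
      rest(n - 2) = t.asInstanceOf[AnyRef]
    }
    n += 1
    this
  }

  def apply(i: Int): T =
    if (i == 0) e0 else if (i == 1) e1 else rest(i - 2).asInstanceOf[T]

  def size: Int = n
}
```

This is the branch-vs-allocation trade the thread is weighing: small buffers skip an array allocation entirely, at the cost of the `if (i == 0) / if (i == 1)` checks on every access.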
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15713
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15713
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67898/
Test FAILed.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15713
**[Test build #67898 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67898/consoleFull)**
for PR 15713 at commit
[`cdf8b8d`](https://github.com/apache/spark/commit/
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/15713
It leads to a bigger memory footprint because of the extra elements that
may be allocated but unused in the array? And the performance win is probably
from avoiding those branches? I could believe it.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15713
**[Test build #67898 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67898/consoleFull)**
for PR 15713 at commit
[`cdf8b8d`](https://github.com/apache/spark/commit/c