Hey Corey,
Yes, when shuffles are smaller than available memory to the OS, most
often the outputs never get stored to disk. I believe this holds same
for the YARN shuffle service, because the write path is actually the
same, i.e. we don't fsync the writes and force them to disk. I would
guess in
I would guess in such shuffles the bottleneck is serializing the data
rather than raw IO, so I'm not sure explicitly buffering the data in the
JVM process would yield a large improvement.
Good guess! It is very hard to beat the performance of retrieving shuffle
outputs from the OS buffer
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
There's a discussion of this at https://github.com/apache/spark/pull/5403
On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote:
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
In many cases the shuffle will actually hit the OS buffer cache and
not ever touch spinning disk if it is a size that is less than memory
on the machine.
- Patrick
On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote:
So with this... to help my understanding of Spark under the
So with this... to help my understanding of Spark under the hood-
Is this statement correct When data needs to pass between multiple JVMs, a
shuffle will *always* hit disk?
On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote:
There's a discussion of this at
Ok so it is the case that small shuffles can be done without hitting any
disk. Is this the same case for the aux shuffle service in yarn? Can that
be done without hitting disk?
On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote:
In many cases the shuffle will actually hit