Re: Fully in-memory shuffles

2015-06-11 Thread Patrick Wendell
Hey Corey, Yes, when shuffles are smaller than available memory to the OS, most often the outputs never get stored to disk. I believe this holds same for the YARN shuffle service, because the write path is actually the same, i.e. we don't fsync the writes and force them to disk. I would guess in

Re: Fully in-memory shuffles

2015-06-11 Thread Mark Hamstra
I would guess in such shuffles the bottleneck is serializing the data rather than raw IO, so I'm not sure explicitly buffering the data in the JVM process would yield a large improvement. Good guess! It is very hard to beat the performance of retrieving shuffle outputs from the OS buffer

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote: So with this... to help my understanding of Spark under the

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
So with this... to help my understanding of Spark under the hood- Is this statement correct When data needs to pass between multiple JVMs, a shuffle will *always* hit disk? On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen rosenvi...@gmail.com wrote: There's a discussion of this at

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Ok so it is the case that small shuffles can be done without hitting any disk. Is this the same case for the aux shuffle service in yarn? Can that be done without hitting disk? On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell pwend...@gmail.com wrote: In many cases the shuffle will actually hit