Re: Eliminating shuffle write and spill disk IO reads/writes in Spark

Reynold Xin Fri, 01 Apr 2016 13:58:31 -0700

spark.shuffle.spill actually has nothing to do with whether we write
shuffle files to disk. Currently there is no way to avoid writing shuffle
files to disk, and typically this is not a problem because the network fetch
throughput is lower than what disks can sustain. In most cases, especially
with SSDs, there is little difference between keeping the shuffle files in
memory and keeping them on disk.


However, it is becoming more common to run Spark on a small number of beefy
nodes (e.g. 2 nodes each with 1TB of RAM). We do want to look into
improving performance for those. In the meantime, you can set up local
ramdisks on each node for shuffle writes.
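
A rough sketch of the ramdisk approach, assuming a tmpfs is already mounted
on every node at /mnt/ramdisk (a hypothetical path; use whatever mount point
you have). spark.local.dir is the setting that controls where Spark puts
scratch data, including shuffle files. The helper name and the job file name
below are made up for illustration; the helper only builds the spark-submit
command line and does not launch anything.

```python
import shlex


def ramdisk_submit_args(ramdisk_dirs, app, extra_conf=None):
    """Build a spark-submit argument list that points Spark's scratch
    space (where shuffle files are written) at the given directories.

    ramdisk_dirs: directories assumed to be tmpfs/ramdisk mounts.
    app: path to the application to submit (hypothetical here).
    extra_conf: optional dict of additional Spark settings.
    """
    # spark.local.dir accepts a comma-separated list of directories.
    conf = {"spark.local.dir": ",".join(ramdisk_dirs)}
    conf.update(extra_conf or {})

    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # /mnt/ramdisk and my_job.py are placeholders, not real paths.
    print(shlex.join(ramdisk_submit_args(["/mnt/ramdisk"], "my_job.py")))
```

Note that some cluster managers override spark.local.dir: standalone mode
uses the SPARK_LOCAL_DIRS environment variable and YARN uses the node
manager's local directories, so in those setups the ramdisk path has to be
configured there instead.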



On Fri, Apr 1, 2016 at 11:32 AM, Michael Slavitch <slavi...@gmail.com>
wrote:

> Hello;
>
> I’m working on spark with very large memory systems (2TB+) and notice that
> Spark spills to disk in shuffle.  Is there a way to force spark to stay in
> memory when doing shuffle operations?   The goal is to keep the shuffle
> data either in the heap or in off-heap memory (in 1.6.x) and never touch
> the IO subsystem.  I am willing to have the job fail if it runs out of RAM.
>
> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
> Tungsten sort in 1.5.x
>
> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this
> is ignored by the tungsten-sort shuffle manager; its optimized shuffles
> will continue to spill to disk when necessary."
>
> If this is impossible via configuration changes what code changes would be
> needed to accomplish this?
>
