The AMPLab Spark internals talk you mentioned actually refers to RDD
persistence levels, where by default we do not persist RDDs to disk (
https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#rdd-persistence
).
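
For example, assuming an existing SparkContext named sc (as in the Spark
shell), a minimal sketch of the two levels:

  import org.apache.spark.storage.StorageLevel

  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): no disk.
  val inMemory = sc.parallelize(1 to 100).cache()
  // A disk-backed level has to be requested explicitly:
  val onDiskToo = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_AND_DISK)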

"spark.shuffle.spill" refers to a different behavior -- if the "reduce"
phase of your shuffle would otherwise cause Spark to OOM, it will instead
write data to temporary files on disk. You probably don't want to disable
this unless you'd prefer to tune Spark to make sure the reduce can stay in
memory.
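
If you do decide to flip it, it is just a configuration setting. A minimal
sketch (the app name is illustrative; if you disable spilling you will
likely also want to raise spark.shuffle.memoryFraction so the reduce side
has more room):

  import org.apache.spark.{SparkConf, SparkContext}

  // With spilling off, reduce-side aggregation must fit in memory, or the
  // job dies with an OOM. Raising spark.shuffle.memoryFraction gives the
  // shuffle a larger share of the heap (the default is 0.3).
  val conf = new SparkConf()
    .setAppName("NoSpillExample")
    .set("spark.shuffle.spill", "false")
    .set("spark.shuffle.memoryFraction", "0.5")
  val sc = new SparkContext(conf)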

Note that if your goal is to force Spark never to use disk, there is a
complication: shuffles always write data to disk, in a way analogous to
the shuffle between the map and reduce phases of MapReduce. You would have
to point Spark's local directory (spark.local.dir) at a ramdisk.
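
A minimal sketch of that (the path is an assumption: /dev/shm is a tmpfs
mount on most Linux systems, but any RAM-backed mount would work):

  import org.apache.spark.SparkConf

  // Point Spark's scratch space at a tmpfs mount so shuffle files stay
  // in RAM. spark.local.dir also accepts a comma-separated list of
  // directories.
  val conf = new SparkConf()
    .set("spark.local.dir", "/dev/shm/spark-local")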


On Wed, Mar 12, 2014 at 6:22 PM, Fabrizio Milo aka misto <mistob...@gmail.com> wrote:

> Hello everyone
>
> I have a question about shuffle spills. The introduction to AMPLab
> Spark internals mentions that each task's output can be saved to disk
> for 'redundancy'.
>
> If I set spark.shuffle.spill to false, would this behavior be
> eliminated so that Spark never spills to disk?
>
> Thank you
>
> --
> LinkedIn: http://linkedin.com/in/fmilo
> Twitter: @fabmilo
> Github: http://github.com/Mistobaan/
> -----------------------
> Simplicity, consistency, and repetition - that's how you get through.
> (Jack Welch)
> Perfection must be reached by degrees; she requires the slow hand of
> time (Voltaire)
> The best way to predict the future is to invent it (Alan Kay)
>
