Hi Richard,

I've stepped away from this issue since I raised my question.  One
additional detail, unknown at the time, is that the application did not
run out of disk space every time it spilled to disk; that appears to
have been a one-off problem.  The main challenge was that the
spark.shuffle.spill setting seemed to be ignored.  That might have been
the expected behavior, given the skew in the data.

More generally, attempts to tweak the application's behavior through
settings such as spark.python.worker.memory and
spark.shuffle.memoryFraction had no observable effect.  It is possible
that the apparently ignored spark.shuffle.spill setting was just one
symptom of a larger misconfiguration.
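
For reference, here is roughly how settings like these can be passed to
a PySpark job (a sketch only; the memory values and app name below are
placeholders, not the ones from my job):

    from pyspark import SparkConf, SparkContext

    # Placeholder values; the conf has to be in place before the
    # SparkContext is created.
    conf = (SparkConf()
            .setAppName("example-job")
            .set("spark.shuffle.spill", "false")
            .set("spark.python.worker.memory", "2g")
            .set("spark.shuffle.memoryFraction", "0.4"))
    sc = SparkContext(conf=conf)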

Eric


On Wed, Sep 9, 2015 at 4:48 PM, Richard Marscher <rmarsc...@localytics.com>
wrote:

> Hi Eric,
>
> I just wanted to do a sanity check: do you know which paths it is trying
> to write to?  I ask because even without spilling, shuffles always write
> to local disk first before transferring data across the network.  I ran
> into this myself at one point when we accidentally had /tmp mounted on a
> tiny disk and kept running out of disk on shuffles, even though we don't
> spill either.  You may have already considered or ruled this out, though.
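>
> If it helps, a rough sketch of how to point Spark's scratch space at a
> larger volume (the paths below are just examples; note that on YARN,
> which EMR uses, the NodeManager's local directories generally take
> precedence over spark.local.dir):
>
>     from pyspark import SparkConf, SparkContext
>
>     # Example paths only: send Spark's scratch/shuffle files to larger
>     # volumes instead of the default (often /tmp).
>     conf = SparkConf().set("spark.local.dir", "/mnt/spark,/mnt1/spark")
>     sc = SparkContext(conf=conf)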
>
> On Thu, Sep 3, 2015 at 12:56 PM, Eric Walker <eric.wal...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am using Spark 1.3.1 on EMR with lots of memory.  I have attempted to
>> run a large pyspark job several times, specifying
>> `spark.shuffle.spill=false` in different ways.  It seems that the setting
>> is ignored, at least partially, and some of the tasks start spilling large
>> amounts of data to disk.  The job has been fast enough in the past, but
>> once it starts spilling to disk it lands on Miller's planet [1].
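>>
>> To be concrete, these are the kinds of approaches I mean (a sketch; the
>> script and app names here are placeholders):
>>
>>     # 1) On the command line:
>>     #      spark-submit --conf spark.shuffle.spill=false my_job.py
>>     # 2) Programmatically, before the SparkContext is created:
>>     from pyspark import SparkConf, SparkContext
>>     conf = SparkConf().setAppName("example").set("spark.shuffle.spill", "false")
>>     sc = SparkContext(conf=conf)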
>>
>> Is this expected behavior?  Is it a misconfiguration on my part, e.g.,
>> could there be an incompatible setting that is overriding
>> `spark.shuffle.spill=false`?  Is it something that goes back to Spark
>> 1.3.1?  Is it something that goes back to EMR?  When I've allowed the job
>> to continue for a while, I've started to see Kryo stack traces in the
>> tasks that are spilling to disk.  The stack traces complain about
>> insufficient disk space, even though `df` shows plenty of space (perhaps
>> because I'm only checking after the fact, once temporary files have been
>> cleaned up).
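>>
>> In case the setting is being silently overridden somewhere, one check
>> would be to read the effective configuration back from the running
>> context (a sketch, assuming an existing SparkContext `sc`):
>>
>>     # Print what the driver's configuration actually contains.
>>     print(sc.getConf().get("spark.shuffle.spill"))               # expecting 'false'
>>     print(sc.getConf().get("spark.shuffle.memoryFraction", "unset"))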
>>
>> Has anyone run into something like this before?  I would be happy to see
>> OOM errors, because those would be consistent with one understanding of
>> what might be going on, but I haven't seen any yet.
>>
>> Eric
>>
>>
>> [1] https://www.youtube.com/watch?v=v7OVqXm7_Pk&safe=active
>>
>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>
