Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Eric Walker
Hi Richard,

I've stepped away from this issue since I raised my question.  One detail
that wasn't known at the time: the application did not run out of disk
space every time spilling occurred, so the out-of-disk failure appears to
have been a one-off.  The main problem remained that the
spark.shuffle.spill setting seemed to be ignored, although that might have
been expected behavior given the skew in the data.

More generally, attempts to tune the application with settings such as
spark.python.worker.memory and spark.shuffle.memoryFraction had no
observable effect.  It is possible that the apparently ignored
spark.shuffle.spill setting was just one symptom of a larger
misconfiguration.
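
For reference, the settings were being supplied roughly along these lines
(a sketch with placeholder values, not the job's actual configuration):

    from pyspark import SparkConf, SparkContext

    # Sketch only: the application name and values below are placeholders,
    # not the actual job's configuration.
    conf = (SparkConf()
            .setAppName("shuffle-tuning-test")
            .set("spark.shuffle.spill", "false")
            .set("spark.shuffle.memoryFraction", "0.5")
            .set("spark.python.worker.memory", "2g"))

    sc = SparkContext(conf=conf)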

Eric


On Wed, Sep 9, 2015 at 4:48 PM, Richard Marscher wrote:

> Hi Eric,
>
> I just wanted to do a sanity check: do you know what paths it is trying to
> write to? I ask because even without spilling, shuffles always write to
> disk first before transferring data across the network. I ran into this
> myself at one point when we accidentally had /tmp mounted on a tiny disk
> and kept running out of disk space on shuffles, even though we also don't
> spill. You may have already considered or ruled this out, though.
>
> On Thu, Sep 3, 2015 at 12:56 PM, Eric Walker wrote:
>
>> Hi,
>>
>> I am using Spark 1.3.1 on EMR with lots of memory.  I have attempted to
>> run a large pyspark job several times, specifying
>> `spark.shuffle.spill=false` in different ways.  It seems that the setting
>> is ignored, at least partially, and some of the tasks start spilling large
>> amounts of data to disk.  The job has been fast enough in the past, but
>> once it starts spilling to disk it lands on Miller's planet [1].
>>
>> Is this expected behavior?  Is it a misconfiguration on my part, e.g.,
>> could there be an incompatible setting that is overriding
>> `spark.shuffle.spill=false`?  Is it something that goes back to Spark
>> 1.3.1?  Is it something that goes back to EMR?  When I've allowed the job
>> to continue on for a while, I've started to see Kryo stack traces in the
>> tasks that are spilling to disk.  The stack traces mention there not being
>> enough disk space, although a `df` shows plenty of space (perhaps after the
>> fact, when temporary files have been cleaned up).
>>
>> Has anyone run into something like this before?  I would be happy to see
>> OOM errors, because that would be consistent with one understanding of what
>> might be going on, but I haven't seen any yet.
>>
>> Eric
>>
>>
>> [1] https://www.youtube.com/watch?v=v7OVqXm7_Pk&safe=active
>>
>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com | Our Blog | Twitter | Facebook | LinkedIn
> 
>


Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Richard Marscher
Hi Eric,

I just wanted to do a sanity check: do you know what paths it is trying to
write to? I ask because even without spilling, shuffles always write to
disk first before transferring data across the network. I ran into this
myself at one point when we accidentally had /tmp mounted on a tiny disk
and kept running out of disk space on shuffles, even though we also don't
spill. You may have already considered or ruled this out, though.
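
If it does turn out to be the default scratch directory, one option is to
point spark.local.dir at a larger volume, something like the sketch below
(the path is a placeholder, and on YARN the node manager's local
directories generally take precedence over this setting):

    from pyspark import SparkConf, SparkContext

    # Sketch: put Spark's shuffle/scratch files on a larger volume instead
    # of the default /tmp.  "/mnt/spark-scratch" is a placeholder path; on
    # YARN (as on EMR) the node manager's local dirs usually win over this.
    conf = (SparkConf()
            .setAppName("shuffle-scratch-dir")
            .set("spark.local.dir", "/mnt/spark-scratch"))

    sc = SparkContext(conf=conf)
    print(conf.get("spark.local.dir"))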

On Thu, Sep 3, 2015 at 12:56 PM, Eric Walker  wrote:

> Hi,
>
> I am using Spark 1.3.1 on EMR with lots of memory.  I have attempted to
> run a large pyspark job several times, specifying
> `spark.shuffle.spill=false` in different ways.  It seems that the setting
> is ignored, at least partially, and some of the tasks start spilling large
> amounts of data to disk.  The job has been fast enough in the past, but
> once it starts spilling to disk it lands on Miller's planet [1].
>
> Is this expected behavior?  Is it a misconfiguration on my part, e.g.,
> could there be an incompatible setting that is overriding
> `spark.shuffle.spill=false`?  Is it something that goes back to Spark
> 1.3.1?  Is it something that goes back to EMR?  When I've allowed the job
> to continue on for a while, I've started to see Kryo stack traces in the
> tasks that are spilling to disk.  The stack traces mention there not being
> enough disk space, although a `df` shows plenty of space (perhaps after the
> fact, when temporary files have been cleaned up).
>
> Has anyone run into something like this before?  I would be happy to see
> OOM errors, because that would be consistent with one understanding of what
> might be going on, but I haven't seen any yet.
>
> Eric
>
>
> [1] https://www.youtube.com/watch?v=v7OVqXm7_Pk&safe=active
>



-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com | Our Blog | Twitter | Facebook | LinkedIn



spark.shuffle.spill=false ignored?

2015-09-03 Thread Eric Walker
Hi,

I am using Spark 1.3.1 on EMR with lots of memory.  I have attempted to run
a large pyspark job several times, specifying `spark.shuffle.spill=false`
in different ways.  It seems that the setting is ignored, at least
partially, and some of the tasks start spilling large amounts of data to
disk.  The job has been fast enough in the past, but once it starts
spilling to disk it lands on Miller's planet [1].
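
For concreteness, "in different ways" here means the usual routes for
supplying such a setting, roughly as in the sketch below (application and
file names are placeholders):

    # Sketch of the usual ways to supply the setting; names and paths are
    # placeholders.
    #
    #   On the spark-submit command line:
    #       spark-submit --conf spark.shuffle.spill=false my_job.py
    #
    #   In conf/spark-defaults.conf:
    #       spark.shuffle.spill    false
    #
    #   Programmatically, before the SparkContext is created:
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("my_job").set("spark.shuffle.spill", "false")
    sc = SparkContext(conf=conf)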

Is this expected behavior?  Is it a misconfiguration on my part, e.g.,
could there be an incompatible setting that is overriding
`spark.shuffle.spill=false`?  Is it something that goes back to Spark
1.3.1?  Is it something that goes back to EMR?  When I've allowed the job
to continue on for a while, I've started to see Kryo stack traces in the
tasks that are spilling to disk.  The stack traces mention there not being
enough disk space, although a `df` shows plenty of space (perhaps after the
fact, when temporary files have been cleaned up).
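
One check that might help rule out an overriding setting is to dump the
configuration the driver actually ended up with, e.g. as in this sketch
(sc._conf is a private PySpark attribute; the Environment tab of the Spark
UI shows the same information):

    from pyspark import SparkConf, SparkContext

    # Sketch: list the shuffle- and Python-related settings the driver is
    # actually running with, to spot a value being silently overridden
    # elsewhere (spark-defaults.conf, cluster bootstrap scripts, etc.).
    sc = SparkContext(conf=SparkConf().setAppName("conf-check"))
    for key, value in sorted(sc._conf.getAll()):
        if key.startswith("spark.shuffle") or key.startswith("spark.python"):
            print("%s = %s" % (key, value))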

Has anyone run into something like this before?  I would be happy to see
OOM errors, because that would be consistent with one understanding of what
might be going on, but I haven't seen any yet.

Eric


[1] https://www.youtube.com/watch?v=v7OVqXm7_Pk&safe=active