In PySpark, the data processed by each reduce task needs to fit in memory 
within the Python process, so you should use more tasks to process this 
dataset. Data is spilled to disk across tasks, but not within a single 
task, so each task's slice of the data still has to fit in memory.
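
For example (just a sketch, with rdd1 and rdd2 standing in for your two 
RDDs), passing a larger numPartitions to join() spreads the reduce work 
over more tasks, so each task only holds a smaller slice of the data:

    # rdd1 / rdd2 are placeholders for the two RDDs being joined.
    # Asking join() for more output partitions means each reduce task
    # only has to hold a smaller piece of the data in the Python worker.
    joined = rdd1.join(rdd2, numPartitions=2000)

The same numPartitions argument is accepted by groupByKey, reduceByKey, 
and the other shuffle operations.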

I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this — 
it’s something we’ve been meaning to look at soon.

Matei

On Jun 4, 2014, at 8:23 AM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:

> Hi All,
> 
> I have experienced some crashing behavior with join in pyspark.  When I 
> attempt a join with 2000 partitions in the result, the join succeeds, but 
> when I use only 200 partitions in the result, the join fails with the message 
> "Job aborted due to stage failure: Master removed our application: FAILED".
> 
> The crash always occurs at the beginning of the shuffle phase.  Based on my 
> observations, it seems like the workers in the read phase may be fetching 
> entire blocks from the write phase of the shuffle rather than just the 
> records necessary to compose the partition the reader is responsible for.  
> Hence, when there are fewer partitions in the read phase, the worker is 
> likely to need a record from each of the write partitions and consequently 
> attempts to load the entire data set into the memory of a single machine 
> (which then causes the out of memory crash I observe in /var/log/syslog).
> 
> Can anybody confirm if this is the behavior of pyspark?  I am glad to supply 
> additional details about my observed behavior upon request.
> 
> best,
> -Brad
