
Any time there is a shuffle across the network, Spark moves to a new stage. It 
seems like you are having issues either before or after the shuffle. Have you 
looked at a resource monitoring tool like Ganglia to determine whether this is 
a memory- or thread-related issue? Or at the Spark UI?

You are using groupByKey(). Have you thought of an alternative like 
aggregateByKey() or combineByKey() to reduce shuffling?
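
To make the difference concrete, here is a rough sketch in plain Python (not 
Spark itself) that models how many records cross the shuffle in each case. The 
function names and sample data are made up for illustration; the per-partition 
pre-combine is what aggregateByKey() and combineByKey() do on the map side.

```python
from collections import defaultdict


def shuffle_group_by_key(partitions):
    """Model of groupByKey(): every (key, value) record crosses the shuffle."""
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    sums = {key: sum(values) for key, values in grouped.items()}
    return sums, len(shuffled)


def shuffle_aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Model of aggregateByKey(): combine inside each partition, then merge."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            # seq_op folds a value into the partition-local accumulator.
            local[key] = seq_op(local.get(key, zero), value)
        # Only one record per key per partition crosses the shuffle.
        shuffled.extend(local.items())
    merged = {}
    for key, acc in shuffled:
        # comb_op merges accumulators coming from different partitions.
        merged[key] = comb_op(merged[key], acc) if key in merged else acc
    return merged, len(shuffled)


def add(x, y):
    return x + y


# Hypothetical input: two partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped_sums, grouped_records = shuffle_group_by_key(partitions)
agg_sums, agg_records = shuffle_aggregate_by_key(partitions, 0, add, add)

# Same per-key sums, but fewer records shuffled with the pre-combine.
print(grouped_sums == agg_sums, grouped_records, agg_records)
```

In PySpark the second variant would correspond to something like 
rdd.aggregateByKey(0, add, add). The saving grows with the number of repeated 
keys per partition; if nearly every key is unique there is little to gain.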

Dynamic allocation is great, but sometimes I’ve found explicitly setting the 
number of executors, cores per executor, and memory per executor to be a better 
option.

Take a look at the YARN logs as well for the particular executor in question. 
Executors can run multiple tasks, and will often fail if they have more tasks 
than available threads.
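
If you do go the route of setting the number of executors, cores, and memory 
explicitly, the arithmetic can be sketched as below. This is only a rule of 
thumb, not an official formula: the reserved core and memory per node, the 4 
cores per executor, the 10% memory overhead, and the one slot left for the YARN 
application master are all assumptions you should adjust, and the node sizes in 
the example are made up.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=4, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.10):
    """Rule-of-thumb executor sizing for a YARN cluster (assumed heuristic)."""
    # Leave some cores and memory on each node for the OS and daemons.
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores per executor")

    # Each YARN container holds the executor heap plus off-heap overhead,
    # where overhead is modelled as a fraction of the heap.
    container_gb = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    heap_gb = int(container_gb / (1 + overhead_fraction))

    # Assumption: keep one executor-sized slot free for the application master.
    num_executors = nodes * executors_per_node - 1

    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


def to_spark_submit_flags(sizing):
    """Render the sizing as spark-submit flags with dynamic allocation off."""
    return (
        "--conf spark.dynamicAllocation.enabled=false "
        f"--num-executors {sizing['num_executors']} "
        f"--executor-cores {sizing['executor_cores']} "
        f"--executor-memory {sizing['executor_memory_gb']}g"
    )


# Hypothetical cluster: 4 worker nodes with 16 cores and 64 GB each.
sizing = size_executors(nodes=4, cores_per_node=16, mem_gb_per_node=64)
flags = to_spark_submit_flags(sizing)
print(flags)
```

On EMR the memory YARN can actually hand out per node is lower than the 
instance's physical memory, so check the YARN configuration before trusting the 
numbers.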

As for partitioning the data, you could also look into your level of 
parallelism, which is correlated to the splittability (blocks) of the data. 
This will be based on your largest RDD.
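
As a back-of-the-envelope check on the level of parallelism, you can estimate 
how many partitions the input splits into and how many "waves" of tasks that 
means for your cluster. This is a sketch under assumptions: the 128 MB split 
size is a common default block size, not necessarily what your input format or 
S3 setup uses, and the helper names and cluster numbers are hypothetical.

```python
import math


def estimate_partitions(input_bytes, split_bytes=128 * 1024 * 1024):
    """Approximate number of input partitions for splittable data."""
    return max(1, math.ceil(input_bytes / split_bytes))


def task_waves(num_partitions, num_executors, cores_per_executor):
    """How many rounds of tasks are needed to process all partitions."""
    slots = num_executors * cores_per_executor  # tasks that can run at once
    return math.ceil(num_partitions / slots)


# Hypothetical job: 10 GiB of input on 11 executors with 4 cores each.
partitions = estimate_partitions(10 * 1024 ** 3)
waves = task_waves(partitions, num_executors=11, cores_per_executor=4)
print(partitions, waves)
```

If the estimate comes out far below the number of available task slots, or 
each partition is very large, repartitioning the largest RDD is worth a try.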

Spark is like C/C++: you need to manage the memory buffer or the compiler will 
throw you out ;)

Hang in there, this is the more complicated stage of placing a Spark 
application into production. The YARN logs should point you in the right 
direction.

It’s tough to debug over email, so hopefully this information is helpful.


On 12/28/17, 9:57 AM, "Jeroen Miller" <> wrote:

    On 28 Dec 2017, at 17:41, Richard Qiao <> wrote:
    > Are you able to specify which path of data filled up?
    I can narrow it down to a bunch of files but it's not so straightforward.
    > Any logs not rolled over?
    I have to manually terminate the cluster but there is nothing more in the 
    driver's log when I check it from the AWS console when the cluster is still