Jeroen,

Anytime there is a shuffle across the network, Spark moves to a new stage. It 
seems like you are having issues either pre- or post-shuffle. Have you looked 
at a resource monitoring tool like Ganglia to determine whether this is a 
memory- or thread-related issue? Or at the Spark UI?

You are using groupByKey(); have you considered an alternative like 
aggregateByKey() or combineByKey() to reduce shuffling?
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/avoid-groupbykey-when-performing-a-group-of-multiple-items-by-key.html
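
For example, here is a minimal sketch of swapping groupByKey() for 
aggregateByKey() when collecting values per key (the RDD, key type, and 
values are made up for illustration, and it assumes an existing 
SparkContext sc):

    import org.apache.spark.rdd.RDD

    val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

    // groupByKey() ships every single value across the network:
    //   val grouped = pairs.groupByKey()

    // aggregateByKey() pre-combines values on each partition before the
    // shuffle, so far less data crosses the network:
    val grouped = pairs.aggregateByKey(List.empty[Int])(
      (acc, v)     => v :: acc,         // fold a value into the per-partition list
      (acc1, acc2) => acc1 ::: acc2     // merge lists from different partitions
    )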

Dynamic allocation is great, but sometimes I’ve found explicitly setting the 
number of executors, cores per executor, and memory per executor to be a 
better alternative.
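
For instance, something like this on YARN (the numbers are placeholders, 
not recommendations; size them to your nodes):

    spark-submit \
      --master yarn \
      --conf spark.dynamicAllocation.enabled=false \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8g \
      your-app.jar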

Take a look at the YARN logs as well for the particular executor in question. 
Executors can run multiple tasks, and will often fail if they have more tasks 
than available threads.
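
Assuming log aggregation is enabled on your cluster, something like this 
should pull them up (the application id is a placeholder):

    yarn logs -applicationId <applicationId>

Grepping the output for strings like "ExecutorLostFailure" or "Container 
killed" often narrows down the failing executor quickly.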

As for partitioning the data: you could also look into your level of 
parallelism, which is correlated with the splittability (blocks) of your 
data. This will be based on your largest RDD.
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
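
As a rough sketch (the numbers are illustrative; the tuning guide linked 
above suggests roughly 2-3 tasks per CPU core in your cluster, and someRdd 
stands in for whichever RDD is under-partitioned):

    import org.apache.spark.SparkConf

    // Cluster-wide default partition count for shuffle operations:
    val conf = new SparkConf().set("spark.default.parallelism", "120")

    // Or widen one specific RDD that ended up with too few partitions:
    val widened = someRdd.repartition(120)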

Spark is like C/C++: you need to manage the memory buffer or it will throw 
you out  ;)
https://spark.apache.org/docs/latest/hardware-provisioning.html
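
On YARN specifically, if containers are being killed for exceeding memory 
limits, one knob worth knowing is the per-executor off-heap headroom. A 
hedged example (the property name is for Spark 2.x on YARN, and the value, 
in MB, is a placeholder):

    spark-submit ... \
      --conf spark.yarn.executor.memoryOverhead=1024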

Hang in there; this is the most complicated stage of putting a Spark 
application into production. The YARN logs should point you in the right 
direction.

It’s tough to debug over email, so hopefully this information is helpful.

-Pat


On 12/28/17, 9:57 AM, "Jeroen Miller" <bluedasya...@gmail.com> wrote:

    On 28 Dec 2017, at 17:41, Richard Qiao <richardqiao2...@gmail.com> wrote:
    > Are you able to specify which path of data filled up?
    
    I can narrow it down to a bunch of files but it's not so straightforward.
    
    > Any logs not rolled over?
    
    I have to terminate the cluster manually, but there is nothing more in 
    the driver's log when I check it from the AWS console while the cluster 
    is still running.
    
    JM
    
    
    
    

