Jeroen,

Anytime there is a shuffle across the network, Spark moves to a new stage. It sounds like you are having issues either pre- or post-shuffle. Have you looked at a resource monitoring tool like Ganglia to determine whether this is a memory- or thread-related issue? Or at the Spark UI?
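
To make the shuffle cost concrete, here is a rough sketch in plain Python (not Spark code; the function names are made up for illustration) of why combining values per key inside each partition sends fewer records over the network than shipping every raw record. It assumes the per-key operation is associative, like a sum.

```python
# Plain-Python simulation of shuffle volume; this is NOT Spark code.
# All function names here are hypothetical and exist only for illustration.

def shuffled_without_combine(partitions):
    """Every (key, value) record leaves its partition unchanged."""
    return [record for partition in partitions for record in partition]


def shuffled_with_combine(partitions, seq_op, zero):
    """Fold values per key inside each partition before anything is sent."""
    out = []
    for partition in partitions:
        local = {}
        for key, value in partition:
            local[key] = seq_op(local.get(key, zero), value)
        out.extend(local.items())
    return out


def merge_after_shuffle(records, comb_op):
    """Reduce side: merge the records that arrived for each key."""
    merged = {}
    for key, value in records:
        merged[key] = comb_op(merged[key], value) if key in merged else value
    return merged


# Two toy partitions with repeated keys (made-up data).
partitions = [
    [("a", 1), ("a", 2), ("b", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]
add = lambda x, y: x + y

raw = shuffled_without_combine(partitions)
combined = shuffled_with_combine(partitions, add, 0)

print(len(raw), len(combined))
print(merge_after_shuffle(raw, add) == merge_after_shuffle(combined, add))
```

With this toy data, 6 records cross the shuffle without a per-partition combine versus 4 with one, and both give the same per-key totals. Real savings depend on how often keys repeat within a partition.
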
You are using groupByKey(); have you thought of an alternative like aggregateByKey() or combineByKey() to reduce shuffling?
https://umbertogriffo.gitbooks.io/apache-spark-best-practices-and-tuning/content/avoid_groupbykey_when_performing_an_associative_re/avoid-groupbykey-when-performing-a-group-of-multiple-items-by-key.html

Dynamic allocation is great, but I’ve sometimes found that explicitly setting the number of executors, cores per executor, and memory per executor works better.

Take a look at the YARN logs as well for the particular executor in question. Executors can run multiple tasks and will often fail if they have more tasks than available threads.

As for partitioning the data, you could also look into your level of parallelism, which is correlated with the splittability (blocks) of the data. This will be based on your largest RDD.
https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism

Spark is like C/C++: you need to manage the memory buffer or the compiler will throw you out ;)
https://spark.apache.org/docs/latest/hardware-provisioning.html

Hang in there, this is the more complicated stage of placing a Spark application into production. The YARN logs should point you in the right direction.

It’s tough to debug over email, so hopefully this information is helpful.

-Pat

On 12/28/17, 9:57 AM, "Jeroen Miller" <bluedasya...@gmail.com> wrote:

On 28 Dec 2017, at 17:41, Richard Qiao <richardqiao2...@gmail.com> wrote:
> Are you able to specify which path of data filled up?

I can narrow it down to a bunch of files but it's not so straightforward.

> Any logs not rolled over?

I have to manually terminate the cluster but there is nothing more in the driver's log when I check it from the AWS console when the cluster is still running.
JM

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org