running 2 spark applications in parallel on yarn

2015-02-01 Thread Tomer Benyamini
Hi all,

I'm running Spark 1.2.0 on a 20-node YARN cluster on EMR. I've noticed that
whenever I run a heavy computation job in parallel with other jobs,
I get these kinds of exceptions:

* [task-result-getter-2] INFO  org.apache.spark.scheduler.TaskSetManager-
Lost task 820.0 in stage 175.0 (TID 11327) on executor xxx:
java.io.IOException (Failed to connect to xx:35194) [duplicate 12]

* org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
location for shuffle 12

* org.apache.spark.shuffle.FetchFailedException: Failed to connect to
x:35194
Caused by: java.io.IOException: Failed to connect to x:35194

When running the heavy job alone on the cluster, I don't get any
errors. My guess is that Spark contexts from different apps do not share
information about ports already in use, and therefore collide on specific
ports, causing the job/stage to fail. Is there a way to assign a specific set
of executors to a specific Spark job via spark-submit, or is there a way to
define a range of ports to be used by the application?

Thanks!
Tomer


Re: running 2 spark applications in parallel on yarn

2015-02-01 Thread Sandy Ryza
Hi Tomer,

Are you able to look in your NodeManager logs to see if the NodeManagers
are killing any executors for exceeding memory limits?  If you observe
this, you can solve the problem by bumping up
spark.yarn.executor.memoryOverhead.

-Sandy
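
For anyone trying the suggestion above, here is a minimal sketch of how the
setting could be passed on the command line. It only builds the spark-submit
argument list; it does not launch anything. The jar name, main class, and
memory sizes are hypothetical placeholders, and the helper names are made up
for illustration. It assumes the overhead value is given in megabytes and
that YARN sizes each executor container as heap plus overhead, before any
rounding by the scheduler.

```python
def container_request_mb(executor_memory_mb, memory_overhead_mb):
    """Memory requested per executor container: JVM heap plus off-heap overhead.

    Assumption: YARN may round this up to its own allocation increment.
    """
    return executor_memory_mb + memory_overhead_mb


def build_submit_args(app_jar, main_class, executor_memory_mb, memory_overhead_mb):
    """Build a spark-submit argument list with an explicit memory overhead.

    app_jar and main_class are hypothetical placeholders for your own job.
    """
    return [
        "spark-submit",
        "--master", "yarn-cluster",
        "--class", main_class,
        "--executor-memory", "%dm" % executor_memory_mb,
        # The setting named in the reply above, in megabytes.
        "--conf", "spark.yarn.executor.memoryOverhead=%d" % memory_overhead_mb,
        app_jar,
    ]


if __name__ == "__main__":
    # Hypothetical sizes: 8 GB heap with 1 GB of overhead per executor.
    args = build_submit_args("heavy-job.jar", "com.example.HeavyJob", 8192, 1024)
    print(" ".join(args))
    print("container request (MB):", container_request_mb(8192, 1024))
```

If the NodeManager logs do show containers being killed for running beyond
their memory limits, raising the overhead value while keeping the heap size
fixed is one way to test whether that is the cause.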
