Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Interestingly, if nothing is running on the dev spark-shell, it recovers
successfully and regains the lost executors. Attaching the log for that.
Notice the "Registering block manager ..." statements at the very end, after
all executors were lost.

On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Attaching the log for when the dev job gets stuck (once all of its executors
 are lost due to preemption). This is a spark-shell job running in yarn-client
 mode.

 On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 We've set up our Spark cluster on AWS, running on YARN (Hadoop 2.3) with
 fair scheduling and preemption turned on. The cluster is shared between prod
 and dev work, where prod runs with a higher fair share and can preempt dev
 jobs if there are not enough resources available for it (roughly the queue
 setup sketched below).
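
 For context, a minimal fair-scheduler setup of this shape looks something
 like the following; the queue names, weights, and timeout are illustrative
 placeholders, not the exact values we use:

   yarn-site.xml (turn on preemption in the fair scheduler):

     <property>
       <name>yarn.scheduler.fair.preemption</name>
       <value>true</value>
     </property>

   fair-scheduler.xml (prod weighted higher than dev):

     <allocations>
       <queue name="prod">
         <weight>3.0</weight>
       </queue>
       <queue name="dev">
         <weight>1.0</weight>
       </queue>
       <!-- seconds a queue may sit below its fair share before the
            scheduler preempts containers from other queues -->
       <fairSharePreemptionTimeout>120</fairSharePreemptionTimeout>
     </allocations>
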
 It appears that dev jobs which get preempted often become unstable after
 losing some executors: the whole job either gets stuck (making no further
 progress) or ends up getting restarted (and hence loses all the work done).
 Has anyone encountered this before? Is the solution just to set
 spark.task.maxFailures to a really high value so the job can recover from
 task failures in such scenarios (for example, as sketched below)? Are there
 other approaches people have taken to Spark multi-tenancy that work better
 in this scenario?
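
 To be concrete, what I have in mind for launching the dev shell is something
 like the sketch below; the queue name and the numbers are only illustrative,
 and I'm not sure yet which of these knobs actually matters here:

   # spark.task.maxFailures defaults to 4; raise it so tasks killed by
   # preemption don't immediately fail the whole stage.
   # spark.yarn.max.executor.failures controls how many lost executors the
   # application tolerates before the whole app is failed.
   spark-shell --master yarn-client \
     --queue dev \
     --conf spark.task.maxFailures=16 \
     --conf spark.yarn.max.executor.failures=64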

 Thanks,
 Sadhan





spark_job_recovers.log
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark cluster multi tenancy

2015-08-26 Thread Sadhan Sood
Attaching the log for when the dev job gets stuck (once all of its executors
are lost due to preemption). This is a spark-shell job running in yarn-client
mode.

On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Hi All,

 We've set up our Spark cluster on AWS, running on YARN (Hadoop 2.3) with
 fair scheduling and preemption turned on. The cluster is shared between prod
 and dev work, where prod runs with a higher fair share and can preempt dev
 jobs if there are not enough resources available for it.
 It appears that dev jobs which get preempted often become unstable after
 losing some executors: the whole job either gets stuck (making no further
 progress) or ends up getting restarted (and hence loses all the work done).
 Has anyone encountered this before? Is the solution just to set
 spark.task.maxFailures to a really high value so the job can recover from
 task failures in such scenarios? Are there other approaches people have
 taken to Spark multi-tenancy that work better in this scenario?

 Thanks,
 Sadhan



spark_job_stuck.log
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark cluster multi tenancy

2015-08-26 Thread Jerrick Hoang
I'd be interested to know the answer too.

On Wed, Aug 26, 2015 at 11:45 AM, Sadhan Sood sadhan.s...@gmail.com wrote:

 Interestingly, if nothing is running on the dev spark-shell, it recovers
 successfully and regains the lost executors. Attaching the log for that.
 Notice the "Registering block manager ..." statements at the very end, after
 all executors were lost.

 On Wed, Aug 26, 2015 at 11:27 AM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Attaching the log for when the dev job gets stuck (once all of its executors
 are lost due to preemption). This is a spark-shell job running in yarn-client
 mode.

 On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood sadhan.s...@gmail.com
 wrote:

 Hi All,

 We've set up our Spark cluster on AWS, running on YARN (Hadoop 2.3) with
 fair scheduling and preemption turned on. The cluster is shared between prod
 and dev work, where prod runs with a higher fair share and can preempt dev
 jobs if there are not enough resources available for it.
 It appears that dev jobs which get preempted often become unstable after
 losing some executors: the whole job either gets stuck (making no further
 progress) or ends up getting restarted (and hence loses all the work done).
 Has anyone encountered this before? Is the solution just to set
 spark.task.maxFailures to a really high value so the job can recover from
 task failures in such scenarios? Are there other approaches people have
 taken to Spark multi-tenancy that work better in this scenario?

 Thanks,
 Sadhan





 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org