Re: trying to understand job cancellation

2014-03-19 Thread Koert Kuipers
On Spark 1.0.0-SNAPSHOT this seems to work; at least so far I have seen no issues.



Re: trying to understand job cancellation

2014-03-06 Thread Koert Kuipers
It's the 0.9 snapshot from January, running in standalone mode.

Have these fixes been merged into 0.9?



Re: trying to understand job cancellation

2014-03-05 Thread Matei Zaharia
Which version of Spark is this in, Koert? There might have been some fixes more 
recently for it.

Matei


Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
Sorry, I meant to say: it seems the issue is shared RDDs between a job that got
cancelled and a later job.

However, even disregarding that, I have the other issue that the active task
of the cancelled job hangs around forever, not doing anything.

Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
Yes, jobs on RDDs that were not part of the cancelled job work fine.

So it seems the issue is the cached RDDs that are shared between the
cancelled job and the jobs after that.
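
For reference, a minimal sketch of that scenario, assuming a local SparkContext; the object name, group ids and toy workload are illustrative, and the sleeps only exist to keep the first job running long enough to cancel it mid-flight:

import java.util.UUID
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical repro sketch: a cached RDD shared between a cancelled job group and a later one.
object SharedCachedRddAfterCancel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shared-rdd-cancel").setMaster("local[2]"))

    // One cached RDD used by both the job that gets cancelled and the job that follows it.
    val shared = sc.parallelize(1 to 20000, 8).map { i => Thread.sleep(1); i }.cache()

    val groupA = UUID.randomUUID().toString
    val first = Future {
      // Job group properties are per-thread, so set the group on the thread that runs the action.
      sc.setJobGroup(groupA, "first job, to be cancelled")
      shared.count()
    }

    Thread.sleep(2000)            // let a few tasks of the first job start
    sc.cancelJobGroup(groupA)     // cancel the first job mid-flight
    Await.ready(first, 2.minutes) // the future fails with a SparkException once the job is killed

    // A later job, under a different group id, reusing the same cached RDD.
    sc.setJobGroup(UUID.randomUUID().toString, "second job on the shared cached RDD")
    println(shared.map(_ * 2).count())

    sc.stop()
  }
}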



Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
Well, the new jobs use existing RDDs that were also used in the job that
got killed.

Let me confirm that new jobs that use completely different RDDs do not get
killed.




Re: trying to understand job cancellation

2014-03-05 Thread Mayur Rustagi
Quite unlikely, as job IDs are assigned incrementally, so your future job IDs
are not likely to be killed if your group ID is not repeated. I guess
the issue is something else.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 





Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
I did that. My next job gets a random new job group id (a UUID). However,
that doesn't seem to stop the job from getting sucked into the cancellation.




Re: trying to understand job cancellation

2014-03-05 Thread Mayur Rustagi
You can randomize job groups as well, to secure yourself against
termination.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 
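
A small sketch of that idea, assuming jobs are submitted from the driver one at a time; the helper and its name are made up here, and it simply tags each submission with a fresh random group id and hands the id back so that one job can still be cancelled later:

import java.util.UUID
import org.apache.spark.SparkContext

// Hypothetical helper: run a block of Spark actions under a fresh, random job group id.
object JobGroups {
  def runInFreshGroup[T](sc: SparkContext, description: String)(body: => T): (String, T) = {
    val groupId = UUID.randomUUID().toString
    // setJobGroup is per-thread, so this must be called on the thread that runs the actions.
    sc.setJobGroup(groupId, description)
    (groupId, body)
  }
}

A later sc.cancelJobGroup(groupId) then only targets the jobs submitted inside that block, which is the isolation being asked for elsewhere in the thread.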





Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
Got it. Seems like I'd better stay away from this feature for now.




Re: trying to understand job cancellation

2014-03-05 Thread Mayur Rustagi
One issue is that job cancellation is posted on the event loop, so it's
possible that subsequent jobs submitted to the job queue may beat the job
cancellation event, and hence the cancellation may end up closing them too.
So there's definitely a race condition you are risking, even if you are not
running into it.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 
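
One way to narrow that window, sketched under the assumption that the job being cancelled was submitted asynchronously from the driver as a Scala Future: block until the cancelled action has actually settled before submitting anything new, and give follow-up work its own group id. This does not remove the race described above; it only keeps fresh submissions out of the queue while the cancellation event is in flight. The helper and its name are illustrative.

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import org.apache.spark.SparkContext

object CancelThenWait {
  // Hypothetical helper: cancel a job group, then block until the job's action has settled,
  // so that nothing new is submitted from this thread while the cancellation is in flight.
  def cancelAndDrain[T](sc: SparkContext, groupId: String, running: Future[T]): Unit = {
    sc.cancelJobGroup(groupId)
    Await.ready(running, 10.minutes)
    running.value.get match {
      case Success(v) => println(s"job finished before the cancellation landed: $v")
      case Failure(e) => println(s"job was cancelled: ${e.getMessage}")
    }
    // Only now submit later jobs, each under a fresh group id of its own.
  }
}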





Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
SparkContext.cancelJobGroup
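
For reference, a minimal sketch of how this call pairs with SparkContext.setJobGroup, assuming a local SparkContext; the group id and the toy job are illustrative, and the action runs on a separate thread only so the main thread stays free to issue the cancel:

import java.util.UUID
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}

object CancelJobGroupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cancel-group").setMaster("local[2]"))
    val groupId = UUID.randomUUID().toString

    val job = Future {
      // setJobGroup is per-thread, so tag the group on the thread that submits the action.
      sc.setJobGroup(groupId, "cancellable job")
      sc.parallelize(1 to 20000, 8).map { i => Thread.sleep(1); i }.count()
    }

    Thread.sleep(2000)           // let a few tasks start
    sc.cancelJobGroup(groupId)   // cancels every job tagged with this group id
    Await.ready(job, 2.minutes)  // the future fails once the job is killed

    sc.stop()
  }
}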




Re: trying to understand job cancellation

2014-03-05 Thread Mayur Rustagi
How do you cancel the job? Which API do you use?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi 





Re: trying to understand job cancellation

2014-03-05 Thread Koert Kuipers
I also noticed that jobs (with a new JobGroupId) which I run after this, and
which use the same RDDs, get very confused. I see lots of cancelled stages
and retries that go on forever.




trying to understand job cancellation

2014-03-04 Thread Koert Kuipers
I have a running job that I cancel while keeping the Spark context alive.

At the time of cancellation the active stage is 14.

I see in the logs:
2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group
3a25db23-2e39-4497-b7ab-b26b2a976f9c
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0
from pool x
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15

So far it all looks good. Then I get a lot of messages like this:
2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with
state FINISHED from TID 883 because its task set is gone
2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with
state KILLED from TID 888 because its task set is gone

After this, stage 14 hangs around in active stages, without any sign of
progress or cancellation. It just sits there forever, stuck. Looking at the
logs of the executors confirms this: the tasks seem to be still running,
but nothing is happening. For example (by the time I look at this it's 4:58,
so this task hasn't done anything in 15 mins):

14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
14/03/04 16:43:16 INFO Executor: Finished task ID 943
14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
14/03/04 16:43:16 INFO Executor: Finished task ID 945
14/03/04 16:43:19 INFO BlockManager: Removing RDD 66

Not sure what to make of this. Any suggestions? Best, Koert