On Spark 1.0.0-SNAPSHOT this seems to work; at least so far I have seen no issues.
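For reference, a minimal sketch of the pattern being exercised here, using the SparkContext API named further down in the thread; the app name, workload, and sleep timings are illustrative assumptions, not taken from the thread. Note that setJobGroup applies per thread, so it has to be called on the thread that actually submits the job:

    import java.util.UUID
    import org.apache.spark.{SparkContext, SparkException}

    object CancelDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "cancel-demo") // illustrative setup
        val groupId = UUID.randomUUID().toString // randomized group id, as suggested below

        // setJobGroup is thread-local, so set it on the thread that submits the job
        val worker = new Thread(new Runnable {
          def run(): Unit = {
            sc.setJobGroup(groupId, "cancellable work")
            try {
              sc.parallelize(1 to 1000000, 100)
                .map { x => Thread.sleep(1); x }
                .count()
            } catch {
              case _: SparkException => () // a cancelled job surfaces as a SparkException
            }
          }
        })
        worker.start()

        Thread.sleep(2000)         // let some tasks start
        sc.cancelJobGroup(groupId) // the call under discussion in this thread
        worker.join()
        sc.stop()
      }
    }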
On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers <ko...@tresata.com> wrote:

> It's a 0.9 snapshot from January, running in standalone mode.
>
> Have these fixes been merged into 0.9?
>
>
> On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Which version of Spark is this in, Koert? There might have been some
>> fixes more recently for it.
>>
>> Matei
>>
>> On Mar 5, 2014, at 5:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> Sorry, I meant to say: it seems the issue is RDDs shared between a job
>> that got cancelled and a later job.
>>
>> However, even disregarding that, I have the other issue that the active
>> task of the cancelled job hangs around forever, not doing anything....
>> On Mar 5, 2014 7:29 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> Yes, jobs on RDDs that were not part of the cancelled job work fine.
>>>
>>> So it seems the issue is the cached RDDs that are shared between the
>>> cancelled job and the jobs after that.
>>>
>>>
>>> On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> Well, the new jobs use existing RDDs that were also used in the job
>>>> that got killed.
>>>>
>>>> Let me confirm that new jobs that use completely different RDDs do
>>>> not get killed.
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> Quite unlikely, as job ids are assigned incrementally, so your
>>>>> future jobs are not likely to be killed if your group id is not
>>>>> repeated. I guess the issue is something else.
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> I did that. My next job gets a random new job group id (a UUID).
>>>>>> However, that doesn't seem to stop the job from getting sucked
>>>>>> into the cancellation.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>
>>>>>>> You can randomize job groups as well, to secure yourself against
>>>>>>> termination.
>>>>>>>
>>>>>>> Mayur Rustagi
>>>>>>> Ph: +1 (760) 203 3257
>>>>>>> http://www.sigmoidanalytics.com
>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> Got it. Seems like I'd better stay away from this feature for now..
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> One issue is that job cancellation is posted on the event loop,
>>>>>>>>> so it's possible that subsequent jobs submitted to the job queue
>>>>>>>>> may beat the job cancellation event, and hence the cancellation
>>>>>>>>> may end up closing them too. So there's definitely a race
>>>>>>>>> condition you are risking, even if you are not running into it.
>>>>>>>>>
>>>>>>>>> Mayur Rustagi
>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>
>>>>>>>>>> SparkContext.cancelJobGroup
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> How do you cancel the job? Which API do you use?
>>>>>>>>>>>
>>>>>>>>>>> Mayur Rustagi
>>>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I also noticed that jobs (with a new JobGroupId) which I run
>>>>>>>>>>>> after this, and which use the same RDDs, get very confused. I
>>>>>>>>>>>> see lots of cancelled stages and retries that go on forever.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have a running job that I cancel while keeping the
>>>>>>>>>>>>> SparkContext alive.
>>>>>>>>>>>>>
>>>>>>>>>>>>> At the time of cancellation the active stage is 14.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see in the logs:
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>>>>>>>>>>
>>>>>>>>>>>>> So far it all looks good. Then I get a lot of messages like this:
>>>>>>>>>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>>>>>>>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>>>>>>>>>>
>>>>>>>>>>>>> After this, stage 14 hangs around in active stages, without
>>>>>>>>>>>>> any sign of progress or cancellation. It just sits there
>>>>>>>>>>>>> forever, stuck. Looking at the logs of the executors confirms
>>>>>>>>>>>>> this: the tasks seem to be still running, but nothing is
>>>>>>>>>>>>> happening.
>>>>>>>>>>>>> For example (by the time I look at this it's 4:58, so this
>>>>>>>>>>>>> task hasn't done anything in 15 minutes):
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>>>>>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure what to make of this. Any suggestions? Best, Koert
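Given the race Mayur describes above (a cancellation event on the event loop racing with newly submitted jobs), one defensive pattern is to run each logical unit of work under its own never-reused group id. A sketch, with a hypothetical helper name, assuming SparkContext.clearJobGroup is available in the version at hand:

    import java.util.UUID
    import org.apache.spark.SparkContext

    // Hypothetical helper: run a block of jobs under a fresh group id, so a
    // stale cancellation event for an earlier group cannot match these jobs.
    def runInFreshGroup[T](sc: SparkContext, description: String)(body: => T): T = {
      sc.setJobGroup(UUID.randomUUID().toString, description)
      try body
      finally sc.clearJobGroup() // drop the group id once the block is done
    }

    // usage: val n = runInFreshGroup(sc, "follow-up count") { someCachedRdd.count() }

Note that, per Koert's reports earlier in the thread, a fresh group id alone did not avoid the hang when the follow-up jobs shared cached RDDs with the cancelled one; this only guards against the group-id race itself.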