Sorry, I meant to say: it seems the issue is RDDs shared between a job that got
cancelled and a later job.

However, even disregarding that, I have the other issue that the active task
of the cancelled job hangs around forever, not doing anything.
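
Roughly, the pattern that triggers this for me looks like the sketch below
(not my actual code: the input path and descriptions are made up, and `sc`
is an existing SparkContext):

    import java.util.UUID

    // a cached RDD that ends up shared between jobs
    val shared = sc.textFile("hdfs:///some/input").map(_.length).cache()

    val group1 = UUID.randomUUID.toString
    sc.setJobGroup(group1, "first job")
    shared.count()  // cancelled partway from another thread:
                    //   sc.cancelJobGroup(group1)

    val group2 = UUID.randomUUID.toString
    sc.setJobGroup(group2, "second job")
    shared.count()  // reuses the cached RDD; this is where stages get confused
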
On Mar 5, 2014 7:29 PM, "Koert Kuipers" <ko...@tresata.com> wrote:

> yes, jobs on RDDs that were not part of the cancelled job work fine.
>
> so it seems the issue is the cached RDDs that are shared between the
> cancelled job and the jobs after that.
>
>
> On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> well, the new jobs use existing RDDs that were also used in the job that
>> got killed.
>>
>> let me confirm that new jobs that use completely different RDDs do not
>> get killed.
>>
>>
>>
>> On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>
>>> Quite unlikely, as job ids are assigned in an incremental fashion, so your
>>> future jobs are not likely to be killed if your group id is not repeated. I
>>> guess the issue is something else.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> i did that. my next job gets a random new job group id (a uuid).
>>>> however, that doesn't seem to stop the job from getting sucked into the
>>>> cancellation.
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> You can randomize job groups as well, to secure yourself against
>>>>> termination.
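>>>>>
>>>>> For example, something along these lines (a rough sketch, assuming an
>>>>> existing SparkContext `sc` and some RDD `rdd`):
>>>>>
>>>>>     import java.util.UUID
>>>>>
>>>>>     // a fresh group id that cannot collide with any earlier one
>>>>>     sc.setJobGroup(UUID.randomUUID.toString, "my job")
>>>>>     rdd.count()
>>>>>
>>>>> A cancelJobGroup call on any previous group id can then never match
>>>>> this job.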
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> got it. seems like i'd better stay away from this feature for now...
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>
>>>>>>> One issue is that job cancellation is posted on the event loop, so it's
>>>>>>> possible that subsequent jobs submitted to the job queue beat the job
>>>>>>> cancellation event, and hence the cancellation event may end up killing
>>>>>>> them too.
>>>>>>> So there's definitely a race condition you are risking, even if you are
>>>>>>> not running into it.
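>>>>>>>
>>>>>>> The risky sequence is roughly this (a sketch, assuming an existing
>>>>>>> SparkContext `sc` and some RDD `rdd`):
>>>>>>>
>>>>>>>     import java.util.UUID
>>>>>>>
>>>>>>>     val group = UUID.randomUUID.toString
>>>>>>>
>>>>>>>     // thread 1: run jobs under the group
>>>>>>>     val worker = new Thread(new Runnable {
>>>>>>>       def run() {
>>>>>>>         sc.setJobGroup(group, "work")
>>>>>>>         rdd.count()   // first job
>>>>>>>         rdd.count()   // follow-up job: if this is submitted before the
>>>>>>>                       // cancellation event below is handled, it can get
>>>>>>>                       // killed along with the first job
>>>>>>>       }
>>>>>>>     })
>>>>>>>     worker.start()
>>>>>>>
>>>>>>>     // thread 2: this only posts an event to the scheduler's event loop
>>>>>>>     sc.cancelJobGroup(group)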
>>>>>>>
>>>>>>> Mayur Rustagi
>>>>>>> Ph: +1 (760) 203 3257
>>>>>>> http://www.sigmoidanalytics.com
>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> SparkContext.cancelJobGroup
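>>>>>>>>
>>>>>>>> i.e. roughly this pattern (a sketch, assuming an existing
>>>>>>>> SparkContext `sc` and some RDD `rdd`; the group id is made up):
>>>>>>>>
>>>>>>>>     val groupId = "my-job-group"
>>>>>>>>
>>>>>>>>     // in the thread running the job:
>>>>>>>>     sc.setJobGroup(groupId, "some description")
>>>>>>>>     rdd.count()  // blocking action
>>>>>>>>
>>>>>>>>     // from another thread:
>>>>>>>>     sc.cancelJobGroup(groupId)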
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> How do you cancel the job? Which API do you use?
>>>>>>>>>
>>>>>>>>> Mayur Rustagi
>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>
>>>>>>>>>> i also noticed that jobs (with a new JobGroupId) which i run after
>>>>>>>>>> this, and which use the same RDDs, get very confused. i see lots of
>>>>>>>>>> cancelled stages and retries that go on forever.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> i have a running job that i cancel while keeping the spark
>>>>>>>>>>> context alive.
>>>>>>>>>>>
>>>>>>>>>>> at the time of cancellation the active stage is 14.
>>>>>>>>>>>
>>>>>>>>>>> i see in logs:
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>>>>>>>>
>>>>>>>>>>> so far it all looks good. then i get a lot of messages like this:
>>>>>>>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>>>>>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>>>>>>>>
>>>>>>>>>>> after this, stage 14 hangs around in active stages, without any
>>>>>>>>>>> sign of progress or cancellation. it just sits there forever, stuck.
>>>>>>>>>>> looking at the logs of the executors confirms this. the tasks seem
>>>>>>>>>>> to be still running, but nothing is happening. for example (by the
>>>>>>>>>>> time i look at this it's 4:58, so this task hasn't done anything in
>>>>>>>>>>> 15 mins):
>>>>>>>>>>>
>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>>>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>>>>>>
>>>>>>>>>>> not sure what to make of this. any suggestions? best, koert
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
