SparkContext.cancelJobGroup
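
For reference, a minimal sketch of how this is typically wired up: tag the jobs with SparkContext.setJobGroup from the thread that submits them, then call cancelJobGroup from another thread while the SparkContext stays alive. The group id "my-group", the sleep-based dummy job and the worker thread below are illustrative assumptions, not code from this thread; the interruptOnCancel flag is only available in newer Spark versions.

import org.apache.spark.{SparkConf, SparkContext}

object CancelJobGroupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cancel-sketch").setMaster("local[*]"))

    // Submit a long-running job from its own thread; setJobGroup tags the
    // jobs started from that thread with the given group id.
    val worker = new Thread {
      override def run(): Unit = {
        // interruptOnCancel asks Spark to interrupt the task threads on
        // cancellation (parameter exists in newer Spark versions).
        sc.setJobGroup("my-group", "long running job", interruptOnCancel = true)
        try {
          // deliberately slow job so there is something left to cancel
          sc.parallelize(1 to 1000000, 8).map { i => Thread.sleep(1); i }.count()
        } catch {
          case e: Exception => println("job ended: " + e.getMessage)
        }
      }
    }
    worker.start()

    Thread.sleep(5000)
    // Cancel every job in the group; the SparkContext itself stays alive
    // and can be reused for new jobs afterwards.
    sc.cancelJobGroup("my-group")

    worker.join()
    sc.stop()
  }
}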

On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> How do you cancel the job? Which API do you use?
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i also noticed that jobs (with a new JobGroupId) which i run after this
>> and which use the same RDDs get very confused. i see lots of cancelled
>> stages and retries that go on forever.
>>
>>
>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i have a running job that i cancel while keeping the spark context alive.
>>>
>>> at the time of cancellation the active stage is 14.
>>>
>>> i see in logs:
>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job
>>> group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
>>> cancelled
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet
>>> 14.0 from pool x
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>
>>> so far it all looks good. then i get a lot of messages like this:
>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update
>>> with state FINISHED from TID 883 because its task set is gone
>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update
>>> with state KILLED from TID 888 because its task set is gone
>>>
>>> after this, stage 14 hangs around in active stages without any sign of
>>> progress or cancellation. it just sits there forever, stuck. looking at
>>> the logs of the executors confirms this. the tasks seem to be still
>>> running, but nothing is happening. for example (by the time i look at
>>> this it's 4:58, so this task hasn't done anything in 15 mins):
>>>
>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is
>>> 1007
>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to
>>> driver
>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is
>>> 1007
>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to
>>> driver
>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>
>>> not sure what to make of this. any suggestions? best, koert
>>>
>>
>>
>
