One issue is that job cancellation is posted on the event loop. So it's possible that jobs submitted to the job queue afterwards may beat the job cancellation event, and hence the cancellation event may end up closing them too. So there is definitely a race condition you are risking, even if you are not running into it here.
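To make the ordering problem concrete, here is a toy sketch in Python. It is not Spark's DAGScheduler: `ToyScheduler`, `simulate`, and the event names are all made up for illustration, and it assumes the later job is registered under the same group ID as the one being cancelled. The only point is that a cancellation handled on an event loop acts on whatever is registered when the event is processed, not on what existed when cancellation was requested.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical event-loop scheduler; a toy model, not Spark's code."""

    def __init__(self):
        self.events = deque()  # pending events, handled in FIFO order
        self.active = {}       # job_id -> group_id for running jobs
        self.cancelled = []    # job_ids cancelled, in cancellation order

    def post(self, event):
        self.events.append(event)

    def run(self):
        while self.events:
            kind, *args = self.events.popleft()
            if kind == "submit":
                job_id, group_id = args
                self.active[job_id] = group_id
            elif kind == "cancel_group":
                (group_id,) = args
                # Cancels every job in the group that is active *now*,
                # including jobs whose submit event was handled first.
                for job_id in [j for j, g in self.active.items() if g == group_id]:
                    del self.active[job_id]
                    self.cancelled.append(job_id)


def simulate(events):
    """Run the events in the given queue order; return (cancelled, still active)."""
    scheduler = ToyScheduler()
    for event in events:
        scheduler.post(event)
    scheduler.run()
    return scheduler.cancelled, sorted(scheduler.active)


# Intended order: the cancel is handled before the second job is submitted.
ordered = simulate([
    ("submit", "job-1", "g"),
    ("cancel_group", "g"),
    ("submit", "job-2", "g"),
])

# Racy order: the second submit reaches the queue ahead of the cancel.
racy = simulate([
    ("submit", "job-1", "g"),
    ("submit", "job-2", "g"),
    ("cancel_group", "g"),
])
```

In this toy model `ordered` leaves job-2 active, while `racy` cancels both jobs. Whether and how the real scheduler can interleave events this way would need to be checked against the Spark source.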

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <ko...@tresata.com> wrote:

> SparkContext.cancelJobGroup
>
>
> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>
>> How do you cancel the job. Which API do you use?
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i also noticed that jobs (with a new JobGroupId) which i run after this
>>> use which use the same RDDs get very confused. i see lots of cancelled
>>> stages and retries that go on forever.
>>>
>>>
>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> i have a running job that i cancel while keeping the spark context
>>>> alive.
>>>>
>>>> at the time of cancellation the active stage is 14.
>>>>
>>>> i see in logs:
>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>
>>>> so far it all looks good.
>>>> then i get a lot of messages like this:
>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>
>>>> after this stage 14 hangs around in active stages, without any sign of
>>>> progress or cancellation. it just sits there forever, stuck. looking at the
>>>> logs of the executors confirms this. they task seem to be still running,
>>>> but nothing is happening. for example (by the time i look at this its 4:58
>>>> so this tasks hasnt done anything in 15 mins):
>>>>
>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>
>>>> not sure what to make of this. any suggestions? best, koert
>>>>
>>>
>>>
>>
>