i also noticed that jobs (with a new JobGroupId) which i run after this, and which use the same RDDs, get very confused. i see lots of cancelled stages and retries that go on forever.
On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i have a running job that i cancel while keeping the spark context alive.
>
> at the time of cancellation the active stage is 14.
>
> i see in logs:
> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>
> so far it all looks good. then i get a lot of messages like this:
> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>
> after this stage 14 hangs around in active stages, without any sign of
> progress or cancellation. it just sits there forever, stuck. looking at the
> logs of the executors confirms this. the tasks seem to still be running,
> but nothing is happening. for example (by the time i look at this it's 4:58,
> so this task hasn't done anything in 15 mins):
>
> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>
> not sure what to make of this. any suggestions?
>
> best, koert