i also noticed that jobs (with a new JobGroupId) which i run after this and
which use the same RDDs get very confused. i see lots of cancelled stages
and retries that go on forever.


On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i have a running job that i cancel while keeping the spark context alive.
>
> at the time of cancellation the active stage is 14.
>
> i see in logs:
> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group
> 3a25db23-2e39-4497-b7ab-b26b2a976f9c
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
> cancelled
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0
> from pool x
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>
> so far it all looks good. then i get a lot of messages like this:
> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with
> state FINISHED from TID 883 because its task set is gone
> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with
> state KILLED from TID 888 because its task set is gone
>
> after this stage 14 hangs around in active stages, without any sign of
> progress or cancellation. it just sits there forever, stuck. looking at the
> logs of the executors confirms this. the tasks seem to be still running,
> but nothing is happening. for example (by the time i look at this it's 4:58
> so this task hasn't done anything in 15 mins):
>
> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>
> not sure what to make of this. any suggestions? best, koert
>
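
To compare which stages the driver was asked to cancel against which ones it reported as cancelled, the driver log quoted above can be scanned with a small script. This is only a sketch: the two patterns are modeled on the TaskSchedulerImpl lines in this thread and may not match other Spark versions, and the helper names (unwrap, cancellation_summary) are made up for illustration. A stage with no confirmation line is not necessarily stuck, since it may simply have had no running task set at the time.

```python
import re

# Patterns modeled on the TaskSchedulerImpl lines quoted in this thread;
# other Spark versions may word these messages differently (assumption).
REQUESTED = re.compile(r"Cancelling stage (\d+)")
CONFIRMED = re.compile(r"Stage (\d+) was cancelled")


def unwrap(text):
    """Strip '> ' quote markers and rejoin lines that the mailer wrapped."""
    joined = []
    for raw in text.splitlines():
        line = re.sub(r"^>\s?", "", raw).strip()
        if not line:
            continue
        # A new log record starts with a date such as 2014/03/04 or 14/03/04.
        if re.match(r"\d{2,4}/\d{2}/\d{2} ", line) or not joined:
            joined.append(line)
        else:
            joined[-1] += " " + line
    return joined


def cancellation_summary(text):
    """Return (requested, confirmed) stage ids in order of appearance."""
    requested, confirmed = [], []
    for line in unwrap(text):
        m = REQUESTED.search(line)
        if m:
            requested.append(int(m.group(1)))
        m = CONFIRMED.search(line)
        if m:
            confirmed.append(int(m.group(1)))
    return requested, confirmed


# A few lines copied from the driver log in the quoted message.
sample = """\
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was
> cancelled
> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
"""

requested, confirmed = cancellation_summary(sample)
unconfirmed = [s for s in requested if s not in confirmed]
print(requested, confirmed, unconfirmed)  # [10, 14, 13] [14] [10, 13]
```

Run over the full driver log, the same function would list every stage from the "Asked to cancel job group" burst, which makes it easier to line up against what the web UI still shows as active.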
