How do you cancel the job? Which API do you use?

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i also noticed that jobs (with a new JobGroupId) which i run after this
> and which use the same RDDs get very confused. i see lots of cancelled
> stages and retries that go on forever.
>
> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i have a running job that i cancel while keeping the spark context alive.
>>
>> at the time of cancellation the active stage is 14.
>>
>> i see in logs:
>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>
>> so far it all looks good. then i get a lot of messages like this:
>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>
>> after this stage 14 hangs around in active stages, without any sign of
>> progress or cancellation. it just sits there forever, stuck. looking at the
>> logs of the executors confirms this. the tasks seem to be still running,
>> but nothing is happening. for example (by the time i look at this it's 4:58,
>> so this task hasn't done anything in 15 mins):
>>
>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>
>> not sure what to make of this. any suggestions? best, koert
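
For anyone trying to compare their own logs against the driver log quoted above, here is a minimal Python sketch that tallies which stages were asked to cancel, which ones logged a "was cancelled" confirmation, and which task updates were ignored. The function name `summarize_cancellation` and the regular expressions are my own assumptions based only on the message wording quoted in this thread, not on any Spark API. A stage without a confirmation line is not necessarily stuck; it may simply have had no live task set at that moment.

```python
import re

# Driver log lines copied from the quoted message (wrapped lines rejoined).
DRIVER_LOG = """\
2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
"""


def summarize_cancellation(log_text):
    """Hypothetical helper: summarize cancel-related lines in a driver log."""
    # Stages the scheduler was asked to cancel, in log order.
    requested = [int(n) for n in re.findall(r"Cancelling stage (\d+)", log_text)]
    # Stages that logged an explicit confirmation.
    confirmed = [int(n) for n in re.findall(r"Stage (\d+) was cancelled", log_text)]
    # Task updates dropped because their task set had already been removed.
    ignored = {
        int(tid): state
        for state, tid in re.findall(
            r"Ignoring update with state (\w+) from TID (\d+)", log_text
        )
    }
    return {
        "requested": requested,
        "confirmed": confirmed,
        "unconfirmed": sorted(set(requested) - set(confirmed)),
        "ignored_updates": ignored,
    }


if __name__ == "__main__":
    for key, value in summarize_cancellation(DRIVER_LOG).items():
        print(key, value)
```

On the log above, only stage 14 shows a confirmation, while stages 10, 11, 12, 13 and 15 show a cancel request without one, and TIDs 883 and 888 reported back after their task set was gone.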