On Spark 1.0.0-SNAPSHOT this seems to work; at least so far I have seen no issues.
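For reference, a minimal sketch of the pattern being exercised here, using the SparkContext API named further down in the thread; the app name, workload, and sleep timings are illustrative assumptions, not taken from the thread. Note that setJobGroup applies per thread, so it has to be called on the thread that actually submits the job:

    import java.util.UUID
    import org.apache.spark.{SparkContext, SparkException}

    object CancelDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "cancel-demo") // illustrative setup
        val groupId = UUID.randomUUID().toString // randomized group id, as suggested below

        // setJobGroup is thread-local, so set it on the thread that submits the job
        val worker = new Thread(new Runnable {
          def run(): Unit = {
            sc.setJobGroup(groupId, "cancellable work")
            try {
              sc.parallelize(1 to 1000000, 100)
                .map { x => Thread.sleep(1); x }
                .count()
            } catch {
              case _: SparkException => () // a cancelled job surfaces as a SparkException
            }
          }
        })
        worker.start()

        Thread.sleep(2000)         // let some tasks start
        sc.cancelJobGroup(groupId) // the call under discussion in this thread
        worker.join()
        sc.stop()
      }
    }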
On Thu, Mar 6, 2014 at 8:44 AM, Koert Kuipers <ko...@tresata.com> wrote:

> It's a 0.9 snapshot from January, running in standalone mode.
>
> Have these fixes been merged into 0.9?
>
>
> On Thu, Mar 6, 2014 at 12:45 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Which version of Spark is this in, Koert? There might have been some
>> fixes more recently for it.
>>
>> Matei
>>
>> On Mar 5, 2014, at 5:26 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> Sorry, I meant to say: it seems the issue is RDDs shared between a job
>> that got cancelled and a later job.
>>
>> However, even disregarding that, I have the other issue that the active
>> task of the cancelled job hangs around forever, not doing anything....
>> On Mar 5, 2014 7:29 PM, "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> Yes, jobs on RDDs that were not part of the cancelled job work fine.
>>>
>>> So it seems the issue is the cached RDDs that are shared between the
>>> cancelled job and the jobs after that.
>>>
>>>
>>> On Wed, Mar 5, 2014 at 7:15 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> Well, the new jobs use existing RDDs that were also used in the job
>>>> that got killed.
>>>>
>>>> Let me confirm that new jobs that use completely different RDDs do
>>>> not get killed.
>>>>
>>>>
>>>> On Wed, Mar 5, 2014 at 7:00 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> Quite unlikely, as job ids are assigned incrementally, so your
>>>>> future jobs are not likely to be killed if your group id is not
>>>>> repeated. I guess the issue is something else.
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2014 at 3:50 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> I did that. My next job gets a random new job group id (a UUID).
>>>>>> However, that doesn't seem to stop the job from getting sucked
>>>>>> into the cancellation.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 5, 2014 at 6:47 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>
>>>>>>> You can randomize job groups as well, to secure yourself against
>>>>>>> termination.
>>>>>>>
>>>>>>> Mayur Rustagi
>>>>>>> Ph: +1 (760) 203 3257
>>>>>>> http://www.sigmoidanalytics.com
>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 5, 2014 at 3:42 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> Got it. Seems like I'd better stay away from this feature for now..
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 5, 2014 at 5:55 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> One issue is that job cancellation is posted on the event loop,
>>>>>>>>> so it's possible that subsequent jobs submitted to the job queue
>>>>>>>>> may beat the job cancellation event, and hence the cancellation
>>>>>>>>> may end up closing them too. So there's definitely a race
>>>>>>>>> condition you are risking, even if you are not running into it.
>>>>>>>>>
>>>>>>>>> Mayur Rustagi
>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 5, 2014 at 2:40 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>
>>>>>>>>>> SparkContext.cancelJobGroup
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 5, 2014 at 5:32 PM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> How do you cancel the job? Which API do you use?
>>>>>>>>>>>
>>>>>>>>>>> Mayur Rustagi
>>>>>>>>>>> Ph: +1 (760) 203 3257
>>>>>>>>>>> http://www.sigmoidanalytics.com
>>>>>>>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 5, 2014 at 2:29 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I also noticed that jobs (with a new JobGroupId) which I run
>>>>>>>>>>>> after this, and which use the same RDDs, get very confused. I
>>>>>>>>>>>> see lots of cancelled stages and retries that go on forever.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 4, 2014 at 5:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I have a running job that I cancel while keeping the
>>>>>>>>>>>>> SparkContext alive.
>>>>>>>>>>>>>
>>>>>>>>>>>>> At the time of cancellation the active stage is 14.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see in the logs:
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.DAGScheduler: Asked to cancel job group 3a25db23-2e39-4497-b7ab-b26b2a976f9c
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 10
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 14
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Stage 14 was cancelled
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Remove TaskSet 14.0 from pool x
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 13
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 12
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 11
>>>>>>>>>>>>> 2014/03/04 16:43:19 INFO scheduler.TaskSchedulerImpl: Cancelling stage 15
>>>>>>>>>>>>>
>>>>>>>>>>>>> So far it all looks good. Then I get a lot of messages like this:
>>>>>>>>>>>>> 2014/03/04 16:43:20 INFO scheduler.TaskSchedulerImpl: Ignoring update with state FINISHED from TID 883 because its task set is gone
>>>>>>>>>>>>> 2014/03/04 16:43:24 INFO scheduler.TaskSchedulerImpl: Ignoring update with state KILLED from TID 888 because its task set is gone
>>>>>>>>>>>>>
>>>>>>>>>>>>> After this, stage 14 hangs around in active stages, without
>>>>>>>>>>>>> any sign of progress or cancellation. It just sits there
>>>>>>>>>>>>> forever, stuck. Looking at the logs of the executors confirms
>>>>>>>>>>>>> this: the tasks seem to be still running, but nothing is
>>>>>>>>>>>>> happening.
>>>>>>>>>>>>> For example (by the time I look at this it's 4:58, so this
>>>>>>>>>>>>> task hasn't done anything in 15 minutes):
>>>>>>>>>>>>>
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 943 is 1007
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 943 directly to driver
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 943
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Serialized size of result for 945 is 1007
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Sending result for 945 directly to driver
>>>>>>>>>>>>> 14/03/04 16:43:16 INFO Executor: Finished task ID 945
>>>>>>>>>>>>> 14/03/04 16:43:19 INFO BlockManager: Removing RDD 66
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure what to make of this. Any suggestions? Best, Koert
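Given the race Mayur describes above (a cancellation event on the event loop racing with newly submitted jobs), one defensive pattern is to run each logical unit of work under its own never-reused group id. A sketch, with a hypothetical helper name, assuming SparkContext.clearJobGroup is available in the version at hand:

    import java.util.UUID
    import org.apache.spark.SparkContext

    // Hypothetical helper: run a block of jobs under a fresh group id, so a
    // stale cancellation event for an earlier group cannot match these jobs.
    def runInFreshGroup[T](sc: SparkContext, description: String)(body: => T): T = {
      sc.setJobGroup(UUID.randomUUID().toString, description)
      try body
      finally sc.clearJobGroup() // drop the group id once the block is done
    }

    // usage: val n = runInFreshGroup(sc, "follow-up count") { someCachedRdd.count() }

Note that, per Koert's reports earlier in the thread, a fresh group id alone did not avoid the hang when the follow-up jobs shared cached RDDs with the cancelled one; this only guards against the group-id race itself.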