On 04.04.2012 at 18:09, Lars van der bijl wrote:

> in our case the application has no checkpointing capabilities. for us
> a reschedule is just a run from the start on a new host.
>
> so a checkpoint with a signal 9 should be enough?

No, the signal will be sent at min_cpu_interval intervals (from the queue definition) to create a checkpoint, not to do any custom kill. For a reschedule the job should be killed by the usual behavior; it's worth a try to see whether it works better than the default reschedule killing facility. So don't define any signal for user_defined or transparent checkpointing environments: as outlined above, it would be used for creating checkpoints.

-- Reuti


> On 4 April 2012 17:50, Reuti <re...@staff.uni-marburg.de> wrote:
>> On 04.04.2012 at 17:42, Lars van der bijl wrote:
>>
>>> Hey Reuti
>>>
>>> On 4 April 2012 17:14, Reuti <re...@staff.uni-marburg.de> wrote:
>>>> Well, in both cases it is killed of course. You could set loglevel to
>>>> log_info and search the messages file of the qmaster for entries like:
>>>>
>>>> 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 rescheduling because: manual/auto rescheduling
>>>> 04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
>>>> 04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
>>>
>>> might have to rotate the file before i try and do something like that,
>>> it's currently 117Mb.
>>>
>>>>
>>>> Then you can act on this. Do you have this often, that you want to
>>>> reschedule a job? I wonder whether using a checkpointing environment would
>>>> help (even if we don't intend to use any checkpointing at all). There you
>>>> can have a procedure for migration in migr_command.
>>>
>>> no it's not something I want to happen often but it happens. one thing
>>> i'm still struggling with on a related note is that a task will keep
>>> running even after it is rescheduled, making both of the outputs
>>> useless.
>>>
>>> would we be able to make sure the task is kill -9'd (and its sub
>>
>> The default behavior in SGE is:
>>
>> # kill -9 -- -pid
>>
>> This will kill the complete process group due to its negative value.
>> The problem of surviving kids should have been fixed since 6.2u3, as I found
>> recently, but sometimes it's still there.
>>
>>
>>> pids) if it's rescheduled using a checkpointing?
>>
>> In fact: you have to do it on your own. SGE will start the migr_command and
>> you have to checkpoint by any means and then kill all processes on your own.
>> You can have a look at my Howto:
>>
>> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>>
>> and example5 therein. To reschedule a job would then mean to suspend it from
>> the command line, which will start the migr_command.
>>
>> -- Reuti
>>
>>
>>>> -- Reuti
>>>>
>>>>
>>>> On 04.04.2012 at 16:33, Lars van der bijl wrote:
>>>>
>>>>> is there a way to tell the difference?
>>>>>
>>>>> if i reschedule a job i get these values in the usage file in the epilog
>>>>>
>>>>> wait_status=3727362
>>>>> exit_status=137
>>>>> signal=9
>>>>> start_time=1333549517
>>>>> end_time=1333549565
>>>>> ru_wallclock=48
>>>>> ru_utime=0.226965
>>>>> ru_stime=0.306953
>>>>> ru_maxrss=5408
>>>>> ru_ixrss=0
>>>>> ru_idrss=0
>>>>> ru_isrss=0
>>>>> ru_minflt=40792
>>>>> ru_majflt=5
>>>>> ru_nswap=0
>>>>> ru_inblock=7992
>>>>> ru_oublock=232
>>>>> ru_msgsnd=0
>>>>> ru_msgrcv=0
>>>>> ru_nsignals=0
>>>>> ru_nvcsw=3489
>>>>> ru_nivcsw=113
>>>>>
>>>>> if i kill the job I get this.
>>>>>
>>>>> wait_status=3727362
>>>>> exit_status=137
>>>>> signal=9
>>>>> start_time=1333549704
>>>>> end_time=1333549719
>>>>> ru_wallclock=15
>>>>> ru_utime=0.196970
>>>>> ru_stime=0.196970
>>>>> ru_maxrss=5412
>>>>> ru_ixrss=0
>>>>> ru_idrss=0
>>>>> ru_isrss=0
>>>>> ru_minflt=40459
>>>>> ru_majflt=0
>>>>> ru_nswap=0
>>>>> ru_inblock=0
>>>>> ru_oublock=232
>>>>> ru_msgsnd=0
>>>>> ru_msgrcv=0
>>>>> ru_nsignals=0
>>>>> ru_nvcsw=705
>>>>> ru_nivcsw=149
>>>>>
>>>>> anyone know of a way to tell the difference from the epilog?
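
For illustration, the two usage listings quoted above can be compared field by field. The following is only a sketch in Python: the helper names `parse_usage` and `differing_keys` are made up for this example, and only a few of the quoted fields are reproduced. It suggests why the epilog cannot tell the two cases apart from the usage file alone: the status fields are identical, and only resource counters such as ru_wallclock differ.

```python
# Sketch: compare the status fields of two SGE "usage" dumps.
# The values are copied from the two listings quoted in this thread;
# parse_usage and differing_keys are hypothetical helper names.

RESCHEDULED = """\
wait_status=3727362
exit_status=137
signal=9
ru_wallclock=48
"""

KILLED = """\
wait_status=3727362
exit_status=137
signal=9
ru_wallclock=15
"""


def parse_usage(text):
    """Turn key=value lines into a dict of strings; other lines are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def differing_keys(a, b):
    """Return the sorted keys whose values differ between two dumps."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


rescheduled = parse_usage(RESCHEDULED)
killed = parse_usage(KILLED)

# exit_status 137 follows the common "128 + signal number" convention,
# so it carries the same information as signal=9 (SIGKILL).
assert int(rescheduled["exit_status"]) - 128 == int(rescheduled["signal"])

print(differing_keys(rescheduled, killed))  # prints ['ru_wallclock']
```

Since wait_status, exit_status and signal match in both dumps, any distinction has to come from another source, such as the qmaster messages file mentioned earlier in the thread.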
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
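
As a rough sketch of the suggestion in this thread to search the qmaster messages file, the Python snippet below classifies log lines as a reschedule or a deletion. The regular expressions are an assumption derived only from the three sample lines quoted in the thread, not from any documented message format, and `classify` is a hypothetical helper name. In practice the lines would come from the qmaster messages file with loglevel set to log_info.

```python
import re

# Sample lines copied from the qmaster messages excerpt quoted in the thread.
SAMPLE = """\
04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 rescheduling because: manual/auto rescheduling
04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
"""

# Assumed patterns: the message is the last "|"-separated field of a line.
RESCHEDULE_RE = re.compile(r"\|rescheduling job (\d+(?:\.\d+)?)\s*$")
DELETE_RE = re.compile(r"\|\S+ has deleted job (\d+(?:\.\d+)?)\s*$")


def classify(lines):
    """Yield (job_id, event) for lines that look like a reschedule or a delete."""
    for line in lines:
        match = RESCHEDULE_RE.search(line)
        if match:
            yield match.group(1), "rescheduled"
            continue
        match = DELETE_RE.search(line)
        if match:
            yield match.group(1), "deleted"


for job_id, event in classify(SAMPLE.splitlines()):
    print(job_id, event)
```

An epilog could run a scan like this for its own job id, though on a large messages file (such as the 117Mb one mentioned above) reading only the tail of the file would probably be preferable.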