On 04.04.2012 at 18:09, Lars van der bijl wrote:

> In our case the application has no checkpointing capabilities. For us
> a reschedule just means running from the start on a new host.
> 
> So a checkpoint with a signal 9 should be enough?

No, that signal will be sent to create a checkpoint at min_cpu_interval (from 
the queue definition) intervals, not to perform any custom kill. 

For a reschedule the job should be killed by the usual mechanism; it's worth 
trying whether this works better than the default reschedule killing facility. 
Don't define any signal for userdefined or transparent checkpointing 
environments then, as outlined above it would be used for creating checkpoints.

-- Reuti


> On 4 April 2012 17:50, Reuti <re...@staff.uni-marburg.de> wrote:
>> On 04.04.2012 at 17:42, Lars van der bijl wrote:
>> 
>>> Hey Reuti
>>> 
>>> On 4 April 2012 17:14, Reuti <re...@staff.uni-marburg.de> wrote:
>>>> Well, in both cases it is killed of course. You could set loglevel to 
>>>> log_info and search the messages file of the qmaster for entries like:
>>>> 
>>>> 04/04/2012 17:03:07|worker|pc15370|W|job 3963.1 failed on host pc15370 
>>>> rescheduling because: manual/auto rescheduling
>>>> 04/04/2012 17:03:07|worker|pc15370|W|rescheduling job 3963.1
>>>> 04/04/2012 17:03:46|worker|pc15370|I|reuti has deleted job 396
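For example, a hypothetical epilog snippet could search the qmaster messages file for such a "rescheduling" line. The messages path, the default SGE_ROOT/SGE_CELL values, and the was_rescheduled helper are assumptions for illustration, not part of SGE:

```shell
#!/bin/sh
# Decide whether this job run ended because of a reschedule by
# searching the qmaster messages file (path is an assumption;
# adjust to your installation).
MESSAGES="${SGE_ROOT:-/usr/sge}/${SGE_CELL:-default}/spool/qmaster/messages"

was_rescheduled() {    # usage: was_rescheduled <job_id> [messages_file]
    grep -q "rescheduling job $1\." "${2:-$MESSAGES}"
}

if was_rescheduled "$JOB_ID"; then
    echo "job $JOB_ID was rescheduled"
fi
```

Note this only works for recent jobs if the messages file is rotated, and scanning a large file from every epilog has a cost.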
>>> 
>>> Might have to rotate the file before I try something like that;
>>> it's currently 117 MB.
>>> 
>>>> 
>>>> Then you can act on this. Do you have this often, that you want to 
>>>> reschedule a job? I wonder whether using a checkpointing environment would 
>>>> help (even if you don't intend to do any actual checkpointing at all). There 
>>>> you can have a procedure for migration in migr_command.
>>> 
>>> No, it's not something I want to happen often, but it happens. One thing
>>> I'm still struggling with, on a related note, is that a task will keep
>>> running even after it is rescheduled, making both of the outputs
>>> useless.
>>> 
>>> Would we be able to make sure the task is kill -9'd (and its sub
>> 
>> The default behavior in SGE is:
>> 
>> # kill -9 -- -pid
>> 
>> This will kill the complete process group due to its negative value. The 
>> problem of surviving child processes should have been fixed since 6.2u3, as 
>> I found recently, but sometimes it still occurs.
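The process-group kill can be sketched in a few lines of bash; sleep here is just a stand-in for a job process:

```shell
#!/bin/bash
# Demonstrate killing a whole process group with a negative PID,
# as in "kill -9 -- -pid" above.
set -m                  # job control: background jobs get their own PGID
sleep 300 &             # stand-in for the job (group leader)
pgid=$!                 # leader PID equals the process group ID
kill -9 -- -"$pgid"     # negative PID: signal every process in the group
wait "$pgid" 2>/dev/null
echo "exit status: $?"  # 137 = 128 + 9, matching the usage file above
```

The "--" ends option parsing so the negative PID isn't mistaken for a flag, and 128 + signal number is exactly where the exit_status=137 in the usage dumps below comes from.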
>> 
>> 
>>> pids) if it's rescheduled using a checkpointing environment?
>> 
>> In fact, you have to do it on your own: SGE will start the migr_command, and 
>> you have to checkpoint by whatever means and then kill all processes yourself. 
>> You can have a look at my Howto:
>> You can have a look at my Howto:
>> 
>> http://arc.liv.ac.uk/SGE/howto/checkpointing.html
>> 
>> and example5 therein. Rescheduling a job would then mean suspending it from 
>> the command line, which will start the migr_command.
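A hypothetical migr_command in this spirit could take no real checkpoint and simply kill the job's process group so the rescheduled run starts cleanly. The script path and the use of the $job_pid pseudo-variable from the ckpt configuration are assumptions (see checkpoint(5)), not a tested setup:

```shell
#!/bin/sh
# Hypothetical migr_command: no checkpoint is written; the job's whole
# process group is killed so the rescheduled run starts from scratch.
# The job PID would be passed as $1 via the $job_pid pseudo-variable
# configured in the ckpt object.

kill_job_group() {              # usage: kill_job_group <job_pid>
    [ -n "$1" ] || return 1
    # Negative PID: SIGKILL the entire process group, children included.
    kill -9 -- -"$1" 2>/dev/null
}

kill_job_group "$1" || true
```

This assumes the job PID is also the process group leader, which is how the shepherd starts jobs.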
>> 
>> -- Reuti
>> 
>> 
>>>> -- Reuti
>>>> 
>>>> 
>>>> On 04.04.2012 at 16:33, Lars van der bijl wrote:
>>>> 
>>>>> Is there a way to tell the difference?
>>>>> 
>>>>> If I reschedule a job, I get these values in the usage file in the epilog:
>>>>> 
>>>>> wait_status=3727362
>>>>> exit_status=137
>>>>> signal=9
>>>>> start_time=1333549517
>>>>> end_time=1333549565
>>>>> ru_wallclock=48
>>>>> ru_utime=0.226965
>>>>> ru_stime=0.306953
>>>>> ru_maxrss=5408
>>>>> ru_ixrss=0
>>>>> ru_idrss=0
>>>>> ru_isrss=0
>>>>> ru_minflt=40792
>>>>> ru_majflt=5
>>>>> ru_nswap=0
>>>>> ru_inblock=7992
>>>>> ru_oublock=232
>>>>> ru_msgsnd=0
>>>>> ru_msgrcv=0
>>>>> ru_nsignals=0
>>>>> ru_nvcsw=3489
>>>>> ru_nivcsw=113
>>>>> 
>>>>> If I kill the job, I get this:
>>>>> 
>>>>> wait_status=3727362
>>>>> exit_status=137
>>>>> signal=9
>>>>> start_time=1333549704
>>>>> end_time=1333549719
>>>>> ru_wallclock=15
>>>>> ru_utime=0.196970
>>>>> ru_stime=0.196970
>>>>> ru_maxrss=5412
>>>>> ru_ixrss=0
>>>>> ru_idrss=0
>>>>> ru_isrss=0
>>>>> ru_minflt=40459
>>>>> ru_majflt=0
>>>>> ru_nswap=0
>>>>> ru_inblock=0
>>>>> ru_oublock=232
>>>>> ru_msgsnd=0
>>>>> ru_msgrcv=0
>>>>> ru_nsignals=0
>>>>> ru_nvcsw=705
>>>>> ru_nivcsw=149
>>>>> 
>>>>> Anyone know of a way to tell the difference from the epilog?
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users@gridengine.org
>>>>> https://gridengine.org/mailman/listinfo/users
>>>> 
>>> 
>> 
> 

