Re: [gridengine users] Queue limit s_rt / h_rt and CheckPoint

Reuti Thu, 31 Oct 2013 16:27:00 -0700

Hi,

On 31.10.2013 at 21:24, Joseph Farran wrote:


> Not sure if there is a better way, but the following seems to be working.
> 
> In the checkpoint scripts (the submit script), I am catching the SIGUSR1
> signal and then suspending the job via qmod with:
> 
> function SIGUSR1_HANDLER()
> {
>    qmod -sj $JOB_ID
> }
> trap SIGUSR1_HANDLER  SIGUSR1
> 
> So when "s_rt" is reached and the job receives the SIGUSR1 signal, it
> suspends itself via qmod.

Although this looks fine, I can't get it to work reliably. It works the first 
time, but on the second iteration the job is killed immediately, even if no 
h_rt is attached at all (neither requested nor set in the queue definition).
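
For reference, a minimal sketch of how such a submit script could be laid out. 
One thing I am sure about: the shell runs a trap handler only after the current 
foreground command has finished, so the payload has to run in the background 
with `wait`. The names on_soft_limit, run_payload and SUSPEND_CMD are made up 
for illustration, and this does not cure the kill on the second iteration.

```shell
#!/bin/sh
# Sketch only. on_soft_limit, run_payload and SUSPEND_CMD are hypothetical
# names. JOB_ID is set by SGE in the job environment.

# Suspend this job; SUSPEND_CMD can be overridden for dry runs.
on_soft_limit() {
    ${SUSPEND_CMD:-qmod -sj} "$JOB_ID"
}
trap on_soft_limit USR1

# Run the payload in the background so the trap can fire while it runs.
# `wait` returns early when a trapped signal arrives, hence the loop.
# A rare race right at payload exit is ignored in this sketch.
run_payload() {
    "$@" &
    payload_pid=$!
    while :; do
        wait "$payload_pid"
        status=$?
        # Stop looping once the payload process is gone.
        kill -0 "$payload_pid" 2>/dev/null || break
    done
    return "$status"
}

# Usage inside the job script (hypothetical payload):
# run_payload ./my_simulation
```

With SUSPEND_CMD=echo the handler only prints the job id, which is handy for 
testing the script outside of SGE.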

It looks like SGE checks whether a warning was already sent and, if so, 
issues a SIGKILL right away. On the one hand this is of course wrong. On the 
other, it is certainly open to discussion: is s_rt/h_rt meant per iteration or 
for the overall job time? (Maybe: queue = per iteration, resource request = 
overall time?)

The only option I see is to handle this outside of SGE: run `qstatus -r`*) 
once in a while to get the runtime of each job and take appropriate measures, 
i.e. execute `qmod -sj <job_id>` as you intended.
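
A rough sketch of such an external watcher follows. I assume the runtime 
listing can be boiled down to lines of "job_id runtime_seconds"; that input 
format, the function name suspend_overdue and the variables LIMIT_SECONDS and 
SUSPEND_CMD are my assumptions, not something SGE or qstatus provides as such.

```shell
#!/bin/sh
# Sketch only. Assumed input on stdin: one "job_id runtime_seconds" pair per
# line (hypothetical format, to be produced from the runtime listing).

# 48 hours in seconds, matching the s_rt of the queue in this thread.
LIMIT_SECONDS=${LIMIT_SECONDS:-172800}

# Suspend every job whose runtime has reached the limit.
suspend_overdue() {
    while read -r job_id runtime; do
        # Skip header lines or anything without a plain integer runtime.
        case "$runtime" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$runtime" -ge "$LIMIT_SECONDS" ]; then
            ${SUSPEND_CMD:-qmod -sj} "$job_id"
        fi
    done
}
```

It could be fed from cron every few minutes. Setting SUSPEND_CMD=echo gives a 
dry run that only prints the job ids it would suspend.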

-- Reuti

*) The awk script needs a small change to output the raw seconds instead of 
the formatted time in the "(relative)" case:

starttime=sprintf("%s", running_seconds)
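
To illustrate the difference, here is a small standalone sketch. It is not 
taken from the qstatus script; the function names and the h:mm:ss layout are 
only assumptions chosen for the demonstration. The raw value is what a 
numeric comparison against a limit needs.

```shell
#!/bin/sh
# Sketch only. raw_runtime and formatted_runtime are hypothetical helpers.

# Raw seconds, as the modified sprintf line would store them.
raw_runtime() {
    awk -v running_seconds="$1" 'BEGIN { print sprintf("%s", running_seconds) }'
}

# A human-readable h:mm:ss form (assumed layout), awkward to compare.
formatted_runtime() {
    awk -v s="$1" 'BEGIN {
        printf "%d:%02d:%02d\n", int(s / 3600), int((s % 3600) / 60), s % 60
    }'
}

raw_runtime 3725         # prints 3725
formatted_runtime 3725   # prints 1:02:05
```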


> Joseph
> 
> 
> On 10/31/2013 11:48 AM, Joseph Farran wrote:
>> Greetings.
>> 
>> We have a queue defined with a soft & hard wall-clock limit of:
>> 
>> qconf -sq free64 | egrep "_rt|notify"
>> notify                00:05:00
>> s_rt                  48:00:00
>> h_rt                  48:05:00
>> 
>> And jobs get killed correctly after 2 days of wall-clock run time. We now
>> have Grid Engine checkpoint setup and would like to make it so that jobs do
>> not get killed, but rather be sent the suspend signal so that checkpoint
>> takes over instead of being killed.
>> 
>> After reading and doing some tests with the queue "suspend_method", I am not
>> sure I am on the right track.
>> 
>> So what is the proper / correct way to do this?    To *not* have jobs killed
>> but to have the checkpoint process take over when s_rt is reached?
>> 
>> Joseph
>> 
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>> 
> 

