Hi,
On 31.10.2013 at 21:24, Joseph Farran wrote:
> Not sure if there is a better way, but the following seems to be working.
>
> In the checkpoint setup's submit script, I am catching the SIGUSR1 signal
> and then suspending the job via qmod with:
>
> function SIGUSR1_HANDLER()
> {
> qmod -sj $JOB_ID
> }
> trap SIGUSR1_HANDLER SIGUSR1
>
> So when "s_rt" is reached and the job receives the SIGUSR1 signal, the
> handler suspends the job via qmod.
Although this looks fine, I can't get it working. That is, it works the first
time, but in the second iteration the job is killed immediately, even though
no h_rt is attached at all (or set in the queue definition).

It looks like SGE checks whether a warning was already issued and, if so,
directly sends a SIGKILL. On the one hand this is of course wrong, but it is
certainly a matter for discussion: do s_rt/h_rt apply per iteration or to the
overall job time? (Maybe: queue = per iteration, resource request = overall time?)

The only option I see is to do this outside of SGE: run `qstatus -r`*) once in
a while to get the runtime per job and take appropriate measures, i.e.
execute `qmod -sj <job_id>` as you intended.
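
A rough sketch of what such an external check could look like, not a tested
production script. It assumes the per-job runtimes are already available as
lines of the form "<job_id> <running_seconds>"; that input format and the
names select_overdue, suspend_jobs, DRY_RUN and list_runtimes are made up for
this illustration. Only `qmod -sj` comes from the thread itself.

```shell
#!/bin/bash
# Sketch of an external watchdog: suspend jobs whose runtime exceeds a limit.
# ASSUMPTION: stdin lines look like "<job_id> <running_seconds>". How they are
# produced (e.g. from a modified qstatus script) is site-specific.

# Print the IDs of all jobs whose runtime is at or above the limit (seconds).
select_overdue() {
    local limit=$1
    awk -v limit="$limit" 'NF >= 2 && $2 + 0 >= limit + 0 { print $1 }'
}

# Suspend every job ID read from stdin.
# With DRY_RUN=1 the qmod commands are only printed, not executed.
suspend_jobs() {
    local job_id
    while read -r job_id; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "qmod -sj $job_id"
        else
            qmod -sj "$job_id"
        fi
    done
}

# Example, with a hypothetical list_runtimes command and a 48 h limit:
#   list_runtimes | select_overdue $((48 * 3600)) | suspend_jobs
```

Run from cron or a loop with a sleep, this would stand in for the s_rt
warning. Whether the suspend then actually triggers the checkpoint depends on
the checkpoint environment's configuration.
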
-- Reuti
*) It's necessary to change the awk script so that it outputs the raw value
instead of the formatted time in the "(relative)" case:
starttime=sprintf("%s", running_seconds)
> Joseph
>
>
> On 10/31/2013 11:48 AM, Joseph Farran wrote:
>> Greetings.
>>
>> We have a queue defined with a soft & hard wall-clock limit of:
>>
>> qconf -sq free64 | egrep "_rt|notify"
>> notify 00:05:00
>> s_rt 48:00:00
>> h_rt 48:05:00
>>
>> And jobs get killed correctly after 2 days of wall-clock run time. We now
>> have Grid Engine checkpointing set up and would like to make it so that
>> jobs do not get killed, but rather are sent the suspend signal so that the
>> checkpoint takes over.
>>
>> After reading and doing some tests with the queue "suspend_method", I am not
>> sure I am on the right track.
>>
>> So what is the proper / correct way to do this? To *not* have jobs killed,
>> but to have the checkpoint process take over when s_rt is reached?
>>
>> Joseph
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>