Hi,

On 01.11.2013 at 19:18, Joseph Farran wrote:

> Yes, after going through the logs, the subsequent restarts are messed up.
> 
> I've played with it more and there is no easy way to do this inside the job 
> submission script,

Inside the submission script it is possible. I thought you were looking to get 
it implemented in SGE itself. In the script, the user has to take care of it 
(i.e. you have to trust the users), or you can use a "starter_method":

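#!/bin/sh
# Load the SGE environment so that qmod is found in the PATH.
. /usr/sge/default/common/settings.sh
# Background timer: suspend this job after 172800 s (48 hours).
{ sleep 172800; qmod -sj "$JOB_ID"; } &
./my_application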


> so I will have to resort (as you indicated) to using an outside script that 
> runs periodically and does a "qmod -sj" on the job / job.task-id when near 
> the s_rt value.
> 
> It seems to me that Grid Engine is missing an option in the checkpoint 
> environment that, when the s_rt value has been reached, triggers the 
> equivalent of a suspension ("qmod -sj").

Yes. I would call it runtime-interval or something similar inside the 
checkpoint definition, to distinguish it from s_rt/h_rt.
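
For the outside-script route, the selection step could be sketched roughly as 
below. This is only a sketch under assumptions: the function name 
jobs_near_limit, the two-column input format ("<job_id> <elapsed_seconds>" per 
line) and the safety margin are made up for illustration; how you produce that 
listing (e.g. from a runtime listing per job) is left open here.

```sh
#!/bin/sh
# Sketch only: print the ids of jobs whose elapsed runtime is close to a limit.
# stdin : one "<job_id> <elapsed_seconds>" pair per line (hypothetical format)
# $1    : runtime limit in seconds (e.g. the s_rt value)
# $2    : safety margin in seconds
jobs_near_limit() {
    awk -v limit="$1" -v margin="$2" '
        # Ignore malformed lines; select jobs with runtime >= limit - margin.
        NF >= 2 && $2 ~ /^[0-9]+$/ && $2 + 0 >= limit - margin { print $1 }
    '
}
```

Run periodically (e.g. from cron), each selected id, either a job or a 
job.task-id, would then be handed to `qmod -sj`, for instance in a 
`while read -r id; do qmod -sj "$id"; done` loop.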

-- Reuti


> Best,
> Joseph
> 
> On 10/31/2013 04:23 PM, Reuti wrote:
>> 
>> Although this looks fine, I can't get it working. I mean: it works the 
>> first time, but in the second iteration the job is killed directly, even 
>> if there is no h_rt attached at all (or set in the queue definition).
>> 
>> It looks like SGE checks whether there was already a warning and, if so, 
>> directly issues a SIGKILL. On the one hand this is wrong, of course. But 
>> it's certainly a matter for discussion: is s_rt/h_rt per iteration or for 
>> the overall job time? (maybe: queue = per iteration, resource request = 
>> overall time?)
>> 
>> I see only the option to do this outside of SGE: run `qstatus -r` *) once 
>> in a while to get the runtime per job and take appropriate measures, i.e. 
>> execute `qmod -sj <job_id>` as you intended.
>> 
>> -- Reuti
>> 
>> *) It's necessary to change the awk script to get the raw output instead 
>> of the formatted time in the "(relative)" case:
>> 
>> starttime=sprintf("%s", running_seconds)
>> 
> 

