On 01.11.2013 at 19:29, Reuti wrote:
> Hi,
>
> On 01.11.2013 at 19:18, Joseph Farran wrote:
>
>> Yes, after going through the logs, the subsequent restarts are messed up.
>>
>> I've played with it more and there is no easy way to do this inside the job
>> submission script,
>
> Inside the submission script it's possible - I thought you were looking to
> get it implemented in SGE (but the user has to take care of it [i.e. trust
> the users] - or using a "startup_method"):
>
> #!/bin/sh
> . /usr/sge/default/common/settings.sh
> { sleep 172800; qmod -sj $JOB_ID; } &
Interesting: I expected "{ list; }" to save a subshell process, but it looks
like the opposite is true:
"(list) &" will be a direct child of the running bash
"{ list; } &" will create an additional bash instance
Hence, using "(list) &" might be better suited here.
-- Reuti
> ./my_application
>
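A minimal sketch of the submission-script approach using the "(list) &" form
suggested above. The helper name start_watchdog and the variable WATCHDOG_PID
are assumptions made for illustration and are not part of SGE; `qmod -sj` and
$JOB_ID are used as in the thread.

```shell
#!/bin/sh
# start_watchdog DELAY COMMAND [ARGS...]
# Runs COMMAND after DELAY seconds from a background subshell and stores
# the subshell's PID in WATCHDOG_PID (hypothetical helper, not an SGE feature).
start_watchdog() {
    delay=$1
    shift
    # "(list) &" form: the subshell is a direct child of the running shell.
    ( sleep "$delay"; "$@" ) &
    WATCHDOG_PID=$!
}

# Intended use inside a job script (commented out so the sketch can be
# sourced without SGE being available):
#   . /usr/sge/default/common/settings.sh
#   start_watchdog 172800 qmod -sj "$JOB_ID"
#   ./my_application
#   kill "$WATCHDOG_PID" 2>/dev/null   # stop the watchdog if the job ends early
```

Killing the subshell this way may leave its `sleep` child running until it
times out on its own; that is harmless here, since nothing is executed once
the subshell is gone.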
>
>> so I will have to resort (as you indicated) to using an outside script that
>> runs periodically and does a "qmod -sj" on the job / job.task-id when near
>> the s_rt value.
>>
>> It seems to me that Grid Engine is missing an option in the checkpoint
>> environment to detect when the s_rt value has been reached and then trigger
>> the equivalent of a suspension ("qmod -sj").
>
> Yes. I would call it runtime-interval inside the checkpoint definition or
> so, to distinguish it from s/h_rt.
>
> -- Reuti
>
>
>> Best,
>> Joseph
>>
>> On 10/31/2013 04:23 PM, Reuti wrote:
>>>
>>> Although this looks fine, I can't get it working. I mean: it's working for
>>> the first time, but in the second iteration the job is killed directly even
>>> if there is no h_rt attached at all (or set in the queue definition).
>>>
>>> It looks like SGE is checking whether there was any warning already and, if
>>> so, directly issues a SIGKILL - on the one hand this is wrong, of course.
>>> But it's certainly a matter for discussion: is s_rt/h_rt per iteration or
>>> for the overall job time? (maybe: queue = per iteration, resource request =
>>> overall time?)
>>>
>>> I see only the option to do this outside of SGE: run `qstatus -r`*) once
>>> in a while to get the runtime per job and take appropriate measures,
>>> i.e. execute `qmod -sj <job_id>` as you intended.
>>>
>>> -- Reuti
>>>
>>> *) It's necessary to make a change to the awk script to get the raw output
>>> instead of the formatted time in the "(relative)" case:
>>>
>>> starttime=sprintf("%s", running_seconds)
>>>
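With the raw running seconds available, the outside check described above
could be sketched roughly as follows. The function name near_limit, the
default margin of 300 seconds and the "<job_id> <running_seconds>" input
format are assumptions for illustration; the actual output of the reporting
script would have to be adapted to it.

```shell
#!/bin/sh
# near_limit RUNNING_SECONDS LIMIT [MARGIN]
# Succeeds (exit status 0) when the job has run to within MARGIN seconds
# of LIMIT. MARGIN defaults to 300 seconds (an arbitrary choice).
near_limit() {
    running=$1
    limit=$2
    margin=${3:-300}
    [ "$running" -ge $((limit - margin)) ]
}

# Hypothetical use from cron, reading lines of "<job_id> <running_seconds>"
# on standard input (commented out, as it needs a running SGE):
#   while read -r job secs; do
#       near_limit "$secs" 172800 600 && qmod -sj "$job"
#   done
```

The margin is there because the check only runs once in a while, so the
suspension has to be triggered somewhat before the limit is actually reached.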
>>
>
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users