Re-reading the man page one more time made me think that this is the
desired and logical behavior: if the job id remains the same, the h_rt
and s_rt counters cannot be reset. The job starts only once, and
execution *continues* after rescheduling:
"RESOURCE LIMITS
The first two resource limit parameters, s_rt and h_rt, are
implemented by Grid Engine. They define the "real time" or also called
"elapsed" or "wall clock" time
*having passed since the start of the job*...'
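
So if the wall clock is to start from zero again, the job apparently has
to come back under a new job id, i.e. via something like Reuti's qresub
suggestion quoted below rather than via exit code 99. A rough, untested
sketch (the settings.sh path and the exit code 4 are simply taken from
Reuti's snippet, adjust for your site):

#!/bin/bash
#$ -S /bin/bash
#$ -l s_rt=0:0:5,h_rt=0:0:10
#$ -j y

# On the soft limit (SIGUSR1), submit a fresh copy of this job: the copy
# gets a new job id and therefore fresh s_rt/h_rt counters. Then exit
# normally instead of asking to be rescheduled with exit 99.
. /usr/sge/default/common/settings.sh
trap "qresub $JOB_ID; exit 4" SIGUSR1

echo "hello world"
sleep 15
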
Ilya.
On Mon, Jun 11, 2018 at 9:57 AM, Reuti <[email protected]> wrote:
>
> > On 11.06.2018 at 18:43, Ilya M <[email protected]> wrote:
> >
> > Hello,
> >
> > Thank you for the suggestion, Reuti. I am not sure whether my users'
> > pipelines can deal with multiple job ids; perhaps they will be willing
> > to modify their code.
>
> Other commands in SGE, like `qdel`, also allow using the job name to
> deal with such a configuration.
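>
> For example (assuming the job name is accepted wherever a job id would
> go, as the qdel man page suggests):
>
>     qdel reshed_test.sh
>
> would remove every pending or running job submitted under that name.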
>
>
> > On Mon, Jun 11, 2018 at 9:23 AM, Reuti <[email protected]> wrote:
> > Hi,
> >
> >
> > I wouldn't be surprised if the execd remembers that the job was already
> > warned, hence it must be the hard limit now. Would your workflow allow:
> >
> > This is happening on different nodes, so each execd cannot know any
> > history by itself; the master must be providing this information.
>
> Aha, you are correct.
>
> -- Reuti
>
>
> > Can't help wondering if this is a configurable option.
> >
> > Ilya.
> >
> >
> >
> > . /usr/sge/default/common/settings.sh
> > trap "qresub $JOB_ID; exit 4;" SIGUSR1
> >
> > Well, you get several job numbers this way. For the accounting with
> > `qacct` you could use the job name instead of the job number, though,
> > to get all the runs listed.
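> >
> > For example, something like
> >
> >     qacct -j reshed_test.sh
> >
> > should list the accounting records of every run submitted under that
> > job name (assuming `qacct -j` in 6.2u5 accepts a job name, as its man
> > page indicates).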
> >
> > -- Reuti
> >
> >
> > > This is my test script:
> > >
> > > #!/bin/bash
> > >
> > > #$ -S /bin/bash
> > > #$ -l s_rt=0:0:5,h_rt=0:0:10
> > > #$ -j y
> > >
> > > set -x
> > > set -e
> > > set -o pipefail
> > > set -u
> > >
> > > trap "exit 99" SIGUSR1
> > >
> > > trap "exit 2" SIGTERM
> > >
> > > echo "hello world"
> > >
> > > sleep 15
> > >
> > > It should reschedule itself indefinitely when s_rt lapses. Yet the
> > > rescheduling happens only once: on the second run the job receives
> > > only SIGTERM and exits. Here is the script's output:
> > >
> > > node140
> > > + set -e
> > > + set -o pipefail
> > > + set -u
> > > + trap 'exit 99' SIGUSR1
> > > + trap 'exit 2' SIGTERM
> > > + echo 'hello world'
> > > hello world
> > > + sleep 15
> > > User defined signal 1
> > > ++ exit 99
> > > node069
> > > + set -e
> > > + set -o pipefail
> > > + set -u
> > > + trap 'exit 99' SIGUSR1
> > > + trap 'exit 2' SIGTERM
> > > + echo 'hello world'
> > > hello world
> > > + sleep 15
> > > Terminated
> > > ++ exit 2
> > >
> > > The execd logs confirm that the second time around the job was
> > > killed for exceeding h_rt:
> > >
> > > 06/08/2018 21:20:15| main|node140|W|job 8030395.1 exceeded soft wallclock time - initiate soft notify method
> > > 06/08/2018 21:20:59| main|node140|E|shepherd of job 8030395.1 exited with exit status = 25
> > >
> > > 06/08/2018 21:21:45| main|node069|W|job 8030395.1 exceeded hard wallclock time - initiate terminate method
> > >
> > > And here is the accounting information:
> > >
> > > ==============================================================
> > > qname short.q
> > > hostname node140
> > > group everyone
> > > owner ilya
> > > project project.p
> > > department defaultdepartment
> > > jobname reshed_test.sh
> > > jobnumber 8030395
> > > taskid undefined
> > > account sge
> > > priority 0
> > > qsub_time Fri Jun 8 21:19:40 2018
> > > start_time Fri Jun 8 21:20:09 2018
> > > end_time Fri Jun 8 21:20:15 2018
> > > granted_pe NONE
> > > slots 1
> > > failed 25 : rescheduling
> > > exit_status 99
> > > ru_wallclock 6
> > > ...
> > > ==============================================================
> > > qname short.q
> > > hostname node069
> > > group everyone
> > > owner ilya
> > > project project.p
> > > department defaultdepartment
> > > jobname reshed_test.sh
> > > jobnumber 8030395
> > > taskid undefined
> > > account sge
> > > priority 0
> > > qsub_time Fri Jun 8 21:19:40 2018
> > > start_time Fri Jun 8 21:21:39 2018
> > > end_time Fri Jun 8 21:21:50 2018
> > > granted_pe NONE
> > > slots 1
> > > failed 0
> > > exit_status 2
> > > ru_wallclock 11
> > > ...
> > >
> > >
> > > Is there anything in the configuration I could be missing? Running 6.2u5.
> > >
> > > Thank you,
> > > Ilya.
> > >
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users