Hi all.
I am not sure if this is a bug or the way Grid Engine works.
We have several queues that our users submit jobs to. One of them,
"free64", has a 3-day wall-clock limit:
$ qconf -sq free64 | grep "_rt"
s_rt 72:00:00
h_rt 72:05:00
Another queue, "bio", has no such limit:
$ qconf -sq bio | grep "_rt"
s_rt INFINITY
h_rt INFINITY
When a user submits a job to both queues with "-q free64,bio", jobs that
run longer than 3 days are killed whether they land on the "free64" or
the "bio" queue. Why are jobs that land on the "bio" queue being killed
after 3 days?
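To compare a killed job's run time against these limits, it helps to have the limits in seconds. This is only a minimal sketch: rt_to_seconds is a hypothetical helper written for this post, not a Grid Engine command, and it assumes the limit is written as HH:MM:SS, plain seconds, or INFINITY, as in the qconf output above.

```shell
#!/bin/sh
# Convert a Grid Engine time limit (HH:MM:SS, plain seconds, or INFINITY)
# to seconds. rt_to_seconds is a hypothetical helper, not a GE command.
rt_to_seconds() {
    case "$1" in
        INFINITY|infinity)
            echo INFINITY ;;
        *:*:*)
            # Split on ":" and add up hours, minutes and seconds.
            echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }' ;;
        *)
            # Assume the value is already given in seconds.
            echo "$1" ;;
    esac
}

rt_to_seconds 72:00:00   # s_rt of free64 -> 259200
rt_to_seconds 72:05:00   # h_rt of free64 -> 259500
rt_to_seconds INFINITY   # s_rt and h_rt of bio -> INFINITY
```

A job killed on "bio" with a wall-clock time close to 259200 or 259500 seconds would suggest that the free64 limits are being applied there.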
The jobs also use a GE checkpointing environment named "restart":
$ qconf -sckpt restart
ckpt_name restart
interface USERDEFINED
ckpt_command NONE
migr_command NONE
restart_command NONE
clean_command none
ckpt_dir $SGE_O_WORKDIR
signal usr1
when xsr
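For readers not familiar with the "when" field, here is a small sketch that spells out the letters in "xsr". explain_when is a hypothetical helper, and the descriptions are paraphrased from my reading of the checkpoint(5) man page, so please check them against the man page of your own installation.

```shell
#!/bin/sh
# Print one line per letter of a checkpoint "when" string.
# explain_when is a hypothetical helper; the meanings are paraphrased from
# checkpoint(5) as I understand it and may differ between GE versions.
explain_when() {
    flags=$1
    while [ -n "$flags" ]; do
        rest=${flags#?}          # everything after the first character
        letter=${flags%"$rest"}  # the first character
        case "$letter" in
            s) echo "s: checkpoint/abort when sge_execd on the host is shut down" ;;
            m) echo "m: checkpoint periodically at the queue's min_cpu_interval" ;;
            x) echo "x: checkpoint/abort when the job is suspended" ;;
            r) echo "r: reschedule when the host goes into unknown state" ;;
            *) echo "$letter: unknown letter" ;;
        esac
        flags=$rest
    done
}

explain_when xsr
```

If that reading is right, the "x" and "s" cases abort the job and the "r" case reschedules it, so in all three the job can start again in a different queue than the one it first ran in.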
Is checkpoint restart the cause of this? My guess is that a job that
first landed on the free64 queue picked up the 3-day wall-clock limit
there and, when it was restarted on the bio queue, kept that limit.
If that is what is happening, is it a bug? Is there a workaround?
Joseph