Hi all.

I am not sure if this is a bug or the way Grid Engine works.

We have several queues that our users submit jobs to. One of them, "free64", has a 3-day wall-clock limit:

$ qconf -sq free64 | grep "_rt"
s_rt                  72:00:00
h_rt                  72:05:00
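
For reference, here are those two limits converted to seconds. This is only a small sketch: hms_to_seconds is a helper name I made up, not a Grid Engine command, and it assumes the value is always in HH:MM:SS form.

```shell
# hms_to_seconds: hypothetical helper, not part of Grid Engine.
# Converts a time value of the form HH:MM:SS into seconds.
# awk is used so that fields such as "08" are read as decimal numbers.
hms_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

hms_to_seconds 72:00:00   # s_rt on free64, prints 259200
hms_to_seconds 72:05:00   # h_rt on free64, prints 259500
```

So the hard limit is 300 seconds (5 minutes) past the 3-day soft limit.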

Another queue, "bio", does not:

$ qconf -sq bio | grep "_rt"
s_rt                  INFINITY
h_rt                  INFINITY

When a user submits a job to both queues with "-q free64,bio", any job that runs longer than 3 days is killed, whether it lands on the "free64" or the "bio" queue. Why are jobs that land on the "bio" queue being killed after 3 days?

The jobs also use a Grid Engine checkpointing environment named "restart":

$ qconf -sckpt restart
ckpt_name          restart
interface          USERDEFINED
ckpt_command       NONE
migr_command       NONE
restart_command    NONE
clean_command      none
ckpt_dir           $SGE_O_WORKDIR
signal             usr1
when               xsr

Is checkpoint restart the cause of this? My guess is that a job that first landed on the free64 queue picked up the 3-day wall-clock limit and, when it was restarted on the bio queue, inherited that limit from free64. If that is what is happening, is it a bug? Is there a workaround?
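
One thing I can check is which queue each run of a job actually landed in. Below is a rough sketch of how I would do that. It assumes that the accounting file gets one record per run of a restarted job (I have not confirmed this) and that each "qacct -j" record lists the qname line before the ru_wallclock line. summarize_runs is a name I made up, and 12345 is a placeholder job id.

```shell
# summarize_runs: hypothetical helper, not part of Grid Engine.
# Reads `qacct -j` output on stdin and, for every accounting record,
# prints the queue name followed by the ru_wallclock value as reported.
summarize_runs() {
    awk '$1 == "qname"        { queue = $2 }
         $1 == "ru_wallclock" { print queue, $2 }'
}

# Usage, with a placeholder job id:
#   qacct -j 12345 | summarize_runs
```

If a killed job shows a record on free64 followed by one on bio, that would at least be consistent with my guess above.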

Joseph