Hi Reuti.
Yes, I had that already set:
qconf -sconf|fgrep execd_params
execd_params ENABLE_ADDGRP_KILL=TRUE
What is strange is that about 1 in 10 jobs does get killed just fine when it
goes past the hard wall clock limit.
However, the majority of the jobs are not killed when they go past their
wall clock limit.
How can I investigate this further?
On 10/30/2012 11:44 AM, Reuti wrote:
Hi,
On 30.10.2012 at 19:31, Joseph Farran wrote:
I googled this issue but did not find much help on the subject.
I have several queues with hard wall clock limits like this one:
# qconf -sq queue | grep h_rt
h_rt 96:00:00
I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall
clock limit and continue to run.
Looking at the GE qmaster logs, I see dozens and dozens of these entries:
10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
Maybe they jumped out of the process tree (usually jobs are killed by `kill -9
-- -pgrp`). You can kill them by their additional group ID, which is attached to
all started processes even if they executed something like `setsid`:
$ qconf -sconf
...
execd_params ENABLE_ADDGRP_KILL=TRUE
If it's still not working, we have to investigate the process tree.
HTH - Reuti
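As a starting point for investigating the process tree, the following is a
minimal sketch that lists the processes carrying a given supplementary group ID.
It assumes a Linux execution host with /proc mounted. The function name
`procs_with_gid` and the `PROC_ROOT` override are hypothetical helpers, not part
of Grid Engine, and the group ID of the job has to be looked up separately.

```shell
# procs_with_gid GID
# Print the PID of every process whose supplementary groups include GID.
# Hypothetical helper, not part of Grid Engine. Assumes Linux /proc.
# PROC_ROOT can point to another directory (useful for testing).
procs_with_gid() {
    gid=$1
    proc_root=${PROC_ROOT:-/proc}
    for status in "$proc_root"/[0-9]*/status; do
        # Processes may exit while we iterate, so skip unreadable entries.
        [ -r "$status" ] || continue
        # The "Groups:" line lists the supplementary group IDs of the process.
        if awk -v gid="$gid" '
               $1 == "Groups:" { for (i = 2; i <= NF; i++) if ($i == gid) found = 1 }
               END { exit !found }' "$status" 2>/dev/null; then
            pid=${status%/status}
            echo "${pid##*/}"
        fi
    done
}
```

For example, with a made-up group ID of 20001, something like
`ps -o pid,ppid,pgid,sid,args -p "$(procs_with_gid 20001 | paste -sd, -)"`
would show whether the surviving processes have left the job's original process
group or session.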
These entries correspond to running jobs that should have ended when they
reached the 96-hour limit, but they keep on running.
Why is GE not killing these jobs when they run past the 96-hour limit, even
though it complains that they should have ended?
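To get an overview of which jobs are overdue and by how much, the warnings
quoted above can be summarized with a small filter. This is only a sketch: the
function name `overdue_jobs` is hypothetical, and it assumes the qmaster
messages use exactly the pipe-separated layout of the entry shown in this
thread.

```shell
# overdue_jobs
# Read qmaster log lines on stdin and print, for each
# "should have finished since" warning:
#   <job.task> <seconds overdue> <hours overdue>
# Hypothetical helper; assumes the pipe-separated layout quoted in this thread.
overdue_jobs() {
    awk -F'|' '
        $5 ~ /^job [0-9]+\.[0-9]+ should have finished since [0-9]+s/ {
            split($5, w, " ")        # w[2] = job.task, w[7] = e.g. "42318s"
            secs = w[7]
            sub(/s$/, "", secs)      # drop the trailing unit
            printf "%s %d %.1f\n", w[2], secs, secs / 3600
        }'
}
```

Fed the entry above, this prints `13179.1 42318 11.8`, i.e. the job is about
11.8 hours past its limit. On a default installation the input would typically
be the qmaster messages file, whose location depends on the local spool
configuration.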
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users