Hi Reuti.

Yes, I had that already set:

qconf -sconf|fgrep execd_params
execd_params                 ENABLE_ADDGRP_KILL=TRUE

What is strange is that about 1 in 10 jobs does get killed just fine when 
it goes past the hard wall clock limit.

However, the majority of the jobs are not killed when they go past their 
wall clock limit.

How can I investigate this further?



On 10/30/2012 11:44 AM, Reuti wrote:
Hi,

Am 30.10.2012 um 19:31 schrieb Joseph Farran:

I googled this issue but did not find much help on the subject.

I have several queues with hard wall clock limits like this one:

# qconf -sq queue  | grep h_rt
h_rt                  96:00:00

I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall 
clock limit and continue to run.

Looking at the GE qmaster logs, I see dozens and dozens of these entries:

    10/30/2012 11:23:10|schedu|hpc|W|job 13179.1 should have finished since 42318s
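
To see which jobs are affected, the warnings can be pulled out of the qmaster messages file. This is only a sketch: it assumes the pipe-separated layout of the entry quoted above (date, thread, host, level, message), and the function name overdue_jobs is made up for illustration. The path of the messages file depends on the installation and has to be supplied by the caller.

```sh
#!/bin/sh
# overdue_jobs (hypothetical helper): print "<job.task> <seconds overdue>"
# for each "should have finished since" warning in a qmaster messages file.
# Reads the files given as arguments, or standard input if none are given.
overdue_jobs() {
    awk -F'|' '
        # Field 4 is the log level, field 5 the message text.
        $4 == "W" && $5 ~ /^job [0-9.]+ should have finished since/ {
            split($5, w, " ")      # w[2] = job id, w[7] = e.g. "42318s"
            secs = w[7]
            sub(/s$/, "", secs)    # drop the trailing "s"
            print w[2], secs
        }
    ' "$@"
}
```

Called as `overdue_jobs /path/to/qmaster/messages | sort -u -k1,1`, this gives one line per overdue job, which is easier to compare against `qstat` output than the raw log.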
Maybe they jumped out of the process tree (usually jobs are killed by `kill -9 
-- -pgrp`). You can kill them by their additional group ID, which is attached to 
all started processes even if they executed something like `setsid`:

$ qconf -sconf
...
execd_params                 ENABLE_ADDGRP_KILL=TRUE

If it's still not working, we have to investigate the process tree.
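
One way to start looking at the process tree on an execution host is to list every process that still carries the job's additional group ID. The following is a minimal sketch, assuming a Linux host where /proc/<pid>/status contains a "Groups:" line. The function name pids_with_gid and its optional second argument are invented for illustration, and the numeric GID has to be looked up separately (execd is commonly reported to record it in the job's active_jobs spool directory, but treat that location as an assumption).

```sh
#!/bin/sh
# pids_with_gid (hypothetical helper): print the PIDs of all processes whose
# supplementary groups include the given numeric GID.
#   $1 = numeric group ID to look for
#   $2 = proc directory (defaults to /proc; overridable for testing)
pids_with_gid() {
    gid="$1"
    procdir="${2:-/proc}"
    for status in "$procdir"/[0-9]*/status; do
        # A process may exit between the glob and the read; skip it.
        [ -r "$status" ] || continue
        if awk -v gid="$gid" '
               $1 == "Groups:" {
                   for (i = 2; i <= NF; i++) if ($i == gid) found = 1
               }
               END { exit !found }
           ' "$status" 2>/dev/null; then
            pid=${status%/status}   # strip the trailing "/status"
            echo "${pid##*/}"       # keep only the PID component
        fi
    done
}
```

For example, `pids_with_gid 20017` (20017 being a placeholder) lists the matching PIDs, and `ps -o pid,ppid,pgid,sid,args -p <pid>` on each of them shows whether the process has left the job's original process group or session.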

HTH - Reuti


These entries correspond to the running jobs that should have ended 96 hours 
ago, but they keep on running.

Why is GE not killing these jobs when they run past the 96 hour limit, 
while still complaining that they should have ended?






_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

