Dear all,

We are seeing the problem described in the subject on gridengine 2011.11.

Some processes finish their execution but they still appear as running in the
queue, and they keep consuming their slot. I have been looking for the source
of this problem and this is what I have found so far:

On the execution host that ran the job, there is no shepherd for this
process, and the trace file (which is deleted unless you set execd_params
keep_active=true) in <host_spool>/active_jobs/<jobid> looks like the one I have
attached. The only thing the affected jobs have in common is a

wait3 returned -1

in the trace file, which sets up a kill command to be performed. As the trace
shows, the process finishes "correctly", but <host_spool>/messages then starts
showing:

11/25/2013 07:16:00|  main|xxx012|W|job 312363.9 exceeded hard wallclock time - initiate terminate method
11/25/2013 07:16:00|  main|xxx012|W|failed to deliver signal 20 to job 312363.9 for KILL (shepherd with pid 18734): No such file or directory

until the job is deleted with "qdel -f".
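
In case it helps anyone check their own hosts, here is a minimal Python sketch
that lists the jobs for which the execd reports an undeliverable signal. It
assumes each entry in <host_spool>/messages sits on a single line in the
date|thread|host|level|text layout shown above (the wrapping in this mail comes
from the mail client). The names Entry, parse_line and
jobs_with_undelivered_signals are my own, not part of gridengine.

```python
import re
from typing import Iterable, NamedTuple, Optional, Set


class Entry(NamedTuple):
    timestamp: str
    thread: str
    host: str
    level: str
    text: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one messages line into its five '|'-separated fields."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None  # not in the assumed layout, skip it
    ts, thread, host, level, text = parts
    return Entry(ts, thread.strip(), host, level, text)


# Pattern taken from the warning quoted above; other versions may word it differently.
SIGNAL_FAIL = re.compile(r"failed to deliver signal (\d+) to job (\S+)")


def jobs_with_undelivered_signals(lines: Iterable[str]) -> Set[str]:
    """Return the job ids that appear in 'failed to deliver signal' warnings."""
    jobs = set()
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry.level != "W":
            continue
        match = SIGNAL_FAIL.search(entry.text)
        if match:
            jobs.add(match.group(2))
    return jobs


sample = [
    "11/25/2013 07:16:00|  main|xxx012|W|job 312363.9 exceeded hard wallclock time - initiate terminate method",
    "11/25/2013 07:16:00|  main|xxx012|W|failed to deliver signal 20 to job 312363.9 for KILL (shepherd with pid 18734): No such file or directory",
]
print(jobs_with_undelivered_signals(sample))  # {'312363.9'}
```

To run it against a real file, pass an open file object instead of the sample
list, since iterating over a file yields its lines.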

In <qmaster spool>/messages there are references to this job such as:

11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s
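
As a cross-check, subtracting the reported overrun from the timestamp of that
line gives the time at which the scheduler expected the job to be done. The
small Python sketch below does the arithmetic. It assumes the value after
"since" is the number of seconds elapsed since the expected end, and
expected_finish is just a name I made up for illustration.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

LINE = ("11/25/2013 10:11:41|schedu|mainnode|W|"
        "job 312363.9 should have finished since 10483s")


def expected_finish(line: str) -> Optional[datetime]:
    """Return log timestamp minus the overrun, or None if the line does not match."""
    stamp, _thread, _host, _level, text = line.split("|", 4)
    match = re.search(r"should have finished since (\d+)s", text)
    if match is None:
        return None
    logged = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    # Assumption: the number is an overrun in seconds relative to the log time.
    return logged - timedelta(seconds=int(match.group(1)))


print(expected_finish(LINE))  # 2013-11-25 07:16:58
```

The result, 07:16:58, is within a minute of the 07:16:00 wallclock warnings on
the execution host, so both logs seem to refer to the same limit.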

Do you have any hint as to what the problem could be?

Thanks in advance,

-- 
NiCo

Attachment: trace
Description: Binary data
