Dear all,

We are having the problem described in the subject on Grid Engine 2011.11.

Some processes finish their execution but still appear as running in the queue, and they keep occupying their slot. I have been looking for the source of this problem, and this is what I have found so far:

On the execution host that ran the process, there is no shepherd for it, and the trace file in <host_spool>/active_jobs/<jobid> (which is deleted unless you set execd_params KEEP_ACTIVE=true) looks like the one I have attached. The only common thing I have found is a "wait3 returned -1" entry in the trace file, which causes a kill command to be issued.

As the trace shows, the process finishes "correctly", but <host_spool>/messages starts showing:

11/25/2013 07:16:00|  main|xxx012|W|job 312363.9 exceeded hard wallclock time - initiate terminate method
11/25/2013 07:16:00|  main|xxx012|W|failed to deliver signal 20 to job 312363.9 for KILL (shepherd with pid 18734): No such file or directory

These messages keep appearing until the process is deleted with "-f".

In <qmaster spool>/messages there are references to these jobs such as:

11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s

Do you have any hint as to what the problem could be?

Thanks in advance,

--
NiCo
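P.S. For anyone who wants to check their own qmaster messages file for the same symptom, here is a minimal sketch in Python. It relies only on the format of the single "should have finished since" line quoted above; the function name overdue_jobs and the regular expression are illustrative assumptions, and other message variants (for example jobs logged without a task id) are not handled.

```python
import re

# Matches the qmaster warning quoted in this mail, e.g.
# 11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s
# Assumption: the job is always logged as <job_id>.<task_id>.
OVERDUE = re.compile(r"\|W\|job (\d+)\.(\d+) should have finished since (\d+)s")


def overdue_jobs(lines):
    """Return {(job_id, task_id): seconds_overdue}, keeping the largest value seen."""
    result = {}
    for line in lines:
        match = OVERDUE.search(line)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            seconds = int(match.group(3))
            result[key] = max(seconds, result.get(key, 0))
    return result


# The one line quoted from <qmaster spool>/messages.
sample = [
    "11/25/2013 10:11:41|schedu|mainnode|W|job 312363.9 should have finished since 10483s",
]
print(overdue_jobs(sample))  # prints {(312363, 9): 10483}
```

To run it against a real file, pass the open file object instead of the sample list, e.g. overdue_jobs(open(path)), where path points at your own <qmaster spool>/messages.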
Attachment: trace (binary data)