We've got a job that was suspended via:
qmod -sj $jobid
that's continuing to run. The job consists of a Bash script, which in
turn submits other jobs in a loop, sleeping for 30 seconds after each iteration.
When I examine the job status on the node where it is executing via:
ps -e f | grep $jobid
I see that the process is sleeping (state "S"), which is expected
given the 'sleep 30' in the loop, rather than stopped (state "T"):
30559 ? SNs 0:02 | \_ /bin/bash /var/tmp/gridengine/8.1.6/default/spool/node-5-2/job_scripts/2367998
Indeed, the job is not suspended, as it keeps performing the action
inside the loop.
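
For comparison, here is a minimal sketch of what a process looks like once a stop signal has actually reached it. It assumes a Linux node with /proc mounted, and that SIGSTOP is the signal being delivered (which I believe is the SGE default for suspension). The proc_state helper is a name made up for this illustration and is not part of SGE. It works on a throwaway 'sleep' rather than on a real job.

```shell
#!/bin/bash
# Hypothetical helper: print the one-letter state (R, S, T, ...) of a PID,
# read from /proc/<pid>/status (Linux only).
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Demo on a throwaway process rather than on a real job.
sleep 60 &
demo_pid=$!

kill -STOP "$demo_pid"
sleep 1                                   # give the signal time to be delivered
echo "after SIGSTOP: $(proc_state "$demo_pid")"

kill -CONT "$demo_pid"
sleep 1
echo "after SIGCONT: $(proc_state "$demo_pid")"

# Clean up the demo process.
kill "$demo_pid"
```

A process that really received the stop signal should report "T" in the first line of output. Running proc_state against the job's shell PID (30559 above) right after the 'qmod -sj' would show whether the signal arrived at all.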
The problem can be consistently reproduced with a trivial job, such as:
------------------------
#! /bin/bash
i=0
while [ $i -le 100 ]
do
date
i=$((i + 1))
sleep 30
done
------------------------
Submitting that job to SGE and then running 'qmod -sj $jobid' once it
starts does not suspend the running job, even though 'qstat' shows
the job in the 's' (suspended) state.
We're not using any custom 'suspend_method' or changing the default
signals sent by SGE.
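
One way to double-check this is to read the relevant fields straight from the queue configuration. This is a minimal sketch, assuming the job ran in a queue named all.q (a placeholder only; substitute the real queue name). The suspend_settings helper is a name invented here, while 'qconf -sq' is the usual command for printing a queue's configuration.

```shell
#!/bin/bash
# Hypothetical helper: keep only the signal-related lines of a queue
# configuration read on stdin.
suspend_settings() {
    grep -E '^(suspend_method|resume_method|terminate_method|notify)[[:space:]]'
}

# QUEUE is an assumption; all.q is only a placeholder default.
QUEUE="${QUEUE:-all.q}"

if command -v qconf >/dev/null 2>&1; then
    # Print the queue configuration and filter it down to the signal settings.
    qconf -sq "$QUEUE" | suspend_settings
else
    echo "qconf not found; run this on a host with SGE installed" >&2
fi
```

Running it as 'QUEUE=some.q ./script.sh' for each queue the job could land in would show whether any of them overrides the suspend or resume method.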
Jobs that are suspended (due to subordinated queues) by SGE have never
shown this behavior.
Any suggestions about how to proceed with troubleshooting?
Thanks,
Mark