First off, we're running Grid Engine version OGS/GE 2011.11. I recently finished setting up a hierarchy of three queues - high, medium, and low priority. Medium is subordinate to high, and low is subordinate to medium. The queues span multiple hosts, but are all configured identically except for the subordination (and a complex that I use to specify which queue a job goes into).
For the most part, this works great - I can submit a large number of long jobs to the low priority queue, and they get suspended whenever someone else uses the medium priority queue. But the first problem I'm running into is that occasionally the suspended jobs don't seem to be resumed. According to qstat they have been (status "r"), but when I check the corresponding process on the execution host, I see process status "T", as if the SIGCONT signal was never sent. If I manually send a SIGCONT to the job, it finishes processing, but otherwise it does nothing until I notice it (usually the next day). Other times a job will show status "r" in qstat, but I can't find the process at all on the host it's supposed to be running on.
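To make the manual check above concrete, here is a minimal sketch of how stopped processes could be spotted on an execution host. It is not part of Grid Engine; the function names (`find_stopped`, `list_stopped_on_this_host`, `resume`) are made up for illustration, and it assumes a Linux-style `ps` that accepts `-eo pid,stat,comm` and reports stopped processes with a state beginning with "T".

```python
import os
import signal
import subprocess


def find_stopped(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'T'.

    Expects text in the layout of `ps -eo pid,stat,comm`: one header line,
    then one process per line.
    """
    stopped = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        pid, stat, comm = fields
        if stat.startswith("T"):  # 'T' = stopped by a job-control signal
            stopped.append((int(pid), comm.strip()))
    return stopped


def list_stopped_on_this_host():
    """Run ps locally and return the stopped processes it reports."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_stopped(result.stdout)


def resume(pid):
    """Send SIGCONT to one process (hypothetical helper).

    Only sensible for a job that qstat already reports as running; a job
    that is legitimately suspended should be left alone.
    """
    os.kill(pid, signal.SIGCONT)
```

Running `list_stopped_on_this_host()` on the execution host and comparing the result with the jobs qstat shows in state "r" for that host would flag the stuck ones. It does not explain why the SIGCONT went missing; it only shortens the time until someone notices.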
Has anyone seen this behavior before? I've tried recreating the problem, but I can't seem to reliably reproduce it. It seems to just happen "sometimes" when one of my long jobs gets suspended.
Thanks! -- Andrew Joplin
