On 03.03.2014 at 12:59, Tina Friedrich wrote:

> I was about to ask a similar question; we have the same sort of setup - high,
> medium and low priority queues - and run into the same problem. It doesn't
> happen all the time, but occasionally a job will simply sit there
> suspended although it should have received a SIGCONT.

Were the signals recorded as being sent? For example:

03/03/2014 13:44:52| main|pc15370|I|SIGNAL jid: 11254 jatask: 1 signal: STOP
03/03/2014 13:44:57| main|pc15370|I|SIGNAL jid: 11254 jatask: 1 signal: CONT

-- Reuti

> Tina
>
> On 01/03/14 12:23, Reuti wrote:
>> Hi,
>>
>> On 28.02.2014 at 00:28, Andrew Joplin wrote:
>>
>>> New member here with a couple of questions - they're unrelated, so I'll
>>> make separate posts.
>>>
>>> First off, we're running grid engine version OGS/GE 2011.11. I recently
>>> finished setting up a hierarchy of three queues - high, medium, and low
>>> priority. Medium is subordinate to high, and low to medium. The queues
>>> span multiple hosts, but are all configured identically except for the
>>> subordination (and a complex that I use to specify which queue to get into).
>>>
>>> For the most part, this works great - I can submit a large number of long
>>> jobs to the low priority queue, and they get suspended whenever someone
>>> else uses the medium priority queue. But the first problem I'm running
>>> into is that occasionally, the suspended jobs don't seem to be resumed.
>>> According to qstat, they have been (status "r"), but when I check the
>>> corresponding process on the execute host, I see a process status "T", as
>>> if the SIGCONT signal was never sent. I can manually send a SIGCONT to the
>>> job, and it finishes processing, but otherwise it does nothing until I
>>> notice it (usually the next day). Other times a job will show a status "r"
>>> in qstat, but I can't even find the process on the host it's supposed to
>>> be on.
>>>
>>> Has anyone seen this behavior before? I've tried recreating the problem,
>>> but I can't seem to reliably reproduce it. It seems to just happen
>>> "sometimes" when one of my long jobs gets suspended.
>>
>> What can be done to investigate it: set a custom "resume_method" in the
>> queue definition and record whether it was called or not (therein the
>> SIGCONT needs to be sent to the complete process group):
>>
>> kill -CONT -- -$1
>>
>> where parameter $1 is $job_pid from the pseudo variables for these
>> interfaces.
>>
>> -- Reuti
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>
> --
> Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
> Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
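
The "resume_method" suggestion quoted above could be sketched as a small
wrapper script along the following lines. This is only a minimal sketch under
stated assumptions, not a tested production script: the function name
resume_job, the RESUME_LOG variable and the default log path
/tmp/sge_resume.log are all made up for illustration. It logs every
invocation and then sends SIGCONT to the whole process group. It uses the
"kill -s CONT" spelling rather than "kill -CONT" because some /bin/sh
implementations do not accept "--" after the short signal form.

```shell
#!/bin/sh
# Hypothetical resume_method wrapper (names and paths are assumptions).
# $1 is expected to be $job_pid, $2 (optional) $job_id, both passed in
# from the queue configuration.

resume_job() {
    pgid="$1"                                  # process group to resume
    job="${2:-unknown}"                        # only used in the log line
    log="${RESUME_LOG:-/tmp/sge_resume.log}"   # assumed log location

    # Record the call first, so the log shows it even if kill fails.
    echo "$(date '+%m/%d/%Y %H:%M:%S') resume_method job=$job pgid=$pgid" >> "$log"

    # A negative PID addresses the complete process group; "--" keeps the
    # leading "-" from being parsed as an option.
    kill -s CONT -- "-$pgid"
    rc=$?

    # Record whether the signal could actually be delivered.
    echo "$(date '+%m/%d/%Y %H:%M:%S') kill -s CONT exit status=$rc job=$job" >> "$log"
    return "$rc"
}

# Run only when called with arguments, as the queue would do.
if [ "$#" -gt 0 ]; then
    resume_job "$@"
fi
```

Assuming the script is installed at a hypothetical path such as
/path/to/resume_wrapper.sh, the queue definition would then contain something
like:

    resume_method /path/to/resume_wrapper.sh $job_pid $job_id

If a job is later found in state "T" and the log has no entry for it, the
resume method was never called. If there is an entry, the recorded exit
status shows whether the signal could be delivered.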
