Hi,

On 28.02.2014 at 00:28, Andrew Joplin wrote:
> New member here with a couple questions - they're unrelated, so I'll make
> separate posts.
>
> First off, we're running grid engine version OGS/GE 2011.11. I recently
> finished setting up a hierarchy of three queues - high, medium, and low
> priority. Medium is subordinate to high, and low to medium. The queues span
> multiple hosts, but are all configured identically except for the
> subordination (and a complex that I use to specify which queue to get into).
>
> For the most part, this works great - I can submit a large number of long
> jobs to the low priority queue, and they get suspended whenever someone else
> uses the medium priority queue. But the first problem I'm running into is
> that occasionally, the suspended jobs don't seem to be restarted. According
> to qstat, they have been (status "r"), but when I check the corresponding
> process on the execute host, I see a process status "T", as if the SIGCONT
> signal was never sent. I can manually send a SIGCONT to the job, and it
> finishes processing, but otherwise it does nothing until I notice it (usually
> next day). Other times a job will show a status "r" in qstat, but I can't
> even find the process on the host it's supposed to be on.
>
> Has anyone seen this behavior before? I've tried recreating the problem, but
> I can't seem to reliably reproduce it. It seems to just happen "sometimes"
> when one of my long jobs gets suspended.

What can be done to investigate it: set a custom "resume_method" in the queue
definition and record whether it was called or not. Therein the SIGCONT needs
to be sent to the complete process group:

    kill -CONT -- -$1

where the parameter $1 is $job_pid from the pseudo variables available to these
interfaces. The negative PID is what addresses the whole process group instead
of the single process.

-- Reuti
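
A minimal sketch of such a resume wrapper, in case it helps. The script path, the log location, the function name log_and_resume and the extra $job_id argument are my own assumptions for illustration; only $job_pid and the kill call come from the suggestion above. It notes every invocation in a log file and then sends SIGCONT to the job's process group, so a missing log entry would mean the resume method was never called.

```shell
#!/bin/bash
# Hypothetical resume wrapper; path, log file and argument order are assumptions.
# Assumed queue setting:
#   resume_method  /path/to/resume_wrapper.sh $job_pid $job_id
# ($job_pid and $job_id are pseudo variables expanded by Grid Engine.)

RESUME_LOG="${RESUME_LOG:-/tmp/sge_resume.log}"   # assumed log location

log_and_resume() {
    job_pid="$1"
    job_id="${2:-unknown}"

    # Record that the resume method was actually invoked for this job.
    echo "$(date '+%Y-%m-%d %H:%M:%S') resume called: job_id=$job_id job_pid=$job_pid" >> "$RESUME_LOG"

    # A negative PID addresses the whole process group, not just the leader.
    if kill -CONT -- "-$job_pid" 2>> "$RESUME_LOG"; then
        rc=0
    else
        rc=$?
    fi

    # Record whether the signal could be delivered.
    echo "$(date '+%Y-%m-%d %H:%M:%S') kill -CONT -- -$job_pid exit status: $rc" >> "$RESUME_LOG"
    return "$rc"
}

# Only act when called with arguments, as the queue would do.
if [ "$#" -gt 0 ]; then
    log_and_resume "$@"
fi
```

The queue definition can be edited with `qconf -mq <queue>`, and the script has to be executable on every execution host of that queue. When a job is stuck in state "T" again, the log should show whether the resume method ran at all and whether the kill succeeded.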
