That's a good idea - I'll try that next. But first, here's some additional info from the qmaster messages file. I'm running one of my longer jobs (with qmake, and a couple hundred targets) now, and at the same time someone else is running many jobs on the superordinate queue, which is suspending some of my jobs (as it should). I'm waiting to see if any are *not* being restarted, but in the meantime I'm seeing a lot of these messages in the qmaster messages file:

[...] job failed on host assumedly after job because: can't read usage file for job [...]

and

[...] job failed on host assumedly after job because: job died through signal HUP (1)

Finally, I also see this message:

[...] Jobs 3874441 & 3875448 dispatched to master/subordinated queues [...] Suspend on subordinate to occur in same scheduling interval. Policy conflict!

The latter job is my main qmake job.

Andrew Joplin


On 03/01/2014 06:23 AM, Reuti wrote:
Hi,

On 28.02.2014 at 00:28, Andrew Joplin wrote:

New member here with a couple questions - they're unrelated, so I'll make 
separate posts.

First off, we're running grid engine version OGS/GE 2011.11.  I recently 
finished setting up a hierarchy of three queues - high, medium, and low 
priority.  Medium is subordinate to high, and low to medium.  The queues span 
multiple hosts, but are all configured identically except for the subordination 
(and a complex that I use to specify which queue to get into).

For the most part, this works great - I can submit a large number of long jobs to the low priority queue, and 
they get suspended whenever someone else uses the medium priority queue.  But the first problem I'm running 
into is that occasionally, the suspended jobs don't seem to be restarted.  According to qstat, they have been 
(status "r"), but when I check the corresponding process on the execute host, I see a process 
status "T", as if the SIGCONT signal was never sent.  I can manually send a SIGCONT to the job, and 
it finishes processing, but otherwise it does nothing until I notice it (usually next day).  Other times a 
job will show a status "r" in qstat, but I can't even find the process on the host it's supposed to 
be on.

Has anyone seen this behavior before?  I've tried recreating the problem, but I can't 
seem to reliably reproduce it.  It seems to just happen "sometimes" when one of 
my long jobs gets suspended.

One way to investigate it: set a custom "resume_method" in the queue 
definition and record whether it was called or not (therein the SIGCONT needs to be 
sent to the complete process group):

kill -CONT -- -$1

and parameter $1 is $job_pid from the pseudo variables for these interfaces.
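
A minimal sketch of such a wrapper, in case it helps. The script path, the log file location, the RESUME_LOG variable, and the extra $job_id argument are assumptions for illustration only; the queue entry is assumed to look like "resume_method /usr/local/sge/resume.sh $job_pid $job_id". It writes one line when it is called and one with the exit status of kill, so a missing entry for a job that stayed in state "T" would suggest the method was never invoked.

```shell
#!/bin/sh
# Hypothetical resume_method wrapper (path, log file and argument order are assumptions).
# Assumed queue configuration entry:
#   resume_method /usr/local/sge/resume.sh $job_pid $job_id

# Log file location is an assumption; override via the environment if needed.
RESUME_LOG="${RESUME_LOG:-/tmp/sge_resume.log}"

resume_job() {
    job_pid="$1"
    job_id="${2:-unknown}"

    # Record that the resume method was actually invoked for this job.
    echo "$(date '+%Y-%m-%d %H:%M:%S') resume_method called: job_id=$job_id job_pid=$job_pid" >> "$RESUME_LOG"

    # A negative PID addresses the whole process group, not just the leader.
    # "-s CONT" is the POSIX spelling of "-CONT".
    kill -s CONT -- "-$job_pid"
    rc=$?

    # Record whether the signal could be delivered.
    echo "$(date '+%Y-%m-%d %H:%M:%S') kill -s CONT exit status=$rc: job_id=$job_id job_pid=$job_pid" >> "$RESUME_LOG"
    return "$rc"
}

# Only act when arguments were passed in.
if [ "$#" -gt 0 ]; then
    resume_job "$@"
fi
```

Comparing the timestamps in this log with the suspend/unsuspend times of the superordinate queue should show whether the resume was requested at all, or requested but failed.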

-- Reuti
