On 03.03.2014 at 19:28, Andrew Joplin wrote:

> That's a good idea - I'll try that next.  But first here's some additional 
> info from the qmaster messages file.  I'm running one of my longer jobs (with 
> qmake, and a couple hundred targets) now, and at the same time someone else 
> is running many jobs on the superordinate queue, which is suspending some of 
> my jobs (as it should).  I'm waiting to see if any are *not* being restarted, 
> but in the meantime I'm seeing a lot of these messages in qmaster messages:
> 
> [...] job failed on host assumedly after job because: can't read usage file 
> for job [...]
> 
> and
> 
> [...] job failed on host assumedly after job because: job died through signal 
> HUP (1)

Whoa, I have never seen this before, and I'm not aware that SGE sends SIGHUP 
anywhere by default.

Did you redefine the signals that are sent?

-- Reuti
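
One possible way to check this, as a rough sketch: the queue configuration 
printed by "qconf -sq" lists suspend_method, resume_method and 
terminate_method, each of which is NONE unless it was redefined. The helper 
name redefined_methods and the queue name low.q are only placeholders, not 
taken from the setup described in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `qconf -sq <queue>` output on stdin and prints
# the *_method attributes whose value differs from the default NONE.
redefined_methods() {
    awk '$1 ~ /^(suspend|resume|terminate)_method$/ && $2 != "NONE" { print }'
}

# Assumed usage on a host with the SGE client tools (low.q is a placeholder):
#   qconf -sq low.q | redefined_methods
```

No output would mean that none of the three methods was changed for that 
queue.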


> Finally, I also see this message:
> 
> [...] Jobs 3874441 & 3875448 dispatched to master/subordinated queues [...]  
> Suspend on subordinate to occur in same scheduling interval.  Policy conflict!
> 
> The later job is my main qmake job.
> 
> Andrew Joplin
> 
> 
> On 03/01/2014 06:23 AM, Reuti wrote:
>> Hi,
>> 
>> On 28.02.2014 at 00:28, Andrew Joplin wrote:
>> 
>>> New member here with a couple questions - they're unrelated, so I'll make 
>>> separate posts.
>>> 
>>> First off, we're running grid engine version OGS/GE 2011.11.  I recently 
>>> finished setting up a hierarchy of three queues - high, medium, and low 
>>> priority.  Medium is subordinate to high, and low to medium.  The queues 
>>> span multiple hosts, but are all configured identically except for the 
>>> subordination (and a complex that I use to specify which queue to get into).
>>> 
>>> For the most part, this works great - I can submit a large number of long 
>>> jobs to the low priority queue, and they get suspended whenever someone 
>>> else uses the medium priority queue.  But the first problem I'm running 
>>> into is that occasionally, the suspended jobs don't seem to be restarted.  
>>> According to qstat, they have been (status "r"), but when I check the 
>>> corresponding process on the execute host, I see a process status "T", as 
>>> if the SIGCONT signal was never sent.  I can manually send a SIGCONT to the 
>>> job, and it finishes processing, but otherwise it does nothing until I 
>>> notice it (usually next day).  Other times a job will show a status "r" in 
>>> qstat, but I can't even find the process on the host it's supposed to be on.
>>> 
>>> Has anyone seen this behavior before?  I've tried recreating the problem, 
>>> but I can't seem to reliably reproduce it.  It seems to just happen 
>>> "sometimes" when one of my long jobs gets suspended.
>> One way to investigate it: set a custom "resume_method" in the queue 
>> definition and record whether it was called or not. In that method, the 
>> SIGCONT needs to be sent to the complete process group (the negated PID 
>> addresses the whole group):
>> 
>> kill -CONT -- -$1
>> 
>> where parameter $1 is $job_pid from the pseudo variables for these interfaces.
>> 
>> -- Reuti
> 
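
The logging resume_method suggested above might look roughly like the 
following. This is only a sketch: the script path, the RESUME_LOG variable, 
the default log location and the function name log_and_resume are all 
assumptions. It records each call and then sends SIGCONT to the process 
group, using a negated PID so that the whole group is addressed.

```shell
#!/bin/sh
# Sketch of a logging resume_method (hypothetical script, not part of SGE).
# Assumed queue setting:
#   resume_method /path/to/resume_logged.sh $job_pid $job_id

# log_and_resume PID JOB_ID LOGFILE
# Appends a timestamped line to LOGFILE, then sends SIGCONT to the process
# group led by PID and logs whether kill succeeded. Returns kill's status.
log_and_resume() {
    pid=$1
    job_id=$2
    logfile=$3
    stamp=$(date '+%Y-%m-%d %H:%M:%S')

    printf '%s resume_method called: job=%s pid=%s\n' \
        "$stamp" "$job_id" "$pid" >> "$logfile"

    # A negative PID makes kill address the whole process group.
    status=0
    kill -s CONT -- "-$pid" 2>/dev/null || status=$?

    if [ "$status" -eq 0 ]; then
        printf '%s SIGCONT sent to process group %s\n' \
            "$stamp" "$pid" >> "$logfile"
    else
        printf '%s kill failed (status %s) for process group %s\n' \
            "$stamp" "$status" "$pid" >> "$logfile"
    fi
    return "$status"
}

# Act only when called with arguments, as the queue would call it.
if [ "$#" -ge 1 ]; then
    log_and_resume "$1" "${2:-unknown}" "${RESUME_LOG:-/tmp/sge_resume.log}"
fi
```

If a job later sits in state "T" while qstat shows "r", a missing 
"resume_method called" line for it in the log would suggest the method was 
never invoked, whereas a "kill failed" line would point at signal delivery 
instead.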


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
