Re: [gridengine users] Rescheduled and held job running zombie in compute node

Reuti Wed, 26 Nov 2014 02:24:09 -0800

On 26.11.2014 at 08:23, Guillermo Marco Puche wrote:

> On 26/11/14 00:42, Reuti wrote:
>> Hi,
>> 
>> On 25.11.2014 at 23:28, Guillermo Marco Puche wrote:
>> 
>>> I'm experiencing a very weird issue and have no idea how to deal with it.
>>>     • I've submitted multiple jobs, e.g. job1, job2, job3.
>>>     • The jobs are running on multiple compute nodes.
>>>     • I've put the jobs on user hold and then rescheduled them.
>>>     • The jobs are now in the hqR state in the SGE job pool (they're 
>>> supposed to stay there and free their slots and resources on their 
>>> respective compute nodes).
>>>     • The compute nodes that previously ran these jobs continue to execute 
>>> the job processes and consume resources (I can see them with htop on the 
>>> compute node).
>> But they are gone from `qstat` and not listed twice?
> Nope, they're listed once in qstat.
>> 
>> 
>>> So what's the correct way to pause/restart a job and hold it in the SGE 
>>> pool without it holding resources?
>> Are these processes still bound to the execd and the shepherd of SGE or did 
>> they jump out of the process tree compared to the time when they were 
>> running initially?
> Yes, the processes are still bound to the execd and the shepherd of SGE.


Which version of SGE are you using? After issuing `qmod -rj <jobid>` they 
should be gone of course.
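
To check on the compute node whether a leftover process still hangs below a 
shepherd, something along the following lines could be used. It is only a 
sketch: `is_descendant_of` is a made-up helper name, not part of SGE, and it 
assumes a POSIX `ps` that accepts `-o ppid=`. The shepherd process name in the 
usage comment (`sge_shepherd`) is also an assumption and may differ by version.

```shell
#!/bin/sh
# Sketch only: walk up the parent chain of a PID and report whether a given
# ancestor PID is reached. Returns 0 if PID is the ancestor itself or one of
# its descendants, 1 otherwise.
# is_descendant_of is a hypothetical helper, not an SGE command.
is_descendant_of() {
    pid=$1
    ancestor=$2
    while [ -n "$pid" ] && [ "$pid" -gt 1 ]; do
        [ "$pid" -eq "$ancestor" ] && return 0
        # "ppid=" prints the parent PID without a header line
        pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
    done
    return 1
}

# Possible usage on the compute node (job_pid is the PID seen in htop;
# the shepherd name below is an assumption):
#   for sp in $(pgrep -f sge_shepherd); do
#       is_descendant_of "$job_pid" "$sp" && echo "still below shepherd $sp"
#   done
```

If the process is no longer found below any shepherd, it has left the process 
tree that SGE controls.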

-- Reuti

>> 
>> Do you use any `trap` inside the job script?
> No trap commands.
>> 
>> -- Reuti
> Regards,
> Guillermo.
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
