Re: [gridengine users] Rescheduled and held job running zombie in compute node

Guillermo Marco Puche Tue, 25 Nov 2014 23:27:52 -0800

On 26/11/14 00:42, Reuti wrote:

Hi,


On 25.11.2014 at 23:28, Guillermo Marco Puche wrote:

I'm experiencing a very weird issue. I've no idea how to deal with it.
        • I've submitted multiple jobs, e.g. job1, job2, job3.
        • The jobs are running on multiple compute nodes.
        • I've put the jobs on user hold and then rescheduled them.
        • The jobs are now in the hqR state in the SGE job pool (they're supposed to
stay there and free their slots and resources on their respective compute nodes).
        • The compute nodes that previously ran these jobs continue to execute the
job processes and consume resources (I can see them with htop on the compute
node).
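
For reference, the sequence described above presumably corresponds to commands along the following lines. This is only a sketch: the exact commands used were not posted, the helper names `run` and `hold_and_reschedule` and the job IDs are made up for illustration, and `qmod -rj` may refuse jobs that are not marked rerunnable.

```shell
#!/bin/sh
# Sketch only: put a user hold on each job, then ask SGE to reschedule it.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# hold_and_reschedule is a hypothetical helper, not part of SGE.
hold_and_reschedule() {
    for job_id in "$@"; do
        run qhold -h u "$job_id"   # user hold: keeps the job from being dispatched again
        run qmod -rj "$job_id"     # reschedule: asks SGE to put the job back in the pending list
    done
}

# Hypothetical usage with made-up job IDs:
#   hold_and_reschedule 101 102
#   qstat -u "$USER"       # check the state column of the jobs
#   qrls -h u 101 102      # later: release the user hold
```

Running it first with `DRY_RUN=1` prints the commands without touching the queue, which makes it easier to compare against what was actually typed.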

But they are gone from `qstat` and not listed twice?

Nope, they're listed once in qstat.

So what's the correct way to pause/restart a job and hold it in the SGE pool
without it holding resources on the compute nodes?

Are these processes still bound to the execd and the shepherd of SGE or did 
they jump out of the process tree compared to the time when they were running 
initially?
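
One way to check this on the compute node is to walk up the parent chain of one of the leftover processes and see whether the shepherd and execd appear among its ancestors. A minimal sketch, assuming a `ps` that supports the POSIX `-o` and `-p` options; `ancestry` is a made-up helper name and the PID in the usage note is a placeholder:

```shell
#!/bin/sh
# Print "pid command" for a process and each of its ancestors, child first.
# ancestry is a hypothetical helper, not part of SGE.
ancestry() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ]; do
        # Show this process; stop if it no longer exists.
        ps -o pid= -o comm= -p "$pid" || break
        # Move on to the parent (empty if the lookup fails).
        pid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    done
}

# Hypothetical usage, where 12345 is the PID of a leftover job process:
#   ancestry 12345 | grep -E 'sge_(shepherd|execd)'
```

If the `grep` prints nothing, the process has most likely been reparented and is no longer under the control of the shepherd.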

Yes, the processes are still bound to the execd and the shepherd of SGE.


Do you use any `trap` inside the job script?

No trap commands.


-- Reuti

Regards,
Guillermo.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
