Re: [gridengine users] compute node crash: expected behavior?

Reuti Thu, 28 Apr 2011 14:43:37 -0700

On 28.04.2011 at 22:55, Stuart Barkley wrote:

> On Thu, 28 Apr 2011 at 15:25 -0000, Reuti wrote:
> 
>>> We are seeing that qmaster continues to show the jobs as running,
>>> even after the compute node has restarted and reestablished
>>> communications with qmaster.
>> 
>> Was the node reinstalled and/or the spool directory of the node
>> cleared?
> 
> Ahh, good question.  We may be unusual here.
> 
> These are diskless and stateless compute nodes.  They have a private
> ram disk based local copy of sge_root and /var (along with everything
> else except /home and /scratch).  The spool directory is definitely
> cleared on reboot.


Then SGE doesn't discover that the node crashed and lost its jobs. If the spool
directory were still there, the jobs would be removed from the list automatically.

But sure: this behavior could be improved by comparing what the qmaster thinks
should be there with what the node just discovered. For now, the rebooted node
does this on its own: it discovers that the jobs mentioned in its spool
directory aren't there (if the spool directory survives) and informs the
qmaster about it.
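
To illustrate the comparison suggested above, here is a minimal sketch in
Python. It is not how SGE works today and does not use any SGE API; the
function name find_lost_jobs and the job IDs are made up for illustration.
The idea is only that the qmaster's list of running jobs for a host is
compared against what the restarted node reports, and anything missing on
the node side is treated as lost.

```python
def find_lost_jobs(qmaster_jobs, node_jobs):
    """Return the job IDs the qmaster lists for a node that the node does not report.

    Hypothetical helper: both arguments are plain iterables of job IDs,
    not real SGE data structures.
    """
    return sorted(set(qmaster_jobs) - set(node_jobs))


# The qmaster still lists two jobs as running on the node (made-up IDs).
qmaster_view = {101, 102}

# After a reboot with a cleared spool directory the node knows of no jobs.
node_view = set()

lost = find_lost_jobs(qmaster_view, node_view)
print("jobs to clean up or reschedule:", lost)
```

With a surviving spool directory the node would do the equivalent check
itself, so the comparison on the qmaster side only matters for the
diskless case described here.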

-- Reuti


> I'll also look at reschedule_unknown and max_unheard.  I'm currently
> still recovering from something else (anyone ever push the big red
> button in a large data center?).
> 
> Stuart
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                       --  Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

