On Thu, 28 Apr 2011 at 15:25 -0000, Reuti wrote:

> > We are seeing that qmaster continues to show the jobs as running,
> > even after the compute node has restarted and reestablished
> > communications with qmaster.
>
> Was the node reinstalled and/or the spool directory of the node
> cleared?

Ahh, good question.  We may be unusual here.

These are diskless, stateless compute nodes.  Each has a private,
RAM-disk-based local copy of sge_root and /var (along with everything
else except /home and /scratch).  The spool directory is definitely
cleared on reboot.

I'll also look at reschedule_unknown and max_unheard.  I'm currently
still recovering from something else (anyone ever push the big red
button in a large data center?).

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
