On Thu, 28 Apr 2011 at 15:25 -0000, Reuti wrote:
> > We are seeing that qmaster continues to show the jobs as running,
> > even after the compute node has restarted and reestablished
> > communications with qmaster.
>
> Was the node reinstalled and/or the spool directory of the node
> cleared?
Ahh, good question. We may be unusual here.
These are diskless and stateless compute nodes. They have a private,
RAM-disk-based local copy of sge_root and /var (along with everything
else except /home and /scratch). The spool directory is definitely
cleared on reboot.
I'll also look at reschedule_unknown and max_unheard. I'm currently
still recovering from something else (has anyone ever pushed the big
red button in a large data center?).
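
For reference, a minimal sketch of how those two settings could be checked,
assuming the global configuration text (as printed by `qconf -sconf`) is
available as a string and that both values use the HH:MM:SS form. The
function names and the sample text are hypothetical and only for
illustration; the sample is not output from a real cluster.

```python
def hms_to_seconds(value):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def read_timeouts(conf_text, names=("max_unheard", "reschedule_unknown")):
    """Return {name: seconds} for the named parameters found in conf_text.

    Assumes one "name value" pair per line, separated by whitespace.
    """
    found = {}
    for line in conf_text.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[0] in names:
            found[fields[0]] = hms_to_seconds(fields[1].strip())
    return found


# Illustrative sample only (assumed values, not taken from a real system).
SAMPLE = """\
execd_spool_dir     /var/spool/sge
max_unheard         00:05:00
reschedule_unknown  00:00:00
"""

if __name__ == "__main__":
    print(read_timeouts(SAMPLE))
```

On a live system the text would come from running `qconf -sconf` instead of
the sample string. As I understand sge_conf(5), a reschedule_unknown of
00:00:00 means jobs on hosts in the unknown state are not rescheduled.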
Stuart
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone