On Thu, 28 Apr 2011 at 15:25 -0000, Reuti wrote:

> > We are seeing that qmaster continues to show the jobs as running,
> > even after the compute node has restarted and reestablished
> > communications with qmaster.
>
> Was the node reinstalled and/or the spool directory of the node
> cleared?

Ahh, good question. We may be unusual here: these are diskless, stateless compute nodes. Each has a private, RAM-disk-based local copy of sge_root and /var (along with everything else except /home and /scratch). The spool directory is definitely cleared on reboot.

I'll also look at reschedule_unknown and max_unheard. I'm currently still recovering from something else (anyone ever push the big red button in a large data center?).

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone