See: reschedule_unknown, max_unheard
http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html
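
As a rough sketch of how these two settings might look in the global cluster configuration (shown with `qconf -sconf`, edited with `qconf -mconf`): the values below are illustrative assumptions, not recommendations, so check the sge_conf man page above for the exact semantics in your version. By default reschedule_unknown is 00:00:00, which means no automatic rescheduling.

```
# excerpt of the global configuration; values are hypothetical examples
max_unheard                  00:05:00
reschedule_unknown           00:10:00
```

With settings like these, qmaster would mark a host's queues as unknown after 5 minutes without contact from its execd, then wait a further 10 minutes before trying to reschedule jobs from that host. Per the man page, only jobs flagged as rerunnable (or checkpointing jobs) are rescheduled.
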
And: rerun
http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html

Rayson

On Thu, Apr 28, 2011 at 9:21 AM, Stuart Barkley <stua...@4gh.net> wrote:
> What is the expected behavior when a compute node goes down while jobs
> are running? Although it is hoped that this will not occur, it does
> happen on occasion.
>
> I expected that qmaster would notice the node as gone and either
> reschedule or cancel jobs. I could see some caution on qmaster's
> part, in case it was just the communications which was lost, but the
> job was actually still running.
>
> We are seeing that qmaster continues to show the jobs as running, even
> after the compute node has restarted and reestablished communications
> with qmaster.
>
> Are my expectations off or is there something likely to be
> misconfigured?
>
> We are running a fairly plain SGE 6.2u5.
>
> Is there some design documentation about the communications between
> qmaster and compute nodes regarding job control? Don't need protocol
> bit level description, but an overview of the messaging would be very
> useful.
>
> Thanks,
> Stuart
> --
> I've never been lost; I was once bewildered for three days, but never lost!
> -- Daniel Boone
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users