On 28.04.2011 at 15:21, Stuart Barkley wrote:

> What is the expected behavior when a compute node goes down while jobs
> are running? Although it is hoped that this will not occur, it does
> happen on occasion.
>
> I expected that qmaster would notice the node as gone and either
> reschedule or cancel jobs. I could see some caution on qmaster's
> part, in case it was just the communications which were lost, but the
> job was actually still running.
>
> We are seeing that qmaster continues to show the jobs as running, even
> after the compute node has restarted and reestablished communications
> with qmaster.

Was the node reinstalled and/or the spool directory of the node cleared?

-- Reuti

> Are my expectations off or is there something likely to be
> misconfigured?
>
> We are running a fairly plain SGE 6.2u5.
>
> Is there some design documentation about the communications between
> qmaster and compute nodes regarding job control? Don't need protocol
> bit level description, but an overview of the messaging would be very
> useful.
>
> Thanks,
> Stuart
> --
> I've never been lost; I was once bewildered for three days, but never lost!
> -- Daniel Boone
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users