On 28.04.2011 at 15:21, Stuart Barkley wrote:

> What is the expected behavior when a compute node goes down while jobs
> are running?  Although it is hoped that this will not occur, it does
> happen on occasion.
> 
> I expected qmaster to notice that the node was gone and either
> reschedule or cancel its jobs.  I could see some caution on qmaster's
> part, in case only communication was lost while the job was actually
> still running.
> 
> We are seeing that qmaster continues to show the jobs as running, even
> after the compute node has restarted and reestablished communications
> with qmaster.

Was the node reinstalled and/or the spool directory of the node cleared?

-- Reuti


> Are my expectations off or is there something likely to be
> misconfigured?
> 
> We are running a fairly plain SGE 6.2u5.
> 
> Is there any design documentation on the communication between
> qmaster and compute nodes regarding job control?  I don't need a
> bit-level protocol description, but an overview of the messaging would
> be very useful.
> 
> Thanks,
> Stuart
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users
