What is the expected behavior when a compute node goes down while jobs
are running?  Although it is hoped that this will not occur, it does
happen on occasion.

I expected that qmaster would notice that the node was gone and either
reschedule or cancel its jobs.  I could see some caution on qmaster's
part, in case only communication was lost and the job was actually
still running.

We are seeing that qmaster continues to show the jobs as running, even
after the compute node has restarted and reestablished communications
with qmaster.

Are my expectations off, or is something likely to be misconfigured?

We are running a fairly plain SGE 6.2u5.

Is there any design documentation on the communication between
qmaster and compute nodes regarding job control?  I don't need a
bit-level protocol description, but an overview of the messaging would
be very useful.

Thanks,
Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone