What is the expected behavior when a compute node goes down while jobs are running? Although it is hoped that this will not occur, it does happen on occasion.
I expected that qmaster would notice that the node was gone and either reschedule or cancel the jobs. I could understand some caution on qmaster's part, in case only the communication was lost and the job was actually still running. However, we are seeing that qmaster continues to show the jobs as running, even after the compute node has restarted and reestablished communications with qmaster.

Are my expectations off, or is something likely to be misconfigured? We are running a fairly plain SGE 6.2u5.

Is there any design documentation about the communications between qmaster and the compute nodes regarding job control? I don't need a bit-level protocol description, but an overview of the messaging would be very useful.

Thanks,
Stuart

--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone