Stuart Barkley <stua...@4gh.net> writes:

> What is the expected behavior when a compute node goes down while jobs
> are running?

Not what it does, and it's definitely a pain. I think it was mentioned
previously here or on the old list, and it keeps biting us, but I haven't
had time to try to debug it. Maybe someone else can have a go.

> Although it is hoped that this will not occur, it does
> happen on occasion.
>
> I expected that qmaster would notice the node as gone and either
> reschedule or cancel jobs. I could see some caution on qmaster's
> part, in case it was just the communications which was lost, but the
> job was actually still running.
>
> We are seeing that qmaster continues to show the jobs as running, even
> after the compute node has restarted and reestablished communications
> with qmaster.

I think you see that iff it was a slave node for the job. If the master
node dies, things seem sane. Strangely, I've just logged the result of
one we crashed earlier since there were two jobs on the node, as opposed
to the exclusive access we had previously, showing the difference
<https://arc.liv.ac.uk/trac/SGE/ticket/1283#comment:1>.