Re: [gridengine users] compute node crash: expected behavior?

Dave Love Thu, 28 Apr 2011 10:49:38 -0700

Stuart Barkley <[email protected]> writes:

> What is the expected behavior when a compute node goes down while jobs
> are running?


Not what it does, and it's definitely a pain.  I think it was mentioned
previously here or on the old list, and it keeps biting us, but I
haven't had time to try to debug it.  Maybe someone else can have a go.

> Although it is hoped that this will not occur, it does
> happen on occasion.
>
> I expected that qmaster would notice the node as gone and either
> reschedule or cancel jobs.  I could see some caution on qmaster's
> part, in case only the communications were lost and the
> job was actually still running.
>
> We are seeing that qmaster continues to show the jobs as running, even
> after the compute node has restarted and reestablished communications
> with qmaster.

I think you see that iff the node was a slave node for the job.  If the
master node dies, things seem sane.  Strangely enough, I've just logged
the result of a node we crashed earlier: it had two jobs on it, as
opposed to the exclusive access we had previously, and that shows the
difference <https://arc.liv.ac.uk/trac/SGE/ticket/1283#comment:1>.