Re: [gridengine users] compute node crash: expected behavior?

Rayson Ho Thu, 28 Apr 2011 06:52:32 -0700

See: reschedule_unknown, max_unheard

http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html


And: rerun

http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html
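As a rough sketch of how these three settings fit together (the time values and the queue name all.q below are illustrative assumptions, not recommendations, so check the man pages above for exact semantics): max_unheard and reschedule_unknown are cluster configuration parameters, typically edited with "qconf -mconf", while rerun is a queue attribute, typically edited with "qconf -mq <queue>".

```
# Cluster configuration (qconf -mconf), see sge_conf(5).
# Time values are hh:mm:ss; the numbers here are examples only.
max_unheard          00:05:00
reschedule_unknown   00:10:00

# Queue configuration (qconf -mq all.q), see queue_conf(5).
# all.q is an assumed queue name.
rerun                TRUE
```

As far as I understand it, a reschedule_unknown of 00:00:00 leaves rescheduling disabled, and only jobs that are rerunnable (through the queue's rerun setting or "qsub -r y") are candidates for rescheduling once the host is considered unknown.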

Rayson



On Thu, Apr 28, 2011 at 9:21 AM, Stuart Barkley <[email protected]> wrote:
> What is the expected behavior when a compute node goes down while jobs
> are running?  Although it is hoped that this will not occur, it does
> happen on occasion.
>
> I expected that qmaster would notice the node as gone and either
> reschedule or cancel jobs.  I could see some caution on qmaster's
> part, in case only the communication was lost and the
> job was actually still running.
>
> We are seeing that qmaster continues to show the jobs as running, even
> after the compute node has restarted and reestablished communications
> with qmaster.
>
> Are my expectations off or is there something likely to be
> misconfigured?
>
> We are running a fairly plain SGE 6.2u5.
>
> Is there some design documentation about the communications between
> qmaster and compute nodes regarding job control?  Don't need protocol
> bit level description, but an overview of the messaging would be very
> useful.
>
> Thanks,
> Stuart
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>

