Reuti <re...@staff.uni-marburg.de> writes:

>> These are diskless and stateless compute nodes.  They have a private
>> ram disk based local copy of sge_root and /var (along with everything
>> else except /home and /scratch).  The spool directory is definitely
>> cleared on reboot.

[I'm surprised it's all in RAM, but good if there's enough!]  For what
it's worth, the output I referred to is from stateless nodes that share
nearly everything, including the GE spool, and the symptoms seem
similar.

> Then SGE doesn't discover that the node crashed and lost its jobs. If
> the spool directory were still there, the job would be removed from the
> list automatically.
>
> But sure: this behavior could be improved by comparing what the
> qmaster thinks should be there and what the node just discovered. For
> now the rebooted node is doing it on its own and discovering that the
> jobs mentioned in its spool directory aren't there (if the spool
> directory survives) and informs the qmaster about it.

What seems to be happening here (on the basis of the logs) is that execd
zaps the spool when it starts up (before or after reporting to
qmaster?), qmaster realizes it needs to kill the job and tells the
rebooted node, which keeps trying unsuccessfully because the job
directory has been clobbered.
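To make the suggested reconciliation concrete, here is a minimal sketch of the comparison step: on execd start-up, diff the jobs qmaster believes are on the node against what the (possibly cleared) spool directory actually contains, and report anything missing so qmaster can purge it instead of retrying the kill forever. All names here are illustrative, not actual Grid Engine code.

```python
def reconcile(qmaster_jobs, spooled_jobs):
    """Return job ids qmaster should purge: expected on the node but no
    longer present in the execd spool directory."""
    return sorted(set(qmaster_jobs) - set(spooled_jobs))

# Node rebooted with a RAM-disk spool: the spool comes back empty, so
# every job qmaster still lists for the node should be reported as lost.
lost = reconcile([101, 102, 103], [])
print(lost)  # [101, 102, 103]
```

With something like this, the rebooted node would tell qmaster up front which jobs are gone, rather than qmaster repeatedly asking it to kill jobs whose directories were clobbered.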
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users