On Sun, 8 May 2011 at 09:09 -0000, Dave Love wrote:
> > If you want to perform maintenance on a single node, as soon as
> > possible (e.g. the node has ECC or SMART errors), use qmod -d to
> > disable it, then take it down when all current jobs have finished.
I'm not the original poster, but I have been contemplating this problem
for a while now. I have been using the "qmod -d" approach, but have been
looking for something better. In fact, I think I have a node disabled
which has been ready for a reboot for a while; I need to remember to
check when I get into the office tomorrow.
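
For reference, a minimal sketch of the "qmod -d" approach as described
above. The function names and the node name are made up for
illustration, and it assumes the '*@host' wildcard form is used to
address every queue instance on the host. The job check has not been
verified against slave tasks of parallel jobs.

```shell
#!/bin/sh
# Sketch only: function names and node names are illustrative, not part of SGE.

# Disable every queue instance on the host. Running jobs carry on,
# but no new jobs are scheduled there.
disable_node() {
    qmod -d "*@$1"
}

# List running jobs (all users) in the host's queue instances.
# Empty output suggests the node is ready to be taken down.
running_jobs() {
    qstat -u '*' -s r -q "*@$1"
}

# Re-enable the queue instances once maintenance is done.
enable_node() {
    qmod -e "*@$1"
}
```

Typical use would be "disable_node node01", then "running_jobs node01"
from time to time, and "enable_node node01" after the reboot. The
weakness is the one described above: nothing reminds you that the node
has drained.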
> One advantage of putting it into a restricted group is that you can
> submit a job to tell you when it's become free, or to reboot it.
Do you have a working reboot (or email sending) job script you can
share? Some of the other scripts you have recently shared have helped
my understanding of things.
Do you need to remove the host from its original host group or just
add it to the new one?
How do you keep track of the original configuration/state?
How do you keep track of what action is needed for the node? For our
needs, I can see creating two new host groups "reboot" and "maint"
which can keep the basic information. For hosts in "maint" we could
look up an RT ticket (also still a work in progress) when the host
becomes available.
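
A rough sketch of how the two marker groups might be maintained,
assuming "@reboot" and "@maint" have already been created as host
groups (for example with "qconf -ahgrp"). The function names are
hypothetical. Note that this only adds the host to the marker group and
leaves its original group alone; whether that is sufficient is exactly
the open question above.

```shell
#!/bin/sh
# Sketch only: @reboot and @maint are the hypothetical marker groups
# described above; the function names are illustrative.

# Add a host to a marker group without touching its original group.
# usage: mark_node <host> <group>   e.g. mark_node node01 @reboot
mark_node() {
    qconf -aattr hostgroup hostlist "$1" "$2"
}

# Remove the host from the marker group once the work is done.
# usage: unmark_node <host> <group>
unmark_node() {
    qconf -dattr hostgroup hostlist "$1" "$2"
}

# Show the current definition (including members) of a marker group.
# usage: list_marked <group>
list_marked() {
    qconf -shgrp "$1"
}
```

With this, "list_marked @maint" would give the set of hosts to look up
in RT.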
For just a reboot (which is our usual case: new image for stateless
node), can a simple reboot script do the extra work of restoring the
original configuration, including re-enabling the host and doing the
reboot before SGE tries to start a job on it?
This has some overlap with the "Green" work I've also been looking at
and is another place where a small amount of external state is needed
to step through all the necessary state changes.
On our other cluster, Torque has a short string which can be associated
with a node when it is disabled ('pbsnodes -o -N "yyyy/mm/dd: NEED
REBOOT" nodename'). I've used that to track the reason a node was
disabled and to partially automate the reboot. This is also still a
work in progress.
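
A small wrapper along these lines fills in the date automatically. It
is only a sketch: the function name and the node name in the usage
example are invented, and the pbsnodes invocation is the one quoted
above.

```shell
#!/bin/sh
# Sketch only: the function name and the reason text are illustrative.

# Mark a Torque node offline and attach a dated note giving the reason.
# usage: offline_with_note <node> <reason>
offline_with_note() {
    pbsnodes -o -N "$(date +%Y/%m/%d): $2" "$1"
}
```

For example, "offline_with_note node07 'NEED REBOOT'" records the note
in the same "yyyy/mm/dd: NEED REBOOT" form, which a later script can
match on.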
Is there any way to store additional information with an SGE node? I
think I recently saw something about using a complex string variable.
(See related question about 'complex'.)
Stuart
--
I've never been lost; I was once bewildered for three days, but never lost!
-- Daniel Boone