Re: [gridengine users] How to take a node offline for maintenance?

Stuart Barkley Sun, 08 May 2011 19:04:35 -0700

On Sun, 8 May 2011 at 17:50 -0000, Dave Love wrote:

[some reordering by stuartb]


[scheduling reboots]
> > Do you have a working reboot (or email sending) job script you can
> > share?
>
> Not that would be useful, as it depends on your configuration, and I
> need to tidy ours up anyway.

Actually there was one with your other files (submit-reboot-job).

> The only tricky thing is using the right qsub parameters to get
> exclusive access to the node, either by submitting a parallel job
> that uses all the slots, or exclusive=true.  I need to make a
> specific admin queue across all the nodes.  You may want to worry
> about reservation and bumping up the priority too.

This is where things do seem to start to get more complex and require
specific site setup to support.

> and the guts of my reboot job are just
>
>   /usr/bin/sudo /sbin/service sgeexecd softstop
>   /usr/bin/sudo /sbin/reboot

This is also where things can get tricky:  If you stop sgeexecd, does
the script keep running long enough to do the reboot?  Does SGE see
the job finish before the node reboots?  Does SGE see the jobs die due
to the reboot?  (Hopefully it doesn't restart the reboot job.)
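
One way to approach the restart question is to submit the reboot job as
non-rerunnable and pinned to a single node.  The following is a minimal
sketch, not a tested site recipe.  It assumes an "exclusive" complex has
been configured (as in the exclusive=true mentioned above) and uses a
hypothetical job script name, reboot-job.sh.  By default it only prints
the qsub command it would run.

```shell
#!/bin/sh
# Sketch: submit a non-rerunnable reboot job to one node.
# Assumptions (not from the thread): the script name reboot-job.sh, and
# an "exclusive" complex configured at this site.

DRY_RUN=${DRY_RUN:-1}   # default: only print the command

# Print the qsub command line for a reboot job on host $1.
reboot_job_cmd() {
    host=$1
    script=${2:-reboot-job.sh}   # hypothetical job script
    # -l hostname=...    pin the job to the node being rebooted
    # -l exclusive=true  site-defined complex for whole-node access
    # -R y               ask the scheduler for a reservation
    # -r n               not rerunnable, so SGE should not restart it
    echo "qsub -N reboot-$host -l hostname=$host -l exclusive=true -R y -r n $script"
}

# Print the command (dry run) or actually submit it.
submit_reboot_job() {
    cmd=$(reboot_job_cmd "$@")
    if [ "$DRY_RUN" = 1 ]; then
        echo "$cmd"
    else
        $cmd
    fi
}
```

For example, "submit_reboot_job node042" prints the command for a
made-up node name; setting DRY_RUN=0 would submit it.  Whether the job
script survives the softstop long enough to call reboot still has to be
checked on a real node.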

> > For just a reboot (which is our usual case: new image for
> > stateless node), can a simple reboot script do the extra work of
> > restoring the original configuration?  Including enabling the host
> > and doing the reboot before SGE tries to start a job on it?
>
> I don't understand the problem.  I can normally modify the
> production image and then just reboot, but to flip images I use
> pxeconfig.

We are still in the process of developing our compute image.  Every
couple of weeks (becoming less often) I need to do a global reboot
across 100+ SGE nodes.

Right now I disable all the hosts.  Periodically I look for ones which
have drained, do the reboot and then enable the host.  This can take
several days to get through all hosts.
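
The manual cycle described here (disable, wait for drain, reboot,
enable) could be sketched roughly as below.  This is an illustration
under assumptions, not a tested tool: it assumes hosts can be rebooted
over ssh, and that job lines in qstat output begin with a numeric job
ID after optional blanks.  By default it only prints the commands.

```shell
#!/bin/sh
# Sketch of the manual drain/reboot/enable cycle.
# Assumptions (not from the thread): hosts are rebooted over ssh, and
# qstat job lines start with a numeric job ID after optional blanks.

DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Count job lines in qstat output read from stdin (headers are skipped).
count_jobs() {
    grep -c '^ *[0-9]' || true
}

# Succeeds if no running jobs remain on host $1 (needs a live qstat).
is_drained() {
    n=$(qstat -u '*' -s r -q "*@$1" | count_jobs)
    [ "$n" -eq 0 ]
}

# Disable every queue instance on each host so no new jobs start.
disable_hosts() {
    for h in "$@"; do run qmod -d "*@$h"; done
}

# Reboot a drained host, then re-enable its queue instances.
reboot_and_enable() {
    run ssh "$1" /sbin/reboot
    run qmod -e "*@$1"
}
```

A driver would call disable_hosts once, then periodically loop over the
hosts and call reboot_and_enable for each one where is_drained
succeeds.  Enabling immediately after issuing the reboot leaves a short
window in which a job could start on the node, so a real script may
want to wait until the host has actually gone down.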

A reboot job would make life much easier.  This is where previous
experience helps so that people don't need to keep learning the same
things.  This is also one of the operationally important things
missing from most of the queuing systems I've looked at.

Also, I occasionally boot a few of our SGE compute nodes with other
non-SGE images.

[disabling a node from taking future jobs]
> > Do you need to remove the host from its original host group or
> > just add it to the new one?
>
> If you're talking about restricting a node to admin access, what
> context is missing for the scripts I referred to before?  They might
> well need more comments.

I still need to study these much more closely.

The following two lines (which occur in several places) are probably
site-specific.  Is 'compute' an SGE attribute or something to do with a
site 'genders?' system?

    # Assumes a genders setup where the compute nodes have property `compute'.

    nodeprefix=$(expr $(nodeattr -q compute) : '\([^[0-9]*\)')
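
Going by the quoted comment, 'compute' looks like a genders property
rather than anything SGE-specific.  The expr pattern then appears to
keep only the leading name part of the host list.  A small sketch of
what it extracts, assuming nodeattr prints a range such as
node[001-100] (the sample value is made up):

```shell
#!/bin/sh
# Sketch: what the expr pattern extracts. The host list below is a
# made-up sample standing in for the output of `nodeattr -q compute`.
hostlist='node[001-100]'

# The bracket expression [^[0-9] matches any character that is neither
# "[" nor a digit, so the capture stops where the numeric range starts.
nodeprefix=$(expr "$hostlist" : '\([^[0-9]*\)')
echo "$nodeprefix"
```

With this sample the script prints "node".  A site without genders
would need to supply the host list some other way.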

In sge-restrict-nodes you just add the node to hostgroup 'testing' and
otherwise leave the rest of the configuration unchanged.  The script
also shows the configured 'limit' statement which looks to override
all other access.
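
Adding a host to a hostgroup while leaving everything else unchanged
can be done with qconf's attribute operations.  The sketch below is not
the code of sge-restrict-nodes; it only illustrates the hostgroup step,
with the '@testing' group name taken from the discussion and a made-up
node name.  By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: add or remove a host in an SGE hostgroup without touching
# its other group memberships. "@testing" follows the thread; the
# node name used in the example is made up.

DRY_RUN=${DRY_RUN:-1}   # default: print commands instead of running them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Append host $1 to the hostlist of hostgroup $2 (default @testing).
restrict_host() {
    run qconf -aattr hostgroup hostlist "$1" "${2:-@testing}"
}

# Remove host $1 from that hostgroup again.
unrestrict_host() {
    run qconf -dattr hostgroup hostlist "$1" "${2:-@testing}"
}
```

For example, "restrict_host node042" prints the qconf command that
would add node042 to @testing.  Because the host's original hostgroups
are never modified, there is no earlier state to restore afterwards.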

I need to study 'limit' a lot more to understand all the functions and
interactions.

> > How do you keep track of the original configuration/state?
>
> Sorry, I don't understand that.

Not applicable with your method.  I was thinking that if it were
necessary to remove the host from its original hostgroups, it would
also be necessary to restore it at the end.

> > On our other cluster torque has a short string which can be
> > associated with a node when it is disabled ('pbsnodes -o -N
> > "yyyy/mm/dd: NEED REBOOT" nodename').
>
> My sge-restrict-nodes has a --reason argument for that.

It isn't in the sge-restrict-nodes on the public web site.  I know
I've seen something like that for SGE very recently, but I can't find
it now.

Thanks for the information.  I have a lot to digest.

Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users
