I was promised the result of
to look at adapting it for SGE.  I should chase it up, but I don't know
how comprehensive it is.  As it was written for PBS, I suspect it may
not do anything very useful with resource requests, and I think the OSC
systems are rather uniform compared with our rather heterogeneous ones.

[Off-topic, I've never seen an assessment of the effects on reliability
of treating HPC systems this way, especially already-unreliable ones
like ours, where we're suspicious of thermal effects on the motherboard
stability.  Does anyone know of one?]

If you look at the approved power-down cycle counts which the vendors publish for state-of-the-art hardware, they may be scary. The figure is usually on the order of 2000 cycles, so powering a resource down several times a day would certainly cause problems.
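To put a rough number on that, here is a back-of-the-envelope sketch. The 2000-cycle figure is the one quoted above; the cycles-per-day rates are only illustrative assumptions, not vendor data.

```python
def years_until_cycle_budget_exhausted(rated_cycles, cycles_per_day):
    """Return how many years a rated power-cycle budget lasts at a daily rate."""
    if cycles_per_day <= 0:
        raise ValueError("cycles_per_day must be positive")
    return rated_cycles / cycles_per_day / 365.0


if __name__ == "__main__":
    # 2000 is the order of magnitude quoted in the thread; the daily
    # rates below are assumed values for illustration only.
    for per_day in (1, 3, 6):
        years = years_until_cycle_budget_exhausted(2000, per_day)
        print(f"{per_day} cycles/day: budget lasts about {years:.1f} years")
```

At an assumed three cycles a day the budget would be gone in under two years, which is well inside a typical service life.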

The better option would be to switch the systems into energy saving modes - if the system provides something like that. Not all do.

That said, I know of customers who have done power saving with the brute-force shutdown approach. There is a presentation about it by s+c, which was given at the last Grid Engine workshop. Have the on-line proceedings been archived and made available somewhere?



I have absolutely seen this done with very real results. The most
important thing is to have the system generate emails to senior
management saying things like "... I saved $12,000 in electricity
last quarter ..." --
That doesn't always seem to work unless it's switching off PCs or
lights, unfortunately...

I can't overstate the importance of
making sure that you have the PR stuff covered in addition to the
nice tech stuff under the hood.
I think some of the desire is to have a "green" check box.  In theory,
the cluster will eventually be so busy there won't be any nodes
suitable for powering off.
If it isn't heavily loaded, you can typically win significantly just
with Powernow-type features (if you can understand the BIOS parameters
etc. sufficiently).  If CPU frequency-changing is disabled for
Infiniband, for instance, you can flip it in the GE prolog/epilog.
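A minimal sketch of what such a prolog/epilog helper might look like, assuming the Linux cpufreq sysfs interface is available on the nodes and the script runs with enough privilege to write to it. The function name and the choice of governors are illustrative, not something from an existing GE script.

```python
import glob
import os


def set_cpufreq_governor(governor, sysfs_root="/sys/devices/system/cpu"):
    """Write `governor` to each cpuN/cpufreq/scaling_governor file.

    Returns the number of CPUs updated. `sysfs_root` is a parameter only
    so the function can be exercised against a scratch directory.
    """
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq",
                           "scaling_governor")
    written = 0
    for path in sorted(glob.glob(pattern)):
        with open(path, "w") as fh:
            fh.write(governor + "\n")
        written += 1
    return written
```

The prolog could call something like `set_cpufreq_governor("performance")` for jobs that need fixed frequency, and the epilog `set_cpufreq_governor("ondemand")` to go back to saving power. Which governors exist depends on the kernel and driver.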

I'm fine either way.  I have IPMI working and will use that for power
on.  I'll probably do a shutdown command through ssh so the nodes go
down cleanly.  Not a problem to code.
Yes, I don't understand why IPMI isn't scriptable/automatable (modulo
the pervasive implementation bugs), but it may be useful to abstract
through something like powerman anyhow.  Is doing in-band shutdown more
reliable than via IPMI/ACPI if you have stateful nodes?
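For what it's worth, the IPMI power-on plus in-band ssh shutdown combination could be wrapped roughly as below. This is only a sketch: the BMC hostname, user name, password file and the use of root over ssh are all assumptions about the site, and a layer like powerman could replace the ipmitool call.

```python
import subprocess


def ipmi_power_on_cmd(bmc_host, user, password_file):
    """Build an ipmitool command line that powers a node on via its BMC."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user,
            "-f", password_file, "chassis", "power", "on"]


def ssh_shutdown_cmd(node):
    """Build an in-band clean shutdown command run over ssh."""
    # BatchMode=yes makes ssh fail rather than prompt for a password.
    return ["ssh", "-o", "BatchMode=yes", f"root@{node}",
            "shutdown", "-h", "now"]


def run(cmd, dry_run=True):
    """Run `cmd`, or with dry_run just return it as a printable string."""
    if dry_run:
        return " ".join(cmd)
    return subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to log or dry-run what would be done before letting it loose on real nodes.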

This also starts to interact with other systems which are also trying
to manage the nodes.  You don't want monitoring system alarms going
off because you are saving power.
Yes, you need hooks into Nagios, or whatever, but how does a database
help with that?

This is also where I see mission creep starting to happen as the
"simple database" gains additional functionality.
Does it need more than you already have if you run dbwriter with
appropriate parameters logged?

- monitor the length of the pending ("qw" state) job list to see when
new nodes need to be powered up
Suggestions for getting information out of SGE?  I was thinking of the
xml outputs instead of parsing the human readable outputs.  I've seen
some comments that the xml stuff has been more subject to change over
time than the human readable output.
Yes, but even if you have the latest release, the XML can be ill-formed.
It's also verbose, and so presumably distinctly less scalable.  We do
need a decent GE API for such things.
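Until there is such an API, a defensive sketch of counting pending jobs from the XML output might look like this. It assumes that `qstat -u '*' -s p -xml` emits `job_list` elements each carrying a `state` child, which matches the output I have seen but may differ between releases. It returns None rather than failing when the XML is ill-formed, so the caller can fall back to something else.

```python
import subprocess
import xml.etree.ElementTree as ET


def fetch_pending_xml():
    """Run qstat for all users' pending jobs and return its XML output."""
    result = subprocess.run(["qstat", "-u", "*", "-s", "p", "-xml"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def count_pending_jobs(xml_text):
    """Count job_list elements whose <state> is exactly "qw".

    Returns None if the XML cannot be parsed. Held or errored jobs
    (e.g. "hqw", "Eqw") are deliberately not counted, since powering
    up nodes would not help them. An array job appears as one entry.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    count = 0
    for job in root.iter("job_list"):
        state = (job.findtext("state") or "").strip()
        if state == "qw":
            count += 1
    return count
```

Counting jobs is of course cruder than looking at what they request; on a heterogeneous system the resource requests of each pending job would need inspecting as well.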

Is this sufficient?  For a basic homogeneous load it should be fine,
but I'm worried about the edge cases.
Yes, it's definitely not straightforward generally on a system like ours
with heterogeneous nodes and very mixed job types.

- script things so that when nodes are powered up they come up
disabled (state "d") by default so that they don't take
jobs on right away
Why is it better to do that, rather than just not starting execd?

I'm behind in getting sanity / health checks working.  A basic
starting example would be very useful to see. is one framework you could use at
startup, and possibly for later tests with Nagios NRPE, or similar.

- Before you shut down a node, put it into state "d" so that you
avoid a race condition between a job landing on the node and your
shutdown command hitting it
Yes, and recheck the node for jobs after setting it disabled.
Reenable if jobs got in.  Races here could play havoc with losing
track of node state.
I don't understand this.  What's unsafe about submitting a job to do the
reboot, assuming it can ensure exclusive access to the node with the
exclusive resource or by claiming all the slots?   (I soft-stop execd in
the job before the reboot command to avoid failed-job mail.)
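For reference, the disable / recheck / re-enable sequence described above could be sketched as follows. The three callables are hypothetical hooks: at a real site they might wrap `qmod -d '*@node'`, `qmod -e '*@node'` and a `qstat` or `qhost -j` query, but that wiring is left out here as it is site-specific.

```python
def drain_node(node, disable, enable, running_jobs):
    """Disable `node`, then confirm nothing slipped in before shutdown.

    disable(node) / enable(node) change the queue state on the node;
    running_jobs(node) returns a list of job ids currently on it.
    Returns True if the node is safe to power off, False if a job got
    in and the node was re-enabled.
    """
    disable(node)
    # Recheck only AFTER disabling: a job dispatched between an earlier
    # check and the disable would otherwise be killed by the shutdown.
    jobs = running_jobs(node)
    if jobs:
        enable(node)
        return False
    return True
```

The submit-a-job approach avoids this dance by letting the scheduler itself guarantee exclusive access, at the cost of waiting in the queue.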

We are already gathering power usage information and putting it into
ganglia.  I need to do something else to better maintain the long-term
information we need.  rrd files are great; they are just not meant for
holding detailed long-term information.
If you can get it into a host value via a load sensor, presumably it
could go in the dbwriter database.
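A load sensor for that could be as small as the sketch below. It follows the usual load sensor protocol (wait for a line on stdin, stop on "quit", otherwise print a begin/end block of host:name:value lines). The complex name `node_power_watts` and the `read_watts` callable are hypothetical; the complex would have to be defined in the GE configuration and the reading taken from wherever the power figures already come from.

```python
def format_report(host, values):
    """Format one load report block in the begin/end protocol."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_sensor(stdin, stdout, host, read_watts,
               complex_name="node_power_watts"):
    """Serve load reports until "quit" or end of input is seen on stdin."""
    while True:
        line = stdin.readline()
        if not line or line.strip() == "quit":
            return
        stdout.write(format_report(host, {complex_name: read_watts()}))
        # execd reads from a pipe, so flush after every report.
        stdout.flush()
```

A real script would end with something like `run_sensor(sys.stdin, sys.stdout, socket.gethostname(), my_reader)`, where `my_reader` is whatever function returns the current power draw.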
