I was promised the result of
http://www.jisc.org.uk/whatwedo/programmes/greeningict/technical/supercomputers.aspx
to look at adapting it for SGE.  I should chase it up, but I don't know
how comprehensive it is.  As it was written for PBS, I suspect it may
not do anything very useful with resource requests, and I think the OSC
systems are rather uniform compared with our rather heterogeneous ones.

[Off-topic: I've never seen an assessment of the effects on reliability
of power-cycling HPC systems like this, especially already-unreliable
ones like ours, where we're suspicious of thermal effects on motherboard
stability.  Does anyone know of such a study?]

Stuart Barkley <[email protected]> writes:

> On Fri, 29 Apr 2011 at 13:19 -0000, Chris Dagdigian wrote:
>
>> I have absolutely seen this done with very real results. The most
>> important thing is to have the system generate emails to senior
>> management saying things like "... I saved $12,000 in electricity
>> last quarter ..." --

That doesn't always seem to work unless it's switching off PCs or
lights, unfortunately...

>> I can't overstate the importance of making sure that you have the
>> PR stuff covered in addition to the nice tech stuff under the hood.
>
> I think some of the desire is to have a "green" check box.  In theory,
> the cluster will eventually be so busy there won't be any nodes
> suitable for powering off.

If the system isn't heavily loaded, you can typically win significantly
just with PowerNow-type frequency scaling (if you can understand the
BIOS parameters etc. sufficiently).  If CPU frequency-changing is
disabled for InfiniBand jobs, for instance, you can flip it in the GE
prolog/epilog.
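
A minimal sketch of that flip, assuming the kernel's cpufreq sysfs
interface and prolog/epilog running as root (both assumptions; the
available governors vary by hardware and kernel):

  # prolog: pin every core to the performance governor for the job
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done

  # epilog: drop back to ondemand once the job finishes
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo ondemand > "$g"
  done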

> I'm fine either way.  I have IPMI working and will use that for power
> on.  I'll probably do a shutdown command through ssh so the nodes go
> down cleanly.  Not a problem to code.

Yes; I don't see why IPMI shouldn't be scriptable/automatable (modulo
the pervasive implementation bugs), but it may be useful to abstract it
through something like powerman anyhow.  Is an in-band shutdown more
reliable than going via IPMI/ACPI if you have stateful nodes?
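
For what it's worth, the split you describe looks something like this
sketch (the BMC naming scheme, credentials file, and shutdown path are
assumptions for illustration):

  #!/bin/sh
  # power-node.sh on|off <node>
  node=$2
  case $1 in
      on)  # out-of-band power-on via the node's BMC
           ipmitool -I lanplus -H "${node}-bmc" -U admin \
               -f /etc/ipmi.pass chassis power on ;;
      off) # clean in-band shutdown over ssh
           ssh "$node" /sbin/shutdown -h now ;;
  esac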

> This also starts to interact with other systems which are also trying
> to manage the nodes.  You don't want monitoring system alarms going
> off because you are saving power.

Yes, you need hooks into Nagios, or whatever, but how does a database
help with that?
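
The Nagios hook, at least, can be trivial: schedule downtime through
the external command file before powering a node off.  A sketch,
assuming a typical command-file location and an illustrative two-hour
window:

  #!/bin/sh
  # tell Nagios host $1 is in scheduled downtime so no alarms fire
  cmdfile=/var/spool/nagios/cmd/nagios.cmd
  now=$(date +%s)
  printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;7200;powersave;off\n' \
      "$now" "$1" "$now" "$((now + 7200))" > "$cmdfile"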

> This is also where I see mission creep starting to happen as the
> "simple database" gains additional functionality.

Does it need more than you already have if you run dbwriter with
appropriate parameters logged?

>> - monitor the length of the pending ("qw" state jobs) to see when
>> new nodes need to be powered up
>
> Suggestions for getting information out of SGE?  I was thinking of the
> xml outputs instead of parsing the human readable outputs.  I've seen
> some comments that the xml stuff has been more subject to change over
> time than the human readable output.

Yes, but even if you have the latest release, the XML can be ill-formed.
It's also verbose, and so presumably distinctly less scalable.  We do
need a decent GE API for such things.
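
For the pending-job count specifically, the human-readable output is
easy enough to parse.  A sketch (the state column position and the
threshold are assumptions, and may vary between GE versions):

  #!/bin/sh
  # count pending ("qw") jobs across all users
  pending=$(qstat -u '*' -s p | awk 'NR > 2 && $5 ~ /qw/' | wc -l)
  # power nodes up past some site-chosen threshold
  [ "$pending" -gt 10 ] && echo "need more nodes up"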

> Is this sufficient?  For a basic homogeneous load it should be fine,
> but I'm worried about the edge cases.

Yes: it's definitely not straightforward in general on a system like
ours, with heterogeneous nodes and very mixed job types.

>> - script things so that when nodes are powered up, they come up
>> disabled (state "d") by default, so that they don't take jobs
>> right away

Why is it better to do that, rather than just not starting execd?
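
Either way, the boot-time flow is presumably something like this sketch
(run_checks is a placeholder for nodediag or whatever you use, and it
assumes the node is an SGE admin host so root can run qmod):

  #!/bin/sh
  # keep jobs off until the node proves healthy
  qmod -d "*@$(hostname)"
  /etc/init.d/sgeexecd start
  run_checks && qmod -e "*@$(hostname)"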

> I'm behind in getting sanity / health checks working.  A basic
> starting example would be very useful to see.

http://code.google.com/p/nodediag/ is one framework you could use at
startup, and possibly for later tests with Nagios NRPE, or similar.
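
If you want the same tests run under NRPE later, wrapping them in a
Nagios-style exit code should be enough.  A sketch, assuming nodediag
exits non-zero on failure and an illustrative install path:

  # in nrpe.cfg:
  #   command[check_diag]=/usr/local/bin/check_diag.sh

  #!/bin/sh
  # check_diag.sh: Nagios-style wrapper around the diag suite
  if /usr/sbin/nodediag >/dev/null 2>&1; then
      echo "DIAG OK"; exit 0
  else
      echo "DIAG CRITICAL"; exit 2
  fi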

>> - Before you shut down a node, put it into state "d" so that you
>> avoid a race condition between a job landing on the node and your
>> shutdown command hitting it
>
> Yes, and recheck the node for jobs after setting it disabled.
> Re-enable if jobs got in.  Races here could make you lose track of
> node state.

I don't understand this.  What's unsafe about submitting a job to do
the reboot, assuming it can ensure exclusive access to the node via an
exclusive resource or by claiming all the slots?  (I soft-stop execd in
the job before the reboot command to avoid failed-job mail.)
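
Concretely, something like this sketch, assuming an EXCL consumable
called "exclusive" defined on the exec hosts and passwordless sudo for
the two commands (all local conventions, not GE defaults):

  # submitted as: qsub -b y -l exclusive=true -l hostname=node42 reboot-node.sh
  #!/bin/sh
  # qconf -ke shuts execd down without killing the job, so no
  # failed-job mail is generated when the node goes away
  sudo qconf -ke "$(hostname)"
  sudo /sbin/shutdown -r now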

> We are already gathering power usage information and putting it into
> ganglia.  I need to do something else to maintain the long-term
> information we need.  rrd files are great, but they are not suited to
> holding detailed long-term data.

If you can get it into a host value via a load sensor, presumably it
could go in the dbwriter database.
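
A load sensor for that is only a few lines.  A sketch of the execd
load-sensor protocol (read_power_watts is a placeholder for however you
read the meter, and the "power_usage" complex would need defining):

  #!/bin/sh
  # GE load sensor: emit begin/value/end each time execd polls us
  host=$(hostname)
  while read -r cmd; do
      [ "$cmd" = quit ] && exit 0
      watts=$(read_power_watts)        # placeholder, site-specific
      echo begin
      echo "$host:power_usage:${watts:-0}"
      echo end
  done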