I was promised the results of
http://www.jisc.org.uk/whatwedo/programmes/greeningict/technical/supercomputers.aspx
with a view to adapting it for SGE. I should chase it up, but I don't
know how comprehensive it is. As it was written for PBS, I suspect it
doesn't do anything very useful with resource requests, and I think
the OSC systems are rather uniform compared with our rather
heterogeneous ones.
[Off-topic: I've never seen an assessment of the effects on
reliability of treating HPC systems this way, especially
already-unreliable ones like ours, where we're suspicious of thermal
effects on motherboard stability. Does anyone know of one?]

Stuart Barkley <[email protected]> writes:

> On Fri, 29 Apr 2011 at 13:19 -0000, Chris Dagdigian wrote:
>
>> I have absolutely seen this done with very real results. The most
>> important thing is to have the system generate emails to senior
>> management saying things like "... I saved $12,000 in electricity
>> last quarter ...".

That doesn't always seem to work unless it's switching off PCs or
lights, unfortunately...

>> I can't overstate the importance of making sure that you have the
>> PR stuff covered in addition to the nice tech stuff under the
>> hood.
>
> I think some of the desire is to have a "green" check box. In
> theory, the cluster will eventually be so busy there won't be any
> nodes suitable for powering off.

If it isn't heavily loaded, you can typically win significantly just
with PowerNow-type features (if you can understand the BIOS
parameters etc. sufficiently). If CPU frequency scaling has to be
off for Infiniband, for instance, you can flip it in the GE
prolog/epilog (first sketch below).

> I'm fine either way. I have IPMI working and will use that for
> power on. I'll probably do a shutdown command through ssh so the
> nodes go down cleanly. Not a problem to code.

Yes, I don't understand why IPMI wouldn't be scriptable/automatable
(modulo the pervasive implementation bugs), but it may be useful to
abstract it through something like powerman anyhow (second sketch
below). Is an in-band shutdown more reliable than one via IPMI/ACPI
if you have stateful nodes?

> This also starts to interact with other systems which are also
> trying to manage the nodes. You don't want monitoring system
> alarms going off because you are saving power.

Yes, you need hooks into Nagios, or whatever, but how does a
database help with that?

> This is also where I see mission creep starting to happen as the
> "simple database" gains additional functionality.

Does it need more than you already have if you run dbwriter with the
appropriate parameters logged?

>> - monitor the length of the pending ("qw" state) job list to see
>> when new nodes need to be powered up
>
> Suggestions for getting information out of SGE? I was thinking of
> the XML output instead of parsing the human-readable output. I've
> seen some comments that the XML stuff has been more subject to
> change over time than the human-readable output.

Yes, but even if you have the latest release, the XML can be
ill-formed. It's also verbose, and so presumably distinctly less
scalable. We do need a decent GE API for such things. (Third sketch
below.)

> Is this sufficient? For a basic homogeneous load it should be
> fine, but I'm worried about the edge cases.

Yes, it's definitely not straightforward generally on a system like
ours with heterogeneous nodes and very mixed job types.

>> - script things so that when nodes are powered up they come up
>> disabled (state "d") by default, so that they don't take jobs
>> right away

Why is it better to do that, rather than just not starting execd?

> I'm behind in getting sanity/health checks working. A basic
> starting example would be very useful to see.

http://code.google.com/p/nodediag/ is one framework you could use at
startup, and possibly for later tests with Nagios NRPE, or similar.
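Here's a minimal sketch of the sort of prolog/epilog I mean, assuming
the sysfs cpufreq interface is present and that the script runs with
enough privilege (e.g. installed as "root@/path" in the queue
configuration); the governor names are illustrative and
driver/BIOS-dependent:

#!/usr/bin/env python
# Flip the cpufreq governor on every core.  Install the same script
# as both prolog and epilog in the queue configuration, e.g.
#   prolog root@/path/to/cpufreq.py performance
#   epilog root@/path/to/cpufreq.py ondemand
import glob
import sys

def set_governor(governor):
    # Each online core exposes its own scaling_governor node.
    for path in glob.glob(
            '/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor'):
        with open(path, 'w') as f:
            f.write(governor + '\n')

if __name__ == '__main__':
    set_governor(sys.argv[1])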
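For what it's worth, driving IPMI from a script is straightforward
where the BMC behaves; here's a sketch assuming ipmitool and BMCs
reachable as "<node>-ipmi" (that naming and the credentials are my
inventions), which is exactly the sort of thing powerman would hide
behind one interface:

#!/usr/bin/env python
# Power a node on/off through its BMC with ipmitool.
import subprocess
import sys

def power(node, action, user='admin', password='changeme'):
    # action is "on", "off", "soft" or "status".
    return subprocess.call(
        ['ipmitool', '-I', 'lanplus', '-H', node + '-ipmi',
         '-U', user, '-P', password, 'chassis', 'power', action])

if __name__ == '__main__':
    sys.exit(power(sys.argv[1], sys.argv[2]))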
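And a rough sketch of watching the pending list via qstat's XML
output, with the caveats above about that output; it assumes qstat is
on PATH and counts an array job as a single entry:

#!/usr/bin/env python
# Count pending jobs from qstat's XML output.
import subprocess
import xml.etree.ElementTree as ET

def pending_jobs():
    out = subprocess.check_output(['qstat', '-s', 'p', '-xml'])
    return len(ET.fromstring(out).findall('.//job_list'))

if __name__ == '__main__':
    print(pending_jobs())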
>> - Before you shut down a node, put it into state "d" so that you
>> avoid a race condition between a job landing on the node and your
>> shutdown command hitting it
>
> Yes, and recheck the node for jobs after setting it disabled.
> Re-enable if jobs got in. Races here could play havoc with your
> tracking of node state.

I don't understand this. What's unsafe about submitting a job to do
the reboot, assuming it can ensure exclusive access to the node via
the exclusive resource or by claiming all the slots? (I soft-stop
execd in the job before the reboot command to avoid failed-job
mail.) A sketch of the disable-recheck-shutdown dance follows
anyway.

> We are already gathering power usage information and putting it
> into ganglia. I need to do something else to maintain the detailed
> long-term information we need. rrd files are great; they just
> aren't for holding long-term detailed information.

If you can get it into a host value via a load sensor (sketch
below), presumably it could go into the dbwriter database.
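The disable-recheck-shutdown sequence might look like this; qmod and
qstat are standard GE commands, but the ssh shutdown and the
assumption that queue instance names end in "@<host>" need checking
locally:

#!/usr/bin/env python
# Disable all queue instances on a host, recheck for jobs that
# slipped in, and only then shut it down.
import subprocess
import sys
import xml.etree.ElementTree as ET

def jobs_on(host):
    # True if any running job's queue instance is on the given host.
    out = subprocess.check_output(['qstat', '-u', '*', '-s', 'r', '-xml'])
    return any(q.text and q.text.endswith('@' + host)
               for q in ET.fromstring(out).findall('.//queue_name'))

def shut_down(host):
    subprocess.check_call(['qmod', '-d', '*@' + host])  # close the window
    if jobs_on(host):  # a job may already have landed; back off
        subprocess.check_call(['qmod', '-e', '*@' + host])
        return False
    subprocess.check_call(['ssh', host, 'shutdown', '-h', 'now'])
    return True

if __name__ == '__main__':
    sys.exit(0 if shut_down(sys.argv[1]) else 1)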
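A load sensor for that is just a loop speaking the usual protocol on
stdin/stdout (as I remember it: execd prods the sensor with a
newline, the sensor answers with a begin/end-delimited report, and
exits on "quit"); read_watts() is a hypothetical stand-in for however
you query your PDU or BMC, and "power" would need defining as a host
complex:

#!/usr/bin/env python
# GE load sensor reporting power draw as a host value.
import socket
import sys

def read_watts():
    return 0  # hypothetical: replace with your IPMI/PDU query

host = socket.gethostname()
while True:
    line = sys.stdin.readline()
    if not line or line.strip() == 'quit':
        break
    sys.stdout.write('begin\n')
    sys.stdout.write('%s:power:%d\n' % (host, read_watts()))
    sys.stdout.write('end\n')
    sys.stdout.flush()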
