On 29.04.2011 at 20:03, Stuart Barkley wrote:

> <snip>
>> - monitor the length of the pending ("qw" state jobs) to see when
>> new nodes need to be powered up
> 
> Suggestions for getting information out of SGE?  I was thinking of the
> xml outputs instead of parsing the human readable outputs.  I've seen
> some comments that the xml stuff has been more subject to change over
> time than the human readable output.
> 
> Is just (length > 0) good enough?
> 
> Or actually wait for a few jobs to be waiting?  Counting array job
> tasks?  Counting mpi job needs?
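
As a rough illustration of the XML route, here is a minimal sketch that
counts pending jobs, array tasks and requested slots. It assumes that
`qstat -u '*' -s p -xml` emits `job_list` elements with `state`, `slots`
and (for array jobs) `tasks` children; those element names are an
assumption and, as noted above, may differ between versions, so check
them against your own output first. The helper names are made up for
this example.

```python
import subprocess
import xml.etree.ElementTree as ET


def fetch_pending_xml():
    """Ask qstat for all users' pending jobs as XML (needs SGE installed)."""
    cmd = ["qstat", "-u", "*", "-s", "p", "-xml"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def expand_tasks(spec):
    """Count the tasks in a task spec such as '1-9:2' or '3,7,10-12:1'."""
    total = 0
    for part in spec.split(","):
        rng, _, step = part.partition(":")
        first, _, last = rng.partition("-")
        first = int(first)
        last = int(last) if last else first
        step = int(step) if step else 1
        total += (last - first) // step + 1
    return total


def pending_demand(xml_text):
    """Return (jobs, tasks, slots) summed over jobs whose state is 'qw'."""
    root = ET.fromstring(xml_text)
    jobs = tasks = slots = 0
    for job in root.iter("job_list"):
        # Skip held jobs ("hqw") and anything that is not plainly waiting.
        if (job.findtext("state") or "").strip() != "qw":
            continue
        n_tasks = expand_tasks((job.findtext("tasks") or "1").strip())
        n_slots = int(job.findtext("slots") or 1)
        jobs += 1
        tasks += n_tasks
        slots += n_tasks * n_slots  # every array task needs its own slots
    return jobs, tasks, slots
```

In use, `pending_demand(fetch_pending_xml())` gives three numbers to
threshold on instead of a bare (length > 0) test.
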
> 
> Is this sufficient?  For a basic homogeneous load it should be fine,
> but I'm worried about the edge cases.

No, you have to look ahead: how many resources does the waiting job need, and
would one additional switched-on node be enough for it to be scheduled
successfully? It wouldn't make any sense to switch on a single node for a
waiting job, only to have its slots reserved there, when the job needs even
more slots than that.
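
To make the look-ahead idea concrete, here is a deliberately simplified
sketch. It treats the queue as first-come-first-served, assumes all
powered-off nodes are identical, and counts only slots; it ignores
parallel environment allocation rules, memory and other resource
requests, reservations and backfilling, so it is nowhere near a real
simulation of the scheduler. The function name and its parameters are
invented for this example.

```python
import math


def nodes_to_power_on(job_slots, free_slots, slots_per_node, nodes_off):
    """Estimate how many powered-off nodes are worth switching on.

    job_slots      -- slots requested by each waiting job, in queue order
    free_slots     -- slots currently free on running nodes
    slots_per_node -- slots offered by each powered-off node
    nodes_off      -- number of nodes that could be powered on
    """
    to_start = 0
    for need in job_slots:
        if need <= free_slots:
            # The job fits into capacity we already have (or already plan).
            free_slots -= need
            continue
        extra = math.ceil((need - free_slots) / slots_per_node)
        if extra > nodes_off - to_start:
            # Even all remaining nodes would not let this job start, so
            # powering some of them on would only waste energy.
            continue
        to_start += extra
        free_slots += extra * slots_per_node - need
    return to_start
```

For example, with 4 free slots, 8 slots per node and a single spare
node, a waiting 16-slot job yields 0: one more node would not be enough,
so nothing is switched on.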

It should even cover advance scheduling, since some nodes may have to be
switched on just in time for the period for which they were reserved.

In some way you have to simulate the decisions of the scheduler beforehand.
AFAIK Megware from Chemnitz has something for such a green cluster setup
which works with SGE, LSF or Torque. I don't know any details or whether it
will work for you, but you can contact them at: http://www.megware.com/

-- Reuti


>> - script things so that when nodes are powered up they are by
>> default coming up in disabled (state "d") so that they don't take
>> jobs on right away
> 
> Good point.
> 
>> - each node that boots up needs to run a series of sanity tests
>> designed to protect against common startup failures (missing NFS
>> mounts, etc) that could kill jobs. Running the sanity check script
>> remotely via a passwordless SSH command seems to work and lets you
>> report state/status back into your tracking database
> 
> I'm behind in getting sanity / health checks working.  A basic
> starting example would be very useful to see.
> 
>> - only after the powered on node passes its sanity check do you
>> switch the node away from disabled state "d" so it can start taking
>> on work
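
A very basic starting point for such a check might look like the sketch
below. It only verifies that a list of required mount points appears in
/proc/mounts; the mount points, the function names and the idea of
signalling the result through the exit status are assumptions for
illustration, and a real check would test much more (scratch space,
daemons, clock, and so on).

```python
import sys


def missing_mounts(required, mounts_text):
    """Return the required mount points that are absent from mounts_text.

    mounts_text is in /proc/mounts format: device, mount point, type, ...
    (mount points containing spaces are escaped there and not handled here).
    """
    mounted = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounted.add(fields[1])
    return [m for m in required if m not in mounted]


def main(required, mounts_path="/proc/mounts"):
    """Return 0 if every required mount is present, 1 otherwise."""
    with open(mounts_path) as fh:
        missing = missing_mounts(required, fh.read())
    for mount in missing:
        print(f"SANITY FAIL: {mount} is not mounted", file=sys.stderr)
    return 1 if missing else 0
```

If the script ends with `sys.exit(main(sys.argv[1:]))` and is run over
SSH with the mount points as arguments, the controlling host can decide
from the exit status whether to enable the node.
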
>> 
>> - Before you shut down a node, put it into state "d" so that you
>> avoid a race condition between a job landing on the node and your
>> shutdown command hitting it
> 
> Yes, and recheck the node for jobs after setting it disabled.
> Reenable if jobs got in.  Races here could play havoc with losing
> track of node state.
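
The disable, recheck, re-enable sequence could be sketched as follows.
It uses `qmod -d` / `qmod -e` on the queue instances `*@host` and treats
any output from `qstat -u '*' -s rs -q '*@host'` as "jobs are still
here"; that reading of the qstat output is an assumption to verify on
your installation, and the function names are invented. The command
runner is passed in so the logic can be tried without a cluster.

```python
import subprocess


def run_cmd(cmd):
    """Run a command and return its standard output as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def drain_node(host, runner=run_cmd):
    """Disable all queue instances on host; return True if safe to power off.

    If a job slipped in before the disable took effect, the node is
    re-enabled and False is returned, so the recorded state stays accurate.
    """
    instances = f"*@{host}"
    runner(["qmod", "-d", instances])
    # Recheck only after disabling: this closes the window in which a job
    # could land between the check and the shutdown.
    busy = runner(["qstat", "-u", "*", "-s", "rs", "-q", instances]).strip()
    if busy:
        runner(["qmod", "-e", instances])
        return False
    return True
```

Only a True result should be followed by the actual shutdown command.
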
> 
>> - Track your up/down actions in enough detail so you can create
>> reports showing how much power you have saved. Senior managers love
>> this stuff
> 
> We are already gathering power usage information and putting it into
> ganglia.  I need to do something else to better maintain the long
> term information we need.  rrd files are great.  They are just not
> meant for holding detailed long term information.
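
One possible way to keep detailed long term records is a plain event
log, for instance in SQLite. The sketch below is only an illustration:
the table layout, the function names and the flat idle-power figure used
to estimate savings are all assumptions, not something taken from an
existing tool.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the event log; ':memory:' is handy for trying it out."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS power_events ("
        " ts REAL NOT NULL,"      # Unix timestamp of the action
        " host TEXT NOT NULL,"
        " action TEXT NOT NULL CHECK (action IN ('up', 'down')))"
    )
    return db


def record(db, host, action, ts):
    """Append one power action ('up' or 'down') for host at time ts."""
    db.execute("INSERT INTO power_events VALUES (?, ?, ?)", (ts, host, action))
    db.commit()


def hours_off(db, host, until):
    """Total hours host spent powered down, counting up to time 'until'."""
    rows = db.execute(
        "SELECT ts, action FROM power_events WHERE host = ? ORDER BY ts", (host,)
    )
    total = 0.0
    off_since = None
    for ts, action in rows:
        if action == "down" and off_since is None:
            off_since = ts
        elif action == "up" and off_since is not None:
            total += ts - off_since
            off_since = None
    if off_since is not None:  # node is still off at the end of the period
        total += until - off_since
    return total / 3600.0


def saved_kwh(hours, idle_watts):
    """Energy an idle node would have drawn during the hours it was off."""
    return hours * idle_watts / 1000.0
```

Because every action is kept as its own row, reports at any resolution
can be produced later, which is exactly what rrd files average away.
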
> 
>> I tried to get the people who wrote this system to turn it into a
>> product and they were cool with it. The big company they worked for
>> was also cool with it but we never went all that far because the
>> effort of doing the legal stuff required to allow this code to leave
>> the big company was basically "too much work" at the time.
> 
> Is it really a lot of code?  Yes, there are a lot of details to handle
> to get things right for all the edge cases.
> 
> Making the code open source could also be just as hard for the "legal
> stuff", but the company wouldn't need to worry about setting up sales
> vehicles, supplying support or anything else.
> 
> I hope to make anything I come up with available for others to use.
> It won't be general purpose and would need someone skilled to adapt it
> to their environment.  This is what I'm looking for now, something
> good to crib off of.
> 
> Thanks,
> Stuart
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

