Am 16.03.11 16:10, Dave Love wrote:
Fritz Ferstl<[email protected]>  writes:

It should actually be quite easy. In a first implementation you'll
probably want to introduce a qmaster_param "black_hole_exit_rate" and
then keep a statistic of the job exit rate for each exec host near
the code where qmaster receives job completion information. Qmaster
would compare the exit rate of each host against black_hole_exit_rate
and disable any host whose exit rate is higher than
black_hole_exit_rate allows.
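As a rough illustration of the bookkeeping this would require (a minimal sketch only; the class and parameter names are made up for this example and are not part of the Grid Engine code base), qmaster could keep a sliding window of failed-exit timestamps per exec host and compare the resulting rate against black_hole_exit_rate:

```python
import time
from collections import defaultdict, deque

class ExitRateMonitor:
    """Hypothetical sketch: track failed job exits per exec host and flag
    hosts whose failure rate (failures per second over a sliding window)
    exceeds black_hole_exit_rate. Illustrative only, not SGE internals."""

    def __init__(self, black_hole_exit_rate, window_seconds=60.0):
        self.rate_limit = black_hole_exit_rate
        self.window = window_seconds
        self.failures = defaultdict(deque)  # host -> timestamps of failed exits

    def record_exit(self, host, exit_status, now=None):
        """Called where qmaster receives job completion information."""
        now = time.time() if now is None else now
        if exit_status != 0:
            q = self.failures[host]
            q.append(now)
            # drop events that have fallen out of the sliding window
            while q and now - q[0] > self.window:
                q.popleft()

    def is_black_hole(self, host, now=None):
        """True if the host's recent failure rate exceeds the limit."""
        now = time.time() if now is None else now
        q = self.failures[host]
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) / self.window > self.rate_limit
```

A host flagged this way would then be disabled by qmaster, which is exactly the step that avoids the race described further down.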

A more advanced implementation would provide a black_hole_exit_rate per
exec host or even per cluster queue (i.e. per job class) plus per exec
host. The checking itself won't get much more complicated. The problem
with the more advanced approach is only that you'd have to modify the
format of the queue and host configuration. This would make that version
incompatible with earlier versions. So the upgrade step will get more
"involved".
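The lookup order for the more advanced scheme might look like the following sketch (again purely illustrative; the data structures and names are assumptions, not the actual queue/host configuration format): a per-(cluster queue, host) limit wins over a per-host limit, which wins over the global qmaster_param.

```python
def resolve_exit_rate_limit(host, queue, per_queue_host, per_host, global_rate):
    """Hypothetical threshold resolution for black_hole_exit_rate:
    most specific setting wins, falling back to the global value."""
    if (queue, host) in per_queue_host:
        return per_queue_host[(queue, host)]  # per job class plus exec host
    if host in per_host:
        return per_host[host]                 # per exec host
    return global_rate                        # global qmaster_param
```

The checking logic itself stays the same; only the source of the threshold changes, which is why the upgrade/configuration-format issue is the main cost here.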

Understanding this might be more generally useful.

What's the reason for doing it in the qmaster?  The way I'd hope to be
able to do it would be locally in execd, with an error state triggered
if the rate exceeded that specified by a complex (or more than one).  I
don't know enough about the architecture, though, and maybe the execd
doesn't have access to the relevant information, for one thing.

It's simply a matter of race conditions and getting back to sanity quicker. If you do it in the execd, then during the time you figure things out there and report the situation back to the qmaster, the qmaster will keep sending jobs. Of course, you can catch those in qmaster and send them back as having been unable to run, but there's probably no way to do that other than using the exit code 99 feature, and that could have undesired side effects (such as where in the pending job list those jobs get re-inserted).

Doing it in the qmaster is certainly cleaner and it shouldn't be any more complicated. As long as it is strictly exit-rate-based, the qmaster is the right place. If you wanted to be smarter, analyze exactly why a job has failed, and make the black-hole decision dependent on that, then there'd need to be code in the execd.

If this was implemented, it might be useful to try to ignore classes of
error that seemed to be due to the user to balance the risks between
losing jobs and knocking out the whole cluster.  When there's a large
array job -- or a big batch of jobs which should be an array job -- it's
easy to knock out the cluster with some simple mistake already, as at
least one sort of user error can put the queue in an error state
(probably when the working directory disappears, but I can't remember
off-hand).

Yep, that's a well-known problem. Off the top of my head I don't recall the exact cause either, nor whether it might have been fixed, but care needs to be taken there.

Cheers,

Fritz


--
Fritz Ferstl   --   CTO and Business Development, EMEA
Univa   --   The Data Center Optimization Company
E-Mail: [email protected]
Web: http://www.univa.com
Phone: +49.9471.200.195
Mobile: +49.170.819.7390



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
