(anonymous) wrote:

> That is an sge scheduler problem.
>
> I could not comment on your sge ticket because jira does not accept my
> jira token. The load limit is set ok because we use np_load_* values,
> which is the load divided by the number of cores on this host. So e.g.
> sge stops scheduling jobs on nightshade if the host load is more than
> 20. So i think increasing this value does not make sense.

You're probably right about this. I was assuming a load goal of 2 just
from the symptoms displayed.

> Your output below contains load adjustments:
>
>   queue instance "longrun...@yarrow.toolserver.org" dropped because it
>   is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000
>   with nproc=4) >= 3.1
>
> means that there is a normalized host load of 0.015000 on yarrow and 16
> jobs were started within the last 4.5 minutes (= load_adjustment_time).
> sge temporarily (for the first 4.5 minutes of a job's lifetime) adds
> some expected load for new jobs so that hosts do not become overloaded.
> Most new jobs normally need some startup time until they really use all
> the resources they need. This prevents scheduling too many jobs at once
> to one execd client.
>
> But as you can also see, in reality there are no new jobs. The problem
> is the response from the master:
>
>   $ qping -info damiana 536 qmaster 1
>   05/17/2013 07:03:14:
>   SIRM version:             0.1
>   SIRM message id:          1
>   start time:               05/15/2013 23:47:49 (1368661669)
>   run time [s]:             112525
>   messages in read buffer:  0
>   messages in write buffer: 0
>   nr. of connected clients: 8
>   status:                   1
>   info:                     MAIN: E (112524.48) | signaler000: E (112523.98) |
>                             event_master000: E (0.27) | timer000: E (4.27) |
>                             worker000: E (7.05) | worker001: E (8.93) |
>                             listener000: E (1.03) | scheduler000: E (8.93) |
>                             listener001: E (5.03) | WARNING
>
> All threads are in an error state, including the scheduler thread. So
> the scheduler does not accept the status updates sent by the execds and
> therefore does not know about finished jobs and load updates. That's why
> you see in the qstat output a (non-existing) overload problem and no
> running jobs (although some old long-running jobs are still running).
>
> I think this could be solved by restarting the master scheduler process.
> That is why i (as sge operator) sent a kill command to the scheduler on
> damiana and hoped that the ha_cluster would automatically restart this
> process/service. But this is sadly not the case. So we have to wait
> until a ts admin can restart this service manually.
>
> In the meantime, submitting new jobs will return an error, sorry for
> that. All running or queued jobs are not affected and will keep running
> or queued.
>
> [...]

Thanks for tracking this down! Looking at qstat -u *, it seems to have
recovered now.

Tim

P. S.: Regarding JIRA, did I miss any followup to
http://permalink.gmane.org/gmane.org.wikimedia.toolserver/5241?
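P. P. S.: For anyone puzzling over the arithmetic in the quoted scheduler
message, here is a minimal sketch of the overload check as I read it. The
variable names are mine, not SGE's; the numbers are simply the ones from
the message:

    # Sketch of the np_load_short check from the quoted scheduler
    # message. Variable names are illustrative, not SGE's own.
    np_load     = 0.015  # current normalized load on yarrow
    per_job_adj = 0.8    # expected load added per recently started job
    recent_jobs = 16     # jobs started within load_adjustment_time
    nproc       = 4      # cores on the host, used to normalize the adjustment

    # Each recent job temporarily counts as per_job_adj / nproc extra load:
    np_load_short = np_load + per_job_adj * recent_jobs / nproc
    print(round(np_load_short, 6))  # 3.215 -- at or above the 3.1 limit,
                                    # so the queue instance is dropped

With a healthy scheduler this adjustment decays after load_adjustment_time,
so the check only bites while many jobs really are starting up; here the
stuck scheduler kept counting 16 phantom "recent" jobs indefinitely.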