I can't run a very simple IRC script (which needs longrun) as Alebot either; qstat shows it stuck in qw status, and qstat -j prints something esoteric mentioning "overload".
I'll follow this thread to see whether the issue gets solved.

Alex

2013/5/17 Merlissimo <m...@toolserver.org>

> That is an SGE scheduler problem.
>
> I could not comment on your SGE ticket because JIRA does not accept my
> JIRA token. The load limit is set correctly because we use np_load_*
> values, which are the load divided by the number of cores on the host.
> So, for example, SGE stops scheduling jobs on nightshade if the host
> load is more than 20. I therefore think increasing this value does not
> make sense.
>
> Your output below contains load adjustments:
>
>   queue instance "longrun-lx@yarrow.toolserver.org" dropped because it
>   is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000
>   with nproc=4) >= 3.1
>
> This means that there is a normalized host load of 0.015000 on yarrow
> and that 16 jobs were started within the last 4.5 minutes
> (= load_adjustment_time). SGE temporarily (for the first 4.5 minutes of
> a job's lifetime) adds some expected load for each new job so that the
> host does not become overloaded later. Most new jobs need some startup
> time before they really use all the resources they need. This prevents
> scheduling too many jobs at once on one execd client.
>
> But as you can also see, in reality there are no new jobs. The problem
> shows in the response from the master:
>
> $ qping -info damiana 536 qmaster 1
> 05/17/2013 07:03:14:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               05/15/2013 23:47:49 (1368661669)
> run time [s]:             112525
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 8
> status:                   1
> info:                     MAIN: E (112524.48) | signaler000: E (112523.98)
>                           | event_master000: E (0.27) | timer000: E (4.27)
>                           | worker000: E (7.05) | worker001: E (8.93)
>                           | listener000: E (1.03) | scheduler000: E (8.93)
>                           | listener001: E (5.03) | WARNING
>
> All threads are in error state, including the scheduler thread. So the
> scheduler does not accept the status updates sent by the execds, and it
> does not know about finished jobs or load updates.
> That's why the qstat output shows a (nonexistent) overload problem and
> no running jobs (although some old long-running jobs are still
> running).
>
> I think this could be solved by restarting the master scheduler process.
> That is why I (as SGE operator) sent a kill command to the scheduler on
> damiana and hoped that the ha_cluster would automatically restart this
> process/service. Sadly, this is not the case, so we have to wait until a
> TS admin can restart the service manually.
>
> In the meantime, submitting new jobs will return an error; sorry for
> that. Running or queued jobs are not affected and will keep running or
> stay queued.
>
> Merlissimo
>
> On 17.05.2013 03:41, Tim Landscheidt wrote:
>
>> Hi,
>>
>> a "qstat -j" of a simple job yields inter alia:
>>
>> | scheduling info:
>> |   queue instance "longrun-sol@willow.toolserver.org" dropped because it is temporarily not available
>> |   queue instance "short-sol@willow.toolserver.org" dropped because it is temporarily not available
>> |   queue instance "medium-lx@mayapple.toolserver.org" dropped because it is temporarily not available
>> |   queue instance "longrun3-sol@willow.toolserver.org" dropped because it is temporarily not available
>> |   queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
>> |   queue instance "longrun2-sol@hawthorn.toolserver.org" dropped because it is disabled
>> |   queue instance "medium-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=0.791601 (= 0.391601 + 0.8 * 2.000000 with nproc=4) >= 0.75
>> |   queue instance "medium-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=1.215000 (= 0.015000 + 0.8 * 6.000000 with nproc=4) >= 1.2
>> |   queue instance "medium-lx@nightshade.toolserver.org" dropped because it is overloaded: np_load_short=1.227500 (= 0.127500 + 0.8 * 11.000000 with nproc=8) >= 1.2
>> |   queue instance "medium-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=0.778613 (= 0.078613 + 0.8 * 7.000000 with nproc=8) >= 0.75
>> |   queue instance "short-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=1.278613 (= 0.078613 + 0.8 * 12.000000 with nproc=8) >= 1.2
>> |   queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.391601 (= 0.391601 + 0.8 * 5.000000 with nproc=4) >= 1.2
>> |   queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
>> |   queue instance "longrun-lx@nightshade.toolserver.org" dropped because it is overloaded: mem_free=-420765696.524288 (= 14098.726562M - 500M * 29.000000) <= 500
>>
>> At the moment, we have /no/ jobs scheduled by SGE running.
>> Meanwhile, the hosts are idling:
>>
>> | queuename                      qtype resv/used/tot. load_avg arch       states
>> | ---------------------------------------------------------------------------------
>> | short-sol@ortelius.toolserver. B     0/0/8          1.52     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | short-...@willow.toolserver.or B     0/0/8          -NA-     sol-amd64  au
>> | ---------------------------------------------------------------------------------
>> | short-sol@wolfsbane.toolserver B     0/0/12         0.64     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | medium-lx@mayapple.toolserver. B     0/0/32         -NA-     linux-x64  adu
>> | ---------------------------------------------------------------------------------
>> | medium-lx@nightshade.toolserve B     0/0/8          1.05     linux-x64
>> | ---------------------------------------------------------------------------------
>> | medium...@yarrow.toolserver.or B     0/0/8          0.02     linux-x64
>> | ---------------------------------------------------------------------------------
>> | longrun-lx@nightshade.toolserv BI    0/0/64         1.05     linux-x64
>> | ---------------------------------------------------------------------------------
>> | longrun-lx@yarrow.toolserver.o BI    0/0/64         0.02     linux-x64
>> | ---------------------------------------------------------------------------------
>> | longrun-sol@willow.toolserver. BI    0/0/64         -NA-     sol-amd64  au
>> | ---------------------------------------------------------------------------------
>> | medium-sol@ortelius.toolserver B     0/0/4          1.52     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | medium-sol@wolfsbane.toolserve B     0/0/4          0.64     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | longrun2-sol@clematis.toolserv B     0/0/8          0.03     sol-amd64  d
>> | ---------------------------------------------------------------------------------
>> | longrun2-sol@hawthorn.toolserv B     0/0/8          0.23     sol-amd64  d
>> | ---------------------------------------------------------------------------------
>> | longrun3-sol@willow.toolserver B     0/0/4          -NA-     sol-amd64  aduE
>>
>> I filed https://jira.toolserver.org/browse/TS-1650 on Monday
>> to no avail so far.
>>
>> Tim
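
For anyone puzzling over the "overloaded" figures in the quoted qstat -j output, here is a minimal Python sketch of the arithmetic those lines appear to show. The formula (adding 0.8 per recently started job, divided by nproc, to the normalized load) is inferred only from the quoted numbers, not taken from the SGE sources, and it ignores any decay of the adjustment over time. The function names and the short host labels are hypothetical.

```python
import math


def adjusted_np_load(np_load, new_jobs, nproc, load_adjustment=0.8):
    """Normalized load plus the expected load of recently started jobs.

    Assumption: each new job adds `load_adjustment` to the raw load, and
    the sum is normalized by the core count, as the quoted lines suggest.
    """
    return np_load + load_adjustment * new_jobs / nproc


def is_overloaded(np_load, new_jobs, nproc, threshold, load_adjustment=0.8):
    """True if the adjusted load reaches the queue's load threshold."""
    return adjusted_np_load(np_load, new_jobs, nproc, load_adjustment) >= threshold


# (label, np_load, new_jobs, nproc, threshold) copied from the qstat -j output
examples = [
    ("longrun-lx@yarrow", 0.015000, 16, 4, 3.1),
    ("medium-sol@ortelius", 0.391601, 2, 4, 0.75),
    ("medium-lx@nightshade", 0.127500, 11, 8, 1.2),
]

for label, np_load, new_jobs, nproc, threshold in examples:
    adjusted = adjusted_np_load(np_load, new_jobs, nproc)
    verdict = "overloaded" if adjusted >= threshold else "ok"
    print(f"{label}: np_load_short={adjusted:.6f} (threshold {threshold}) -> {verdict}")

# With zero recently started jobs the same host is far below its threshold,
# which matches the idle load_avg values in the qstat -f table.
print(is_overloaded(0.015000, 0, 4, 3.1))
```

If the scheduler really believes 16 jobs were just started on yarrow, the adjustment term alone (0.8 * 16 / 4 = 3.2) exceeds the 3.1 threshold, which fits Merlissimo's explanation that the job counts are stale.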
_______________________________________________ Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org) https://lists.wikimedia.org/mailman/listinfo/toolserver-l Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette