I can't run Alebot either; it is a very simple IRC script (that needs
longrun). qstat shows that it stays in qw status, and qstat -j reports
something esoteric mentioning "overload".

I'll follow this thread to see if the issue will be solved.

Alex


2013/5/17 Merlissimo <m...@toolserver.org>

> That is an SGE scheduler problem.
>
> I could not comment on your SGE ticket because JIRA does not accept my JIRA
> token. The load limit is set correctly, because we use np_load_* values,
> i.e. the load divided by the number of cores on the host. So, for example,
> SGE stops scheduling jobs on nightshade if the host load is more than 20.
> So I think increasing this value does not make sense.
>
> Your output below contains load adjustments:
>   queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is
>   overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with
>   nproc=4) >= 3.1
> means that there is a normalized host load of 0.015000 on yarrow and that
> 16 jobs were started within the last 4.5 minutes (= load_adjustment_time).
> SGE temporarily (for the first 4.5 minutes of a job's lifetime) adds some
> expected load for new jobs so that the host does not become overloaded
> later. Most new jobs need some startup time before they really use all the
> resources they need. This prevents scheduling too many jobs at once on one
> execd client.
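
(A minimal sketch of the arithmetic behind that scheduling message. It assumes the adjusted value is the reported np_load plus 0.8 per recently started job, divided by nproc, which is what the numbers in the quoted qstat -j output add up to. The function names are hypothetical and not part of SGE.)

```python
def adjusted_np_load(np_load, recent_jobs, nproc, adjustment=0.8):
    """Reported normalized load plus the temporary per-job adjustment."""
    return np_load + adjustment * recent_jobs / nproc


def is_overloaded(np_load, recent_jobs, nproc, threshold, adjustment=0.8):
    """True if the adjusted load reaches the queue's load threshold."""
    return adjusted_np_load(np_load, recent_jobs, nproc, adjustment) >= threshold


# Values copied from the longrun-lx@yarrow line of the qstat -j output.
yarrow = adjusted_np_load(np_load=0.015, recent_jobs=16, nproc=4)
print(f"np_load_short={yarrow:.6f}")
print(is_overloaded(0.015, 16, 4, threshold=3.1))

# With no recent job starts counted, the same host is far below the limit.
print(is_overloaded(0.015, 0, 4, threshold=3.1))
```

The second check illustrates the point of the explanation: the real load on yarrow is tiny, and only the 16 job starts that the scheduler still counts push the queue over its threshold.
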
>
> But as you can also see, in reality there are no new jobs. The problem
> shows up in this response from the master:
>
> $qping -info damiana 536 qmaster 1
> 05/17/2013 07:03:14:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               05/15/2013 23:47:49 (1368661669)
> run time [s]:             112525
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 8
> status:                   1
> info:                     MAIN: E (112524.48) | signaler000: E (112523.98)
> | event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) |
> worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) |
> listener001: E (5.03) | WARNING
>
> All threads are in error state, including the scheduler thread. So the
> scheduler does not accept the status updates sent by the execds, and so it
> does not know about finished jobs and load updates. That's why you see in
> the qstat output a (non-existent) overload problem and no running jobs
> (although some old long-running jobs are still running).
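
(A small sketch of how the info line above could be checked automatically. The string is copied from the qping output quoted in this mail; the regular expression and the function name are assumptions for illustration, and "E" is read as an error state as described above. The number in parentheses is kept as-is without interpreting it.)

```python
import re

# Info line copied from the quoted qping output (re-joined into one string).
INFO = (
    "MAIN: E (112524.48) | signaler000: E (112523.98) | "
    "event_master000: E (0.27) | timer000: E (4.27) | worker000: E (7.05) | "
    "worker001: E (8.93) | listener000: E (1.03) | scheduler000: E (8.93) | "
    "listener001: E (5.03) | WARNING"
)

# name: STATE (number), e.g. "scheduler000: E (8.93)"
THREAD_RE = re.compile(r"(\w+):\s+(\w)\s+\(([\d.]+)\)")


def parse_thread_states(info):
    """Map each thread name to (state letter, number in parentheses)."""
    return {
        name: (state, float(value))
        for name, state, value in THREAD_RE.findall(info)
    }


threads = parse_thread_states(INFO)
in_error = sorted(name for name, (state, _) in threads.items() if state == "E")
print(f"{len(in_error)} of {len(threads)} threads report state E")
print("scheduler affected:", any(n.startswith("scheduler") for n in in_error))
```
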
>
> I think this could be solved by restarting the master scheduler process.
> That is why I (as SGE operator) sent a kill command to the scheduler on
> damiana and hoped that the ha_cluster would automatically restart this
> process/service. But this is sadly not the case. So we have to wait until a
> TS admin can restart this service manually.
>
> In the meantime, submitting new jobs will return an error; sorry for that.
> Running and queued jobs are not affected and will keep running or stay
> queued.
>
> Merlissimo
>
> On 17.05.2013 03:41, Tim Landscheidt wrote:
>
>> Hi,
>>
>> a "qstat -j" of a simple job yields inter alia:
>>
>> | scheduling info:            queue instance "longrun-sol@willow.toolserver.org" dropped because it is temporarily not available
>> |                             queue instance "short-sol@willow.toolserver.org" dropped because it is temporarily not available
>> |                             queue instance "medium-lx@mayapple.toolserver.org" dropped because it is temporarily not available
>> |                             queue instance "longrun3-sol@willow.toolserver.org" dropped because it is temporarily not available
>> |                             queue instance "longrun2-sol@clematis.toolserver.org" dropped because it is disabled
>> |                             queue instance "longrun2-sol@hawthorn.toolserver.org" dropped because it is disabled
>> |                             queue instance "medium-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=0.791601 (= 0.391601 + 0.8 * 2.000000 with nproc=4) >= 0.75
>> |                             queue instance "medium-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=1.215000 (= 0.015000 + 0.8 * 6.000000 with nproc=4) >= 1.2
>> |                             queue instance "medium-lx@nightshade.toolserver.org" dropped because it is overloaded: np_load_short=1.227500 (= 0.127500 + 0.8 * 11.000000 with nproc=8) >= 1.2
>> |                             queue instance "medium-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=0.778613 (= 0.078613 + 0.8 * 7.000000 with nproc=8) >= 0.75
>> |                             queue instance "short-sol@wolfsbane.toolserver.org" dropped because it is overloaded: np_load_short=1.278613 (= 0.078613 + 0.8 * 12.000000 with nproc=8) >= 1.2
>> |                             queue instance "short-sol@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.391601 (= 0.391601 + 0.8 * 5.000000 with nproc=4) >= 1.2
>> |                             queue instance "longrun-lx@yarrow.toolserver.org" dropped because it is overloaded: np_load_short=3.215000 (= 0.015000 + 0.8 * 16.000000 with nproc=4) >= 3.1
>> |                             queue instance "longrun-lx@nightshade.toolserver.org" dropped because it is overloaded: mem_free=-420765696.524288 (= 14098.726562M - 500M * 29.000000) <= 500
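
(A rough sketch of the mem_free figure in the last line, assuming "M" means 2^20 bytes and that 500M is subtracted for each of the 29 jobs the scheduler counts. Under those assumptions the result matches the value printed by qstat -j; the function name is hypothetical.)

```python
MIB = 1024 * 1024  # assumption: "M" in the qstat -j output means 2**20 bytes


def adjusted_mem_free_bytes(mem_free_mib, per_job_mib, counted_jobs):
    """Free memory after subtracting the per-job adjustment, in bytes."""
    return (mem_free_mib - per_job_mib * counted_jobs) * MIB


# Values copied from the longrun-lx@nightshade line of the output.
nightshade = adjusted_mem_free_bytes(14098.726562, 500, 29)
print(f"mem_free={nightshade:.6f}")

# A negative value is below the threshold whatever unit the "500" is in.
print(nightshade <= 500)

# Without the 29 counted jobs, about 14 GB would still be free.
print(adjusted_mem_free_bytes(14098.726562, 500, 0) > 500 * MIB)
```
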
>>
>> At the moment, we have /no/ jobs scheduled by SGE running.
>> Meanwhile, the hosts are idling:
>>
>> | queuename                      qtype resv/used/tot. load_avg arch          states
>> | ---------------------------------------------------------------------------------
>> | short-sol@ortelius.toolserver. B     0/0/8          1.52     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | short-...@willow.toolserver.or B     0/0/8          -NA-     sol-amd64     au
>> | ---------------------------------------------------------------------------------
>> | short-sol@wolfsbane.toolserver B     0/0/12         0.64     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | medium-lx@mayapple.toolserver. B     0/0/32         -NA-     linux-x64     adu
>> | ---------------------------------------------------------------------------------
>> | medium-lx@nightshade.toolserve B     0/0/8          1.05     linux-x64
>> | ---------------------------------------------------------------------------------
>> | medium...@yarrow.toolserver.or B     0/0/8          0.02     linux-x64
>> | ---------------------------------------------------------------------------------
>> | longrun-lx@nightshade.toolserv BI    0/0/64         1.05     linux-x64
>> | ---------------------------------------------------------------------------------
>> | longrun-lx@yarrow.toolserver.o BI    0/0/64         0.02     linux-x64
>> | ---------------------------------------------------------------------------------
>> | longrun-sol@willow.toolserver. BI    0/0/64         -NA-     sol-amd64     au
>> | ---------------------------------------------------------------------------------
>> | medium-sol@ortelius.toolserver B     0/0/4          1.52     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | medium-sol@wolfsbane.toolserve B     0/0/4          0.64     sol-amd64
>> | ---------------------------------------------------------------------------------
>> | longrun2-sol@clematis.toolserv B     0/0/8          0.03     sol-amd64     d
>> | ---------------------------------------------------------------------------------
>> | longrun2-sol@hawthorn.toolserv B     0/0/8          0.23     sol-amd64     d
>> | ---------------------------------------------------------------------------------
>> | longrun3-sol@willow.toolserver B     0/0/4          -NA-     sol-amd64     aduE
>>
>> I filed https://jira.toolserver.org/browse/TS-1650 on Monday, to no avail
>> so far.
>>
>> Tim
>>
>>
>> _______________________________________________
>> Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
>> https://lists.wikimedia.org/mailman/listinfo/toolserver-l
>> Posting guidelines for this list:
>> https://wiki.toolserver.org/view/Mailing_list_etiquette
>>
>>
