(anonymous) wrote:

> That is an SGE scheduler problem.

> I could not comment on your SGE ticket because JIRA does
> not accept my JIRA token. The load limit is set correctly
> because we use np_load_* values, which are the load divided
> by the number of cores on the host. So e.g. SGE stops
> scheduling jobs on nightshade if the host load is more
> than 20, so I think increasing this value does not make
> sense.

You're probably right about this.  I was only assuming a load
limit of 2 from the symptoms displayed.
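
For anyone reading along later: the np_load_* normalization just
divides the raw load average by the host's core count, so (purely
illustrative numbers, since I don't know nightshade's actual core
count or threshold) a 24-core host with np_load_avg=0.83 maps to
a raw load of about 20:

  $ qconf -sq longrun | grep load_thresholds   # queue name assumed
  load_thresholds       np_load_avg=0.83
  $ echo "scale=2; 0.83 * 24" | bc             # threshold * cores
  19.92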

> Your output below contains load adjustments:
>   queue instance "longrun...@yarrow.toolserver.org" dropped
> because it is overloaded: np_load_short=3.215000 (= 0.015000
> + 0.8 * 16.000000 with nproc=4) >= 3.1
> This means that there is a normalized host load of 0.015000
> on yarrow and that 16 jobs were started within the last 4.5
> minutes (= load_adjustment_time). SGE temporarily (for the
> first 4.5 minutes of a job's lifetime) adds some expected
> load for new jobs so that hosts do not become overloaded.
> Most new jobs need some startup time before they really use
> all the resources they need. This prevents scheduling too
> many jobs at once to one execd client.
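
Working that formula out, since it confused me at first: the
per-job adjustment is 0.8, 16 jobs are inside their adjustment
window, and the sum gets normalized by nproc=4:

  np_load_short = 0.015000 + (0.8 * 16) / 4
                = 0.015 + 3.2
                = 3.215  >=  3.1

so the queue instance is dropped even though the real load on
yarrow is almost zero.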

> But as you can also see, in reality there are no new jobs.
> The problem is the response from the master:

> $qping -info damiana 536 qmaster 1
> 05/17/2013 07:03:14:
> SIRM version:             0.1
> SIRM message id:          1
> start time:               05/15/2013 23:47:49 (1368661669)
> run time [s]:             112525
> messages in read buffer:  0
> messages in write buffer: 0
> nr. of connected clients: 8
> status:                   1
> info:                     MAIN: E (112524.48) | signaler000:
> E (112523.98) | event_master000: E (0.27) | timer000: E
> (4.27) | worker000: E (7.05) | worker001: E (8.93) |
> listener000: E (1.03) | scheduler000: E (8.93) |
> listener001: E (5.03) | WARNING

> All threads are in an error state, including the scheduler
> thread. So the scheduler does not accept the status updates
> sent by the execds, and therefore does not know about
> finished jobs and load updates. That's why the qstat output
> shows a (non-existent) overload problem and no running jobs
> (although some old long-running jobs are still running).
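
(Side note for the archives: MAIN's 112524.48 in that info line
is essentially the whole 112525 s run time, so by my reading the
main thread had been in that state more or less since the master
started. I'm hedging, though; qping's SIRM output isn't well
documented.)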

> I think this could be solved by restarting the master
> scheduler process. That is why I (as SGE operator) sent a
> kill command to the scheduler on damiana and hoped that the
> ha_cluster would automatically restart the process/service.
> Sadly, this is not the case, so we have to wait until a TS
> admin can restart the service manually.
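
For future reference, my understanding is that the manual restart
would look roughly like this (untested sketch: the qconf thread
commands are GE 6.2-style, and the init script name is my guess):

  # on damiana, as the SGE admin user
  $ qconf -kt scheduler          # kill the stuck scheduler thread
  $ qconf -at scheduler          # activate (restart) it
  # or, if the whole qmaster needs a bounce:
  $ /etc/init.d/sgemaster stop
  $ /etc/init.d/sgemaster start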

> In the meantime, submitting new jobs will return an error; sorry for that.
> Running and queued jobs are not affected and will keep running or stay queued.

> [...]

Thanks for tracking this down!  Looking at qstat -u *, it
seems to have recovered now.
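
In case anyone replays that check: it just lists every user's
jobs, with the * quoted so the shell doesn't expand it:

  $ qstat -u '*'    # jobs for all users; sane output again = recovered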

Tim

P. S.: Regarding JIRA, did I miss any followup to
       http://permalink.gmane.org/gmane.org.wikimedia.toolserver/5241?

