Re: [Toolserver-l] SGE queues stalled

2012-12-05 Thread Merlissimo

On 05.12.2012 16:21, Morten Wang wrote:


> Is there a way for me to find that out myself, e.g. using qstat?  I had a
> look at the qstat man-page, but judging by the descriptions it looks like
> something I'd have to fiddle around with if/when a job gets queued for a
> long time at some point in the future to figure out how to do.


qstat -j <job-id>

lists a scheduling info section.

Example:
qstat -j 799111

scheduling info:

queue instance "short-...@ortelius.toolserver.org" dropped because it is 
overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.00 with 
nproc=4) >= 1.1
queue instance "longrun-...@willow.toolserver.org" dropped because it is 
overloaded: np_load_short=2.528320 (= 2.528320 + 0.8 * 0.00 with 
nproc=8) >= 2.0
queue instance "medium-...@ortelius.toolserver.org" dropped because it 
is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.00 with 
nproc=4) >= 0.8
queue instance "longrun2-...@clematis.toolserver.org" dropped because it 
is disabled
queue instance "longrun2-...@hawthorn.toolserver.org" dropped because it 
is disabled
(-l 
h_rt=57600,mem_free=890M,sql=1,sql-s7-rr=3,sqlprocs-s7=3,tmp_free=20M,user_slot=2,virtual_free=890M) 
cannot run globally because it offers only gc:sql-s7-rr=0.00


As you can see, the job cannot run on clematis and hawthorn because the 
queues there are disabled, and the queues on willow and ortelius are 
temporarily overloaded. wolfsbane, nightshade and yarrow do not appear in 
this list, so the bot could otherwise start on those servers. But the last 
line, "cannot run globally because it offers only gc:sql-s7-rr=0.00", 
shows that the resource sql-s7-rr is currently not available on any server. 
That is why the job stays queued until the s7 database is usable again.
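
If you want to check this from a script rather than by eye, the reading 
above can be sketched in Python. This is only a rough illustration, not an 
official tool: the function name summarize() and the abridged sample text 
are my own assumptions, and the patterns only cover the three message types 
that appear in the output above.

```python
import re

# Abridged copy of the "scheduling info" lines shown above, including the
# line wrapping. The -l request list is shortened for illustration.
SAMPLE = """\
queue instance "short-...@ortelius.toolserver.org" dropped because it is
overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.00 with
nproc=4) >= 1.1
queue instance "longrun2-...@clematis.toolserver.org" dropped because it
is disabled
(-l h_rt=57600,sql=1,sql-s7-rr=3) cannot run globally because it offers
only gc:sql-s7-rr=0.00
"""


def summarize(text):
    """Group the scheduling messages by the reason a job is held back."""
    # Undo the line wrapping so each message can be matched in one piece.
    flat = " ".join(text.split())
    dropped = re.findall(
        r'queue instance "([^"]+)" dropped because it is (overloaded|disabled)',
        flat,
    )
    blocked = re.findall(
        r"cannot run globally because it offers only gc:([\w-]+)=([\d.]+)",
        flat,
    )
    return {
        "overloaded": [q for q, why in dropped if why == "overloaded"],
        "disabled": [q for q, why in dropped if why == "disabled"],
        # A global resource offered with value 0 blocks the job on every host.
        "unavailable_resources": [
            name for name, value in blocked if float(value) == 0.0
        ],
    }


if __name__ == "__main__":
    for reason, items in summarize(SAMPLE).items():
        print(reason, items)
```

A non-empty "unavailable_resources" entry is the case described above: the 
job will wait no matter how many queues are free.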


Merlissimo

___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette


Re: [Toolserver-l] SGE queues stalled

2012-12-05 Thread Morten Wang
Ah, didn't think of that, of course the obvious explanation. Thanks for
looking into that!

Is there a way for me to find that out myself, e.g. using qstat?  I had a
look at the qstat man-page, but judging by the descriptions it looks like
something I'd have to fiddle around with if/when a job gets queued for a
long time at some point in the future to figure out how to do.


Regards,
Morten


On 5 December 2012 07:11, Merlissimo wrote:

> Server sql-s1-rr was unavailable during the night, so the resource
> sql-s1-rr was 0.
>
> Because I am not a TS admin, I could not check whether you requested this
> resource for this job. But nosy has just had a look and confirmed my
> suspicion. The job was started once the resource sql-s1-rr was available
> again.
>
> Merlissimo
>
> On 04.12.2012 16:44, Morten Wang wrote:
>
>> Looks like the issue got resolved around 09:00 UTC, judging from the qacct
>> output:
>>
>> jobname opentasks
>> jobnumber 873860
>> [...]
>> qsub_time Mon Dec 3 22:19:03 2012
>> start_time Tue Dec 4 09:06:32 2012
>> end_time Tue Dec 4 09:21:18 2012
>>
>> If you want to look into it more closely, this job was submitted by me
>> (user: nettrom) through my crontab on the submit servers.
>>
>>
>> Cheers,
>> Morten
>>
>>

Re: [Toolserver-l] SGE queues stalled

2012-12-05 Thread Merlissimo
Server sql-s1-rr was unavailable during the night, so the resource 
sql-s1-rr was 0.

Because I am not a TS admin, I could not check whether you requested this 
resource for this job. But nosy has just had a look and confirmed my 
suspicion. The job was started once the resource sql-s1-rr was available 
again.


Merlissimo

On 04.12.2012 16:44, Morten Wang wrote:

> Looks like the issue got resolved around 09:00 UTC, judging from the qacct
> output:
>
> jobname opentasks
> jobnumber 873860
> [...]
> qsub_time Mon Dec 3 22:19:03 2012
> start_time Tue Dec 4 09:06:32 2012
> end_time Tue Dec 4 09:21:18 2012
>
> If you want to look into it more closely, this job was submitted by me
> (user: nettrom) through my crontab on the submit servers.
>
>
> Cheers,
> Morten
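
For reference, how long the quoted job waited in the queue follows directly 
from the qacct timestamps. A minimal sketch in Python, assuming the 
timestamps are in the English-locale format shown above; the helper name 
parse() is only for illustration:

```python
from datetime import datetime, timedelta

# Timestamp layout as printed in the qacct output above (English locale).
FMT = "%a %b %d %H:%M:%S %Y"


def parse(stamp):
    """Turn a qacct timestamp string into a datetime object."""
    return datetime.strptime(stamp, FMT)


qsub_time = parse("Mon Dec 3 22:19:03 2012")
start_time = parse("Tue Dec 4 09:06:32 2012")
end_time = parse("Tue Dec 4 09:21:18 2012")

wait = start_time - qsub_time    # time spent pending in the queue
runtime = end_time - start_time  # time the job actually ran

print("waited:", wait)    # waited: 10:47:29
print("ran for:", runtime)  # ran for: 0:14:46
```

So the job sat in the queue for almost eleven hours and then finished in 
about a quarter of an hour, which fits a resource being unavailable 
overnight rather than a problem with the job itself.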





Re: [Toolserver-l] Defective database s7

2012-12-05 Thread Marlen Caemmerer

Hello,

On Fri, 16 Nov 2012, Marlen Caemmerer wrote:



> as some of you might have noticed s7 is badly corrupted.



I finally redumped the s7 instance into a new s7. 
Unfortunately it still has no centralauth.localnames.

We are still waiting for the WMF to send us a copy of s7, but for now I was 
able to start the replication again.

It will catch up over the next few days, and your databases are at least 
writable again.

If anything is still missing, please tell me; I still have all the data.

DaB, please be so kind as to restart the replication of wikidata.

Cheers
nosy
