Re: [Toolserver-l] SGE queues stalled
On 05.12.2012 16:21, Morten Wang wrote:
> Is there a way for me to find that out myself, e.g. using qstat? I had a
> look at the qstat man-page, but judging by the descriptions it looks like
> something I'd have to fiddle around with if/when a job gets queued for a
> long time at some point in the future to figure out how to do.

qstat -j lists a "scheduling info" section. Example:

    qstat -j 799111

    scheduling info:
      queue instance "short-...@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.00 with nproc=4) >= 1.1
      queue instance "longrun-...@willow.toolserver.org" dropped because it is overloaded: np_load_short=2.528320 (= 2.528320 + 0.8 * 0.00 with nproc=8) >= 2.0
      queue instance "medium-...@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.00 with nproc=4) >= 0.8
      queue instance "longrun2-...@clematis.toolserver.org" dropped because it is disabled
      queue instance "longrun2-...@hawthorn.toolserver.org" dropped because it is disabled
      (-l h_rt=57600,mem_free=890M,sql=1,sql-s7-rr=3,sqlprocs-s7=3,tmp_free=20M,user_slot=2,virtual_free=890M) cannot run globally because it offers only gc:sql-s7-rr=0.00

As you can see, the job cannot run on clematis and hawthorn because these
queues are disabled. The queues on willow and ortelius temporarily have high
load. wolfsbane, nightshade and yarrow are missing from this list, so the bot
could start on these servers.

But the last line, "cannot run globally because it offers only
gc:sql-s7-rr=0.00", shows that the resource sql-s7-rr is not available on any
server at the moment. That's why the job stays queued until the s7 database
is usable again.

Merlissimo

___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
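The reading of the "scheduling info" section described in this message can be automated. The following is a minimal sketch in Python, not a Toolserver tool: the function name summarize and the two regular expressions are illustrative assumptions, and they are matched only against the lines quoted in this thread. Other Grid Engine versions may word these messages differently.

```python
import re

# Sample lines copied from the "scheduling info" output quoted in this thread.
# The "..." in the queue names is elided in the source and kept as-is.
SAMPLE = """\
queue instance "short-...@ortelius.toolserver.org" dropped because it is overloaded: np_load_short=1.252930 (= 1.252930 + 0.8 * 0.00 with nproc=4) >= 1.1
queue instance "longrun2-...@clematis.toolserver.org" dropped because it is disabled
cannot run globally because it offers only gc:sql-s7-rr=0.00
"""

# Hypothetical patterns, written against the wording above only.
QUEUE_RE = re.compile(r'queue instance "([^"]+)" dropped because it is (\w+)')
GLOBAL_RE = re.compile(r'cannot run globally because it offers only (\S+)=(\S+)')


def summarize(text):
    """Group dropped hosts by reason and collect globally missing resources."""
    dropped = {}    # reason -> list of host names
    blocking = []   # (resource name, offered amount)
    for line in text.splitlines():
        m = QUEUE_RE.search(line)
        if m:
            queue, reason = m.groups()
            host = queue.split("@", 1)[1]
            dropped.setdefault(reason, []).append(host)
            continue
        m = GLOBAL_RE.search(line)
        if m:
            blocking.append((m.group(1), float(m.group(2))))
    return dropped, blocking


dropped, blocking = summarize(SAMPLE)
print(dropped)
print(blocking)
```

A non-empty second result is the interesting case: overloaded or disabled queues only rule out individual hosts, while a resource offered at 0 keeps the job queued everywhere. To run this against a live job, the text could be taken from the output of qstat -j for that job instead of SAMPLE.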
Re: [Toolserver-l] SGE queues stalled
Ah, didn't think of that, of course the obvious explanation. Thanks for
looking into that!

Is there a way for me to find that out myself, e.g. using qstat? I had a
look at the qstat man-page, but judging by the descriptions it looks like
something I'd have to fiddle around with if/when a job gets queued for a
long time at some point in the future to figure out how to do.

Regards,
Morten

On 5 December 2012 07:11, Merlissimo wrote:
> Server sql-s1-rr was unavailable during the night, so resource sql-s1-rr
> was 0.
>
> Because I am not a TS admin, I could not check whether you requested this
> resource for this job. But just now nosy had a look and confirmed my
> suspicion. The job was started after resource sql-s1-rr was available again.
>
> Merlissimo
>
> On 04.12.2012 16:44, Morten Wang wrote:
>> Looks like the issue got resolved around 09:00 UTC, as seen from the
>> qacct output:
>>
>> jobname      opentasks
>> jobnumber    873860
>> [...]
>> qsub_time    Mon Dec 3 22:19:03 2012
>> start_time   Tue Dec 4 09:06:32 2012
>> end_time     Tue Dec 4 09:21:18 2012
>>
>> If you want to look into it more closely, this job was submitted by me
>> (user: nettrom) through my crontab on the submit servers.
>>
>> Cheers,
>> Morten
Re: [Toolserver-l] SGE queues stalled
Server sql-s1-rr was unavailable during the night, so resource sql-s1-rr
was 0.

Because I am not a TS admin, I could not check whether you requested this
resource for this job. But just now nosy had a look and confirmed my
suspicion. The job was started after resource sql-s1-rr was available again.

Merlissimo

On 04.12.2012 16:44, Morten Wang wrote:
> Looks like the issue got resolved around 09:00 UTC, as seen from the
> qacct output:
>
> jobname      opentasks
> jobnumber    873860
> [...]
> qsub_time    Mon Dec 3 22:19:03 2012
> start_time   Tue Dec 4 09:06:32 2012
> end_time     Tue Dec 4 09:21:18 2012
>
> If you want to look into it more closely, this job was submitted by me
> (user: nettrom) through my crontab on the submit servers.
>
> Cheers,
> Morten
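How long the job sat in the queue follows directly from the three qacct timestamps quoted in this message. A small sketch of that arithmetic in Python; the format string is an assumption based on how the timestamps appear in this thread, and the helper name parse is illustrative.

```python
from datetime import datetime

# Assumed timestamp layout, taken from the qacct lines quoted in this thread.
FMT = "%a %b %d %H:%M:%S %Y"


def parse(ts):
    """Parse one qacct-style timestamp into a datetime."""
    return datetime.strptime(ts, FMT)


qsub = parse("Mon Dec 3 22:19:03 2012")
start = parse("Tue Dec 4 09:06:32 2012")
end = parse("Tue Dec 4 09:21:18 2012")

wait = start - qsub   # time spent waiting in the queue
run = end - start     # actual run time

print("queued for:", wait)  # queued for: 10:47:29
print("ran for:", run)      # ran for: 0:14:46
```

So the job waited almost eleven hours for a run of under fifteen minutes, which fits the explanation that it was held back until the sql-s1-rr resource became available again.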
Re: [Toolserver-l] Defective database s7
Hello,

On Fri, 16 Nov 2012, Marlen Caemmerer wrote:
> as some of you might have noticed s7 is badly corrupted.

I finally redumped the s7 instance into a new s7. Unfortunately it still has
no centralauth.localnames. We are still waiting for the WMF to send us a copy
of s7, but for now I could start the replication again. It will catch up over
the next few days, and your DBs are at least writable again.

If anything is still missing, please tell me. I still have all the data.

DaB, please be so kind as to restart the replication of wikidata.

Cheers
nosy