Re: [Toolserver-l] Maintenance: Rebooting ortelius web server
Marlen Caemmerer marlen.caemme...@wikimedia.de wrote:

I would like to reboot ortelius, one of the web servers at tomorrow, Tuesday 1830 UTC

Apparently, wolfsbane rebooted today as well:

| timl@wolfsbane:~$ uptime
| 16:49pm up 5:00, 2 users, load average: 1.16, 1.24, 1.47
| timl@wolfsbane:~$

Perhaps related to that, SGE queues on ortelius and wolfsbane are in state au (alarm, unknown):

| timl@wolfsbane:~$ qstat -f -explain a | sed -ne '1,2p' -e '/ortelius\|wolfsbane/,/^-/p'
| queuename qtype resv/used/tot. load_avg arch states
| -
| short-sol@ortelius.toolserver. B 0/0/8 -NA- sol-amd64 au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=200M
| alarm gf:available=1 load-threshold=0
| -
| short-sol@wolfsbane.toolserver B 0/10/12-NA- sol-amd64 au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=200M
| alarm gf:available=1 load-threshold=0
| -
| medium-sol@ortelius.toolserver B 0/0/4 -NA- sol-amd64 au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for np_load_long because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=100M
| alarm gf:available=1 load-threshold=0
| -
| medium-sol@wolfsbane.toolserve B 0/3/4 -NA- sol-amd64 au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for np_load_long because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=100M
| alarm gf:available=1 load-threshold=0
| -
| timl@wolfsbane:~$

Tim
___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: https://wiki.toolserver.org/view/Mailing_list_etiquette
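For anyone who wants to spot such queues without reading the full listing, here is a minimal Python sketch that picks out queue instances whose state string contains "u" (unknown) from `qstat -f` style text. It assumes the line layout shown in the transcript above; the name `unreachable_queues`, the regular expression, and the last sample line (a healthy queue on a made-up host) are illustrative assumptions, not part of SGE or of the original post.

```python
import re

# Assumed layout of a queue-instance line, taken from the transcript above:
#   <queue>@<host> <qtype> <resv/used/tot> <load_avg> <arch> [<states>]
# The slot and load columns may run together (e.g. "0/10/12-NA-").
QUEUE_LINE = re.compile(
    r"^(?P<queue>\S+@\S+)\s+(?P<qtype>[A-Z]+)\s+"
    r"(?P<slots>\d+/\d+/\d+)\s*(?P<load>\S+)\s+"
    r"(?P<arch>\S+)(?:\s+(?P<states>\S+))?\s*$"
)


def unreachable_queues(qstat_output):
    """Return (queue, states) pairs whose state string contains 'u'."""
    hits = []
    for raw in qstat_output.splitlines():
        # Drop the "| " prefix used for terminal output in the mail.
        line = raw.lstrip("| ").rstrip()
        match = QUEUE_LINE.match(line)
        if match and match.group("states") and "u" in match.group("states"):
            hits.append((match.group("queue"), match.group("states")))
    return hits


# Queue lines copied from the mail, plus one hypothetical healthy queue
# (the last line) that is NOT from the original output.
SAMPLE = """\
| queuename qtype resv/used/tot. load_avg arch states
| short-sol@ortelius.toolserver. B 0/0/8 -NA- sol-amd64 au
| error: no value for cpu because execd is in unknown state
| short-sol@wolfsbane.toolserver B 0/10/12-NA- sol-amd64 au
| medium-sol@ortelius.toolserver B 0/0/4 -NA- sol-amd64 au
| medium-sol@wolfsbane.toolserve B 0/3/4 -NA- sol-amd64 au
| short-sol@example.toolserver.o B 0/1/8 0.42 sol-amd64
"""

for queue, states in unreachable_queues(SAMPLE):
    print(queue, states)
```

On the sample this prints the four ortelius and wolfsbane queue instances and skips the hypothetical healthy one. Header, error, and alarm lines are ignored because they contain no `queue@host` token at the start.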
Re: [Toolserver-l] Maintenance: Rebooting ortelius web server
On 09/05/13 18:57, Tim Landscheidt wrote:

Marlen Caemmerer marlen.caemme...@wikimedia.de wrote:

I would like to reboot ortelius, one of the web servers at tomorrow, Tuesday 1830 UTC

Apparently, wolfsbane rebooted today as well:

| timl@wolfsbane:~$ uptime
| 16:49pm up 5:00, 2 users, load average: 1.16, 1.24, 1.47
| timl@wolfsbane:~$

Perhaps related to that, SGE queues on ortelius and wolfsbane are in state au (alarm, unknown):

Yes, sge_execd seems not to be running on them. Plus medium and longrun queues in yarrow are in error state. I tried cleaning them, but they failed again.
[Toolserver-l] Inodes have run out on yarrow's /var
(anonymous) wrote:

[...] Plus medium and longrun queues in yarrow are in error state. I tried cleaning them, but they failed again.

I think I found the culprit:

| timl@yarrow:~$ df -i /var/spool/cron/atjobs
| Filesystem Inodes IUsed IFree IUse% Mounted on
| /dev/mapper/yarrow0-var
| 915712 915712 0 100% /var
| timl@yarrow:~$

With my privileges, I can't find out what's causing this. If I could, I would look first at /var/log/iptraf and /var/spool/postfix/*.

Once this is fixed, we need Nagios alerts for inode usage on /var as well.

Tim

P.S.: Toolserver Office Hour + 10 days = today.
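As a rough illustration of the two follow-ups mentioned above (finding what is eating the inodes, and alerting on it), here is a minimal Python sketch. It assumes a Unix host where `os.statvfs` is available; the function names and the 90%/95% thresholds are arbitrary assumptions, and the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL). It is not the Toolserver's actual monitoring setup.

```python
import os

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def inode_usage_percent(total, free):
    """Percentage of inodes in use; 0.0 if the filesystem reports none."""
    if total == 0:
        return 0.0
    return 100.0 * (total - free) / total


def classify(percent, warn=90.0, crit=95.0):
    """Map a usage percentage to a Nagios-style status code."""
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK


def check_inodes(path, warn=90.0, crit=95.0):
    """Return (status_code, message) for the filesystem holding path."""
    st = os.statvfs(path)
    percent = inode_usage_percent(st.f_files, st.f_ffree)
    code = classify(percent, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[code]
    return code, "INODES %s - %s at %.1f%% inode usage" % (label, path, percent)


def entries_per_subdir(root):
    """Count files and directories below each immediate subdirectory of root.

    Unreadable directories are skipped silently (os.walk's default), so the
    counts are lower bounds when run without sufficient privileges.
    """
    counts = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for _dirpath, dirnames, filenames in os.walk(entry.path):
                total += len(dirnames) + len(filenames)
            counts[entry.path] = total
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# The figures from the df -i output above: 915712 inodes, 0 free.
print(classify(inode_usage_percent(915712, 0)))
```

A wrapper script could print the message from `check_inodes("/var")` and exit with the returned code so Nagios picks up the status, while `entries_per_subdir("/var")`, run with enough privileges, would list the subdirectories holding the most entries first.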