Re: [Toolserver-l] Maintenance: Rebooting ortelius web server

2013-05-09 Thread Tim Landscheidt
Marlen Caemmerer marlen.caemme...@wikimedia.de wrote:

> I would like to reboot ortelius, one of the web servers at

> tomorrow, Tuesday 1830 UTC

Apparently, wolfsbane rebooted today as well:

| timl@wolfsbane:~$ uptime
|  16:49pm  up   5:00,  2 users,  load average: 1.16, 1.24, 1.47
| timl@wolfsbane:~$

Perhaps related to that, SGE queues on ortelius and wolfsbane
are in state au (alarm, unknown):

| timl@wolfsbane:~$ qstat -f -explain a | sed -ne '1,2p' -e '/ortelius\|wolfsbane/,/^-/p'
| queuename                      qtype resv/used/tot. load_avg arch          states
| ---------------------------------------------------------------------------------
| short-sol@ortelius.toolserver. B     0/0/8          -NA-     sol-amd64     au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=200M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| short-sol@wolfsbane.toolserver B     0/10/12        -NA-     sol-amd64     au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=200M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| medium-sol@ortelius.toolserver B     0/0/4          -NA-     sol-amd64     au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for np_load_long because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=100M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| medium-sol@wolfsbane.toolserve B     0/3/4          -NA-     sol-amd64     au
| error: no value for np_load_short because execd is in unknown state
| error: no value for np_load_avg because execd is in unknown state
| error: no value for np_load_long because execd is in unknown state
| error: no value for cpu because execd is in unknown state
| error: no value for mem_free because execd is in unknown state
| alarm gf:tmp_free=100G load-threshold=100M
| alarm gf:available=1 load-threshold=0
| ---------------------------------------------------------------------------------
| timl@wolfsbane:~$
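
For anyone who wants to watch for this automatically, here is a rough Python sketch that picks the queue instances with suspicious state flags out of `qstat -f` output. It is not part of the Toolserver setup: the function name `unhealthy_queues`, the regular expression, and the `example-host` line in the sample are made up for illustration, and the column layout is assumed to match the transcript above.

```python
import re

# One queue-instance line of `qstat -f`:
#   queuename  qtype  resv/used/tot.  load_avg  arch  [states]
QUEUE_LINE = re.compile(
    r"^(?P<queue>\S+@\S+)\s+(?P<qtype>[A-Z]+)\s+"
    r"(?P<slots>\d+/\d+/\d+)\s+(?P<load>\S+)\s+(?P<arch>\S+)"
    r"(?:\s+(?P<states>\S+))?\s*$"
)


def unhealthy_queues(qstat_output, flags="auE"):
    """Map queue instance -> state string for lines whose state column
    contains any of `flags` (a = alarm, u = unknown, E = error)."""
    result = {}
    for line in qstat_output.splitlines():
        match = QUEUE_LINE.match(line.strip())
        if not match:
            continue  # header, separator, "error:" and "alarm" lines
        states = match.group("states") or ""
        if set(states) & set(flags):
            result[match.group("queue")] = states
    return result


# Two lines taken from the transcript plus one hypothetical healthy queue.
SAMPLE = """\
queuename                      qtype resv/used/tot. load_avg arch          states
short-sol@ortelius.toolserver. B     0/0/8          -NA-     sol-amd64     au
error: no value for cpu because execd is in unknown state
medium-sol@wolfsbane.toolserve B     0/3/4          -NA-     sol-amd64     au
short-sol@example-host         B     0/2/8          0.50     sol-amd64
"""

if __name__ == "__main__":
    for queue, states in unhealthy_queues(SAMPLE).items():
        print(queue, states)
```

In practice the input would come from running `qstat -f` itself, for example through `subprocess.run`, instead of the hard-coded sample.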

Tim


___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette

Re: [Toolserver-l] Maintenance: Rebooting ortelius web server

2013-05-09 Thread Platonides
On 09/05/13 18:57, Tim Landscheidt wrote:
> Marlen Caemmerer marlen.caemme...@wikimedia.de wrote:
>
>> I would like to reboot ortelius, one of the web servers at
>
>> tomorrow, Tuesday 1830 UTC
>
> Apparently, wolfsbane rebooted today as well:
>
> | timl@wolfsbane:~$ uptime
> |  16:49pm  up   5:00,  2 users,  load average: 1.16, 1.24, 1.47
> | timl@wolfsbane:~$
>
> Perhaps related to that, SGE queues on ortelius and wolfsbane
> are in state au (alarm, unknown):

Yes, sge_execd seems not to be running on them.

Plus medium and longrun queues in yarrow are in error state. I tried
cleaning them, but they failed again.


[Toolserver-l] Inodes have run out on yarrow's /var

2013-05-09 Thread Tim Landscheidt
Platonides wrote:

> [...]

> Plus medium and longrun queues in yarrow are in error state. I tried
> cleaning them, but they failed again.

I think I found the culprit:

| timl@yarrow:~$ df -i /var/spool/cron/atjobs
| Filesystem            Inodes   IUsed   IFree IUse% Mounted on
| /dev/mapper/yarrow0-var
|                       915712  915712       0  100% /var
| timl@yarrow:~$

With my privileges, I can't find out what's causing this.
If I could, the first places I would look are
/var/log/iptraf and /var/spool/postfix/*.
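
For whoever does have the privileges, a rough Python sketch of how one might find the directory that is eating the inodes. It is an approximation, not a definitive tool: it counts directory entries rather than inodes (hard links are counted once per name), it silently skips directories it cannot read, and it does not stop at filesystem boundaries. The function names `entries_per_subdir` and `top_consumers` are made up for this example.

```python
import os


def entries_per_subdir(root):
    """Return {subdirectory path: number of files and directories in it,
    counted recursively and including the subdirectory itself}."""
    counts = {}
    with os.scandir(root) as entries:
        subdirs = [e.path for e in entries if e.is_dir(follow_symlinks=False)]
    for subdir in subdirs:
        total = 1  # the subdirectory itself occupies an inode
        # os.walk does not follow symlinks and ignores unreadable
        # directories by default.
        for _dirpath, dirnames, filenames in os.walk(subdir):
            total += len(dirnames) + len(filenames)
        counts[subdir] = total
    return counts


def top_consumers(root, n=5):
    """Return the n subdirectories of `root` with the most entries."""
    counts = entries_per_subdir(root)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    for path, count in top_consumers("/var"):
        print(f"{count:10d}  {path}")
```

Run as root against /var, the largest numbers should point at the culprit; the same call can then be repeated on that subdirectory to narrow it down.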

After fixing this, we need Nagios alerts for /var as well.
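
As a starting point, a minimal sketch of what such a check could look like as a Nagios-style plugin in Python. The exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN follow the usual Nagios plugin convention; the thresholds of 80% and 90%, the function names, and the output wording are assumptions for illustration, not an existing Toolserver check.

```python
import os

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def inode_usage_percent(path):
    """Percentage of inodes in use on the filesystem containing `path`,
    or None if the filesystem does not report inode counts."""
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


def classify(percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios status code."""
    if percent is None:
        return UNKNOWN
    if percent >= crit:
        return CRITICAL
    if percent >= warn:
        return WARNING
    return OK


def check(path="/var", warn=80.0, crit=90.0):
    """Return (status code, one-line message) for `path`."""
    try:
        percent = inode_usage_percent(path)
    except OSError as exc:
        return UNKNOWN, f"INODES UNKNOWN - {path}: {exc}"
    status = classify(percent, warn, crit)
    detail = "no inode data" if percent is None else f"{percent:.1f}% inodes used"
    return status, f"INODES {LABELS[status]} - {path}: {detail}"

# As a plugin, one would print the message and exit with the status code:
#     status, message = check("/var"); print(message); raise SystemExit(status)
```

With the numbers from the `df -i` output above (915712 of 915712 inodes used), this would report CRITICAL.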

Tim

P. S.: Toolserver Office Hour + 10 days = today.

