[Toolserver-l] Postmortem: General downtime yesterday

DaB. Sun, 24 Feb 2013 06:23:34 -0800

Hello all,

During the maintenance window yesterday evening the whole cluster was down for 
~30 min, starting at ~21:20 UTC. The problem was independent of the maintenance 
work, but it caused the window to be extended.
The cause was an out-of-memory condition on one of our HA nodes. Unfortunately 
the box did not restart itself, and its HA buddy did not detect the problem 
either, so the services of the out-of-memory box were not switched over to the 
other box. This brought the whole cluster to a standstill until I manually 
rebooted the host. I will check whether I can find some kind of sensor for 
that; in the worst case I will enable our old "reboot if low on memory" script 
again.
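
For illustration only, a low-memory check of the kind mentioned above could 
look roughly like the sketch below. This is not the old script; it assumes a 
Linux host that exposes /proc/meminfo, and the names parse_meminfo, 
is_low_on_memory and the 5% threshold are hypothetical. It only reports the 
condition; the actual reboot or failover action is left out on purpose.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def is_low_on_memory(meminfo, min_free_fraction=0.05):
    """Return True if usable memory is below the given fraction of MemTotal.

    min_free_fraction is an assumed threshold, not a value from the cluster.
    """
    total = meminfo["MemTotal"]
    # MemAvailable is missing on older kernels; fall back to a rough estimate.
    fallback = (
        meminfo.get("MemFree", 0)
        + meminfo.get("Buffers", 0)
        + meminfo.get("Cached", 0)
    )
    available = meminfo.get("MemAvailable", fallback)
    return available < total * min_free_fraction


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        info = parse_meminfo(fh.read())
    if is_low_on_memory(info):
        # Placeholder: a real watchdog would alert, fail over or reboot here.
        print("low on memory")
```

Run from cron or a monitoring agent, a check like this could serve as the 
"sensor"; whether it should reboot the box or only alert is a separate 
decision.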


Sincerely,
DaB.

-- 
Userpage: [[:w:de:User:DaB.]] — PGP: 0x2d3ee2d42b255885


_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette
