After a bit of forensics and spending most of Friday in a machine room, here is 
a synopsis of what has happened:

The OHM server is a 'redundant everything' setup; everything comes in twos, 
including the power supplies. As is best practice, one was routed to the UPS 
and the other directly to the mains to allow battery maintenance without 
downtime. Sometime in the past few months, the power supply hooked up to the 
UPS failed without triggering the software alert. This resulted in a situation 
where power bumps would trigger a hard reboot while the UPS reported that 
"everything was completely under control, move along now".

Mystery reboots and filesystem faults had been something that was being 
investigated; they were too much for the /usr filesystem, which had been 
corrupting itself quietly while causing random software faults. Several layers 
of disk redundancy, data consistency checking and backups have ensured that no 
OHM data was lost but the base system itself is a mess at this point. To that 
end I will reinitialize the entire storage array, reinstall from scratch and 
update the rails port. We should be back up by Friday; a placeholder has been 
put up in the meantime.

If you happen to be the person who keeps pushing pins and nails into a voodoo 
doll mockup of OHM, would you stop already? -rhw




