*Timeline*

Failures began:  24 Nov 2015 1900 GMT

Notifications of failures began:  24 Nov 2015 1930 GMT

Service restored for apt:  24 Nov 2015 2000 GMT

Service partially restored for yum:  24 Nov 2015 2200 GMT

Service fully restored:  24 Nov 2015 2320 GMT

*Impact*


For part of the outage (the duration varied by platform), the repositories
on yum.puppetlabs.com and apt.puppetlabs.com were unusable due to missing
content, resulting in 404 errors for both repository metadata and packages.
Several users contacted us via IRC, Twitter, JIRA, and other means.

*What happened?*

Puppet Labs manages a number of servers that run our public package
repository infrastructure. One of these servers was recently taken down for
emergency maintenance by our provider (which was not disruptive). After
maintenance was completed, we manually initiated a synchronization job to
update the package contents and metadata on this server. Due to human error,
this sync was run in reverse, from the out-of-date node to an up-to-date
node. As this process is destructive[1], it resulted in loss of data and
corruption of repository metadata.
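
The failure mode described above can be illustrated with a minimal sketch.
This is not our actual tooling: the repository model (a mapping of package
file name to publish time), the package names, and the functions
destructive_sync and guarded_sync are all hypothetical, chosen only to show
why direction matters for a non-additive sync and how a simple freshness
check could refuse a reversed run.

```python
from typing import Dict

# Hypothetical model: package file name -> publish time (epoch seconds).
Repo = Dict[str, int]


def destructive_sync(source: Repo, dest: Repo) -> Repo:
    """Mirror source onto dest: anything absent from source is dropped."""
    return dict(source)


def newest(repo: Repo) -> int:
    """Publish time of the most recent package, or 0 for an empty repo."""
    return max(repo.values(), default=0)


def guarded_sync(source: Repo, dest: Repo) -> Repo:
    """Refuse to mirror when the source looks older than the destination."""
    if newest(source) < newest(dest):
        raise ValueError("source is older than destination; refusing to sync")
    return destructive_sync(source, dest)


# Illustrative data only; these are not real package names.
up_to_date = {"pkg-1.0.rpm": 100, "pkg-1.1.rpm": 200}
stale = {"pkg-1.0.rpm": 100}

# Intended direction: the stale node catches up.
fixed = guarded_sync(source=up_to_date, dest=stale)

# Reversed direction without a guard: the newer package is lost.
lost = destructive_sync(source=stale, dest=up_to_date)
```

With the guard in place, calling guarded_sync(source=stale, dest=up_to_date)
raises an error instead of overwriting the up-to-date node. A real check
would likely need a stronger signal than a single timestamp.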

*What we did to restore service*

As part of recent efforts to improve how we ship software to our apt
repositories, a complete backup of those repositories already exists on
disk. Shortly after triaging the issue we were able to restore
apt.puppetlabs.com from that copy.

To restore the yum repositories we had to initiate a synchronization job
from an internal system. The synchronization job took a considerable
amount of time due to the volume of RPM packages and yum metadata that
needed to be restored.

*Next steps*

To prevent a recurrence, we’re continuing to update our release
automation and process to better leverage staging infrastructure before
changes are shipped to production infrastructure. This will help ensure
that we always have a complete copy of our repositories that can be
redeployed in case of emergency. We’re also implementing improved
monitoring of this infrastructure and the package repositories it serves.
This outage exposed some deficiencies in how we communicate issues like
this, and we are committed to developing a better communication plan for
future incidents.


We apologize for the outage,
Puppet Labs Release Engineering

-------
[1] - The synchronization process is destructive (not strictly additive) in
order to remove stale metadata when new packages are published.
-- 
Morgan Rhodes
mor...@puppetlabs.com
Release Engineer
