*Timeline*
Failures began: 24 Nov 2015 1900 GMT
Notifications of failures began: 24 Nov 2015 1930 GMT
Service restored for apt: 24 Nov 2015 2000 GMT
Service partially restored for yum: 24 Nov 2015 2200 GMT
Service fully restored: 24 Nov 2015 2320 GMT

*Impact*
For some part of the outage (depending on the platform) repositories on yum.puppetlabs.com and apt.puppetlabs.com were unusable due to missing content, resulting in 404 errors for both repository metadata and packages. Several users contacted us via IRC, Twitter, JIRA, and other means.

*What happened?*
Puppet Labs manages a number of servers that run our public package repository infrastructure. One of these servers was recently taken down for emergency maintenance by our provider (which was not disruptive). After maintenance was completed, we manually initiated a synchronization job to update the package contents and metadata on this server. Due to human error this sync was run in reverse, from the out-of-date node to an up-to-date node. As this process is destructive[1], it resulted in loss of data and corruption of repository metadata.

*What we did to restore service*
As part of recent efforts to improve how we ship software to our apt repositories, a complete backup of those repositories already exists on disk. Shortly after triaging the issue we were able to restore apt.puppetlabs.com from that copy. To restore the yum repositories we had to initiate a synchronization job from an internal system. The synchronization job took a considerable amount of time due to the number of rpm packages and yum metadata that needed to be restored.

*Next steps*
To remediate this in the future we’re continuing to update our release automation and process to better leverage staging infrastructure before changes are shipped to production infrastructure. This will help ensure that we always have a complete copy of our repositories that can be redeployed in case of emergency.
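
As an illustration only, and not a description of Puppet Labs' actual tooling, the reversed-sync failure described above is the kind of mistake a pre-flight check can catch: before a destructive sync, count how many packages it would delete from the destination and refuse if that number is larger than expected. The sketch below assumes each node can be summarized as a set of package file names; `RepoSnapshot`, `check_sync_direction`, and `max_deletions` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoSnapshot:
    """Hypothetical summary of one repository node: a name and its package files."""
    name: str
    packages: frozenset


def check_sync_direction(source, dest, max_deletions=0):
    """Return the sorted list of packages a destructive sync would delete from dest.

    Raises ValueError if the sync would delete more than max_deletions packages,
    which suggests source is the out-of-date node and the direction is reversed.
    """
    would_delete = sorted(dest.packages - source.packages)
    if len(would_delete) > max_deletions:
        raise ValueError(
            f"sync {source.name} -> {dest.name} would delete "
            f"{len(would_delete)} packages (limit {max_deletions}); "
            "check the sync direction"
        )
    return would_delete
```

Because the real sync is intentionally destructive (see [1]), some deletions are normal; the threshold would need to be set to whatever a routine publish is expected to remove, which is an assumption this sketch leaves to the caller.
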
We’re also implementing improved monitoring of this infrastructure and the package repositories it serves. This outage exposed some deficiencies in how we communicate issues like this and we are committed to developing a better communication plan for future incidents.

We apologize for the outage,
Puppet Labs Release Engineering

-------
[1] - The synchronization process is destructive (not strictly additive) in order to remove stale metadata when new packages are published

--
Morgan Rhodes
mor...@puppetlabs.com
Release Engineer