Josh,

Thank you for submitting this incident report. I created a bug to track the incident and remediation efforts:
https://bugzilla.mozilla.org/show_bug.cgi?id=1486650

- Wayne

On Fri, Aug 24, 2018 at 1:07 PM josh--- via dev-security-policy
<dev-security-policy@lists.mozilla.org> wrote:

> To see the original communication on our Community Forums, click here:
>
> https://community.letsencrypt.org/t/2018-08-23-ocsp-responder-incident/70350
>
> At 17:47 UTC on August 23rd, 2018 we deployed a configuration change to
> our OCSP responder service that resulted in 90% of traffic to our origin
> inaccurately receiving OCSP "unauthorized" statuses for valid OCSP
> requests. Most OCSP responses that were cached at our CDN prior to the
> incident were not affected. The change was reverted at 19:33 UTC the same
> day to resolve the problem, though CDN caching may have resulted in
> affected statuses being served for a limited period of time after
> resolution.
>
> The root technical cause of this incident was
> [a change](https://github.com/letsencrypt/boulder/pull/3815) developed
> during a previous incident in which malformed OCSP traffic was causing
> excessive strain on the OCSP responder. Unfortunately
> [a bug in the implementation](https://github.com/letsencrypt/boulder/issues/3829)
> improperly rejected OCSP requests unless they matched the last configured
> serial prefix rather than any configured serial prefix. We have since
> [fixed the bug](https://github.com/letsencrypt/boulder/pull/3830).
>
> We first became aware of the problem at 17:52 UTC after our internal
> alerting flagged invalid OCSP responses for certificates issued by our
> monitoring systems, though the scale of the issue was not immediately
> clear. We began investigating the root cause, identified the problem at
> 19:26 UTC, and immediately disabled the prefix validation feature in
> staging and production.
>
> The bug was not caught during testing because the unit test accompanying
> the initial PR did not cover the case of multiple acceptable prefixes. The
> bug was not caught in our staging environment for two reasons: (1) Our
> internal OCSP monitoring looks for HTTP 500s, but ignores OCSP
> "unauthorized" responses, because a large number of such responses can be
> triggered externally by misconfigured clients; (2) Our end-to-end OCSP
> monitoring tests were working in production, but not in staging.
>
> Remediation items:
>
> 1. Review our procedures for ensuring that all monitoring tools are
> applied to both production and staging environments.
> 2. Extend OCSP monitoring to include OCSP statuses (unauthorized, revoked,
> ok, etc.) in addition to HTTP statuses.
> 3. Add alerts when the fraction of unauthorized or revoked OCSP responses
> is extremely high.
>
> Timeline:
>
> 2018-08-23 01:43 UTC - feature configured in staging
> 2018-08-23 17:47 UTC - feature configured in production
> 2018-08-23 19:31 UTC - feature disabled in staging
> 2018-08-23 19:33 UTC - feature disabled in production
> _______________________________________________
> dev-security-policy mailing list
> dev-security-policy@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-security-policy
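
For readers following the root cause in the quoted report, the difference between "matches the last configured serial prefix" and "matches any configured serial prefix" can be illustrated with a minimal sketch. This is not Boulder's code: the function names, the hex-string representation of serials, and the example prefixes are all hypothetical, chosen only to show how such a bug can pass a test that configures a single prefix.

```python
from typing import Sequence


def matches_last_prefix_only(serial_hex: str, prefixes: Sequence[str]) -> bool:
    """Hypothetical buggy check: the result is overwritten on every
    iteration, so only the last configured prefix decides the outcome."""
    accepted = False
    for prefix in prefixes:
        accepted = serial_hex.startswith(prefix)
    return accepted


def matches_any_prefix(serial_hex: str, prefixes: Sequence[str]) -> bool:
    """Hypothetical corrected check: accept the serial if it starts with
    any one of the configured prefixes."""
    return any(serial_hex.startswith(prefix) for prefix in prefixes)


if __name__ == "__main__":
    # Illustrative values only; these are not real serials or prefixes.
    one_prefix = ["fa"]
    two_prefixes = ["fa", "03"]

    # With a single prefix, both checks agree, so a single-prefix
    # unit test cannot tell them apart.
    print(matches_last_prefix_only("fa1234", one_prefix))
    print(matches_any_prefix("fa1234", one_prefix))

    # With two prefixes, the buggy check rejects a serial that matches
    # the first prefix, while the corrected check accepts it.
    print(matches_last_prefix_only("fa1234", two_prefixes))
    print(matches_any_prefix("fa1234", two_prefixes))
```

A test that configures at least two prefixes and checks a serial matching the first one is enough to distinguish the two behaviours.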