Josh,

Thank you for submitting this incident report. I created a bug to track the
incident and remediation efforts:
https://bugzilla.mozilla.org/show_bug.cgi?id=1486650

- Wayne

On Fri, Aug 24, 2018 at 1:07 PM josh--- via dev-security-policy <
dev-security-policy@lists.mozilla.org> wrote:

> To see the original communication on our Community Forums, click here:
>
>
> https://community.letsencrypt.org/t/2018-08-23-ocsp-responder-incident/70350
>
> At 17:47 UTC on August 23rd, 2018 we deployed a configuration change to
> our OCSP responder service that resulted in 90% of traffic to our origin
> inaccurately receiving OCSP "unauthorized" statuses for valid OCSP
> requests. Most OCSP responses that were cached at our CDN prior to the
> incident were not affected. The change was reverted at 19:33 UTC the same
> day to resolve the problem, though CDN caching may have resulted in
> affected statuses being served for a limited period of time after
> resolution.
>
> The root technical cause of this incident was [a change](
> https://github.com/letsencrypt/boulder/pull/3815) developed during a
> previous incident in which malformed OCSP traffic was causing excessive
> strain on the OCSP responder. Unfortunately [a bug in the implementation](
> https://github.com/letsencrypt/boulder/issues/3829) improperly rejected
> OCSP requests unless they matched the last configured serial prefix rather
> than any configured serial prefix. We have since [fixed the bug](
> https://github.com/letsencrypt/boulder/pull/3830).
>
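
To make the "last configured prefix rather than any configured prefix"
failure mode concrete, here is a minimal Go sketch of one way such a bug can
arise. It is not the Boulder code: the function names and the prefix values
("fa", "03") are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesLastPrefix shows the faulty pattern (hypothetical, not Boulder's
// code): each iteration overwrites the previous verdict, so only the last
// configured prefix ends up deciding the result.
func matchesLastPrefix(serial string, prefixes []string) bool {
	match := false
	for _, p := range prefixes {
		match = strings.HasPrefix(serial, p)
	}
	return match
}

// matchesAnyPrefix shows the intended behavior: accept the serial as soon as
// any configured prefix matches.
func matchesAnyPrefix(serial string, prefixes []string) bool {
	for _, p := range prefixes {
		if strings.HasPrefix(serial, p) {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical serial prefixes; real values would come from configuration.
	prefixes := []string{"fa", "03"}
	for _, serial := range []string{"fa12ab", "03cd34", "ff0000"} {
		fmt.Printf("serial=%s lastOnly=%t any=%t\n",
			serial,
			matchesLastPrefix(serial, prefixes),
			matchesAnyPrefix(serial, prefixes))
	}
}
```

With a single configured prefix the two functions agree, which is consistent
with the report's note that a test covering only one acceptable prefix would
not expose the difference.
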
> We first became aware of the problem at 17:52 UTC after our internal
> alerting flagged invalid OCSP responses for certificates issued by our
> monitoring systems, though the scale of the issue was not immediately
> clear. We began investigating the root cause, identified the problem at
> 19:26 UTC and immediately disabled the prefix validation feature in staging
> and production.
>
> The bug was not caught during testing because the unit test accompanying
> the initial PR did not cover the case of multiple acceptable prefixes. The
> bug was not caught in our staging environment for two reasons: (1) Our
> internal OCSP monitoring looks for HTTP 500s, but ignores OCSP
> "unauthorized" responses, because a large number of such responses can be
> triggered externally by misconfigured clients; (2) Our end-to-end OCSP
> monitoring tests were working in production, but not in staging.
>
> Remediation items:
>
> 1. Review our procedures for ensuring that all monitoring tools are
> applied to both production and staging environments.
> 2. Extend OCSP monitoring to include OCSP statuses (unauthorized, revoked,
> ok, etc) in addition to HTTP statuses.
> 3. Add alerts when the fraction of unauthorized or revoked OCSP responses
> is extremely high.
>
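
Remediation items 2 and 3 above amount to counting responses by OCSP status
and alerting on the share of a given status. A minimal Go sketch of that
check follows. The function names, the status labels used as map keys, and
the 0.5 threshold are assumptions for illustration, not values taken from the
report or from any real monitoring system.

```go
package main

import "fmt"

// statusFraction returns the share of responses carrying the given OCSP
// status. counts maps a status label to the number of responses observed in
// some monitoring window. An empty window yields 0.
func statusFraction(counts map[string]int, status string) float64 {
	total := 0
	for _, n := range counts {
		total += n
	}
	if total == 0 {
		return 0
	}
	return float64(counts[status]) / float64(total)
}

// shouldAlert reports whether the share of the given status exceeds the
// threshold. The threshold is a hypothetical tuning parameter.
func shouldAlert(counts map[string]int, status string, threshold float64) bool {
	return statusFraction(counts, status) > threshold
}

func main() {
	// Hypothetical window resembling the incident: most origin responses
	// are "unauthorized".
	window := map[string]int{"good": 10, "unauthorized": 90}
	fmt.Println("alert on unauthorized:", shouldAlert(window, "unauthorized", 0.5))
	fmt.Println("alert on revoked:", shouldAlert(window, "revoked", 0.5))
}
```

Alerting on a fraction rather than on an absolute count is one way to
tolerate a steady background of "unauthorized" responses from misconfigured
clients while still catching a sudden shift.
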
> Timeline:
>
> 2018-08-23 01:43 UTC - feature configured in staging
> 2018-08-23 17:47 UTC - feature configured in production
> 2018-08-23 19:31 UTC - feature disabled in staging
> 2018-08-23 19:33 UTC - feature disabled in production
> _______________________________________________
> dev-security-policy mailing list
> dev-security-policy@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-security-policy
>
