Ryan,

Wayne and I have been discussing making various improvements to 1.5.2
mandatory for all CAs.  I've made a few improvements to DigiCert's CPSs in
this area, but things probably still could be better.  There will probably be
a CA/B ballot in this area soon.

DigiCert's 1.5.2 has our support email address, and our Certificate Problem 
Report email (which I recently added).  That doesn't really cover everything 
(yet).

It looks like GTS 1.5.2 splits things into security (including CPRs) and
non-security requests.

I didn't chase down any other 1.5.2's yet, but it'd be interesting to hear what
other CAs have here.  I suspect most only have one address for everything.

Something to keep in mind once the CA/B thread shows up.

-Tim

> -----Original Message-----
> From: dev-security-policy [mailto:dev-security-policy-
> bounces+tim.hollebeek=digicert....@lists.mozilla.org] On Behalf Of Ryan
> Hurst via dev-security-policy
> Sent: Wednesday, February 21, 2018 9:53 PM
> To: mozilla-dev-security-pol...@lists.mozilla.org
> Subject: Re: Google OCSP service down
> 
> I wanted to follow up with our findings and a summary of this issue for the
> community.
> 
> Below you will find details on what happened and how we resolved the issue.
> Hopefully this will help explain what happened and potentially help others
> avoid a similar issue.
> 
> Summary
> -------
> January 19th, at 08:40 UTC, a code push to improve OCSP generation for a
> subset of the Google operated Certificate Authorities was initiated. The
> change was related to the packaging of generated OCSP responses. The first
> time this change was invoked in production was January 19th at 16:40 UTC.
> 
> NOTE: The publication of new revocation information to all geographies can
> take up to 6 hours to propagate. Additionally, clients and middle-boxes
> commonly implement caching behavior. This results in a large window where
> clients may have begun to observe the outage.
> 
> NOTE: Most modern web browsers “soft-fail” in response to OCSP server
> availability issues, masking outages. Firefox, however, supports an advanced
> option that allows users to opt-in to “hard-fail” behavior for revocation
> checking. An unknown percentage of Firefox users enable this setting. We
> believe most users who were impacted by the outage were these Firefox users.
> 
> About 9 hours after the deployment of the change began (2018-01-20 01:36
> UTC), a user on Twitter mentioned that they were having problems with their
> hard-fail OCSP checking configuration in Firefox when visiting Google
> properties. This tweet and the few that followed during the outage period were
> not noticed by any Google employees until after the incident’s post-mortem
> investigation had begun.
> 
> About 1 day and 22 hours after the push was initiated (2018-01-21 15:07 UTC),
> a user posted a message to the mozilla.dev.security.policy mailing list
> mentioning that they too were having problems with their hard-fail
> configuration in Firefox when visiting Google properties.
> 
> About two days after the push was initiated, a Google employee discovered the
> post and opened a ticket (2018-01-21 16:10 UTC). This triggered the
> remediation procedures, which began in under an hour.
> 
> The issue was resolved about 2 days and 6 hours from the time it was
> introduced (2018-01-21 22:56 UTC). Once Google became aware of the issue, it
> took 1 hour and 55 minutes to resolve the issue, and an additional 4 hours and
> 51 minutes for the fix to be completely deployed.
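
The elapsed times quoted above can be cross-checked with a little date
arithmetic. This is only a sketch: it assumes the "about N hours/days" figures
are measured from the first production invocation (16:40 UTC on January 19th)
rather than from the start of the push at 08:40 UTC, and that January 19th
means 2018, matching the later timestamps. The variable names are made up for
the example.

```python
from datetime import datetime, timedelta, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' string as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps taken from the summary (year 2018 assumed for January 19th).
first_production_run = utc("2018-01-19 16:40")
tweet = utc("2018-01-20 01:36")
list_post = utc("2018-01-21 15:07")
ticket_opened = utc("2018-01-21 16:10")
resolved = utc("2018-01-21 22:56")

# Elapsed time from the first production invocation to each event.
until_tweet = tweet - first_production_run          # 8:56:00
until_list_post = list_post - first_production_run  # 1 day, 22:27:00
until_resolved = resolved - first_production_run    # 2 days, 6:16:00

# "1 hour and 55 minutes to resolve" plus "4 hours and 51 minutes" to deploy.
fix_ready = ticket_opened + timedelta(hours=1, minutes=55)
fully_deployed = fix_ready + timedelta(hours=4, minutes=51)

print(until_tweet, until_list_post, until_resolved, sep="\n")
print(fully_deployed == resolved)
```

Under those assumptions the fix-plus-deployment interval ends exactly at the
stated resolution time of 22:56 UTC.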
> 
> No customer reports regarding this issue were sent to the notification
> addresses listed in Google's CPSs or on the repository websites for the
> duration of the outage. This extended the duration of the outage.
> 
> Background
> ----------
> Google's OCSP Infrastructure works by generating OCSP responses in batches,
> with each batch being made up of the certificates issued by an individual CA.
> 
> In the case of GIAG2, this batch is produced in chunks of certificates
> issued in the last 370 days. For each chunk, the GIAG2 CA is asked to produce
> the corresponding OCSP responses, the results of which are placed into a
> separate .tar file.
> 
> The issuer of GIAG2 has chosen to issue new certificates to GIAG2
> periodically; as a result, GIAG2 has multiple certificates. Two of these
> certificates no longer have unexpired certificates associated with them. As a
> result, and as expected, the CA does not produce responses for the
> corresponding periods.
> 
> All .tar files produced during this process are then concatenated with the
> --concatenate command in GNU tar. This produces a single .tar file containing
> all of the OCSP responses for the given Certificate Authority. This .tar file
> is then distributed to our global CDN infrastructure for serving.
> 
> A change was made in how we batch these responses: instead of outputting
> many .tar files within a batch, a concatenation of all tar files was
> produced.
> 
> The change in question triggered an unexpected behaviour in GNU tar which
> then manifested as an empty tarball. These "empty" updates ended up being
> distributed to our global CDN, effectively dropping some responses, while
> continuing to serve responses for other CAs.
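
The report does not say which GNU tar behaviour was triggered, so the following
is only a sketch of one general property of the tar format that makes
concatenation fragile, not a reconstruction of the actual bug. A tar archive
ends with zero-filled blocks, and a reader that stops at the first such marker
never sees anything appended after it. The sketch uses Python's standard
tarfile module rather than GNU tar; the helper names and the member name are
made up for the example.

```python
import io
import tarfile


def make_tar(members: dict) -> bytes:
    """Build an uncompressed tar archive in memory (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(blob: bytes, ignore_zeros: bool = False) -> list:
    """List member names, optionally skipping end-of-archive zero blocks."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:",
                      ignore_zeros=ignore_zeros) as tf:
        return tf.getnames()


# A chunk with no responses, followed by a chunk with one (placeholder) response.
empty_chunk = make_tar({})
full_chunk = make_tar({"response-0001.der": b"placeholder OCSP response"})

# Naive byte-level concatenation: the empty chunk's end marker comes first.
joined = empty_chunk + full_chunk

names_default = list_names(joined)                      # stops at first marker
names_tolerant = list_names(joined, ignore_zeros=True)  # reads past the marker

print(names_default)
print(names_tolerant)
```

With default settings the joined archive looks empty, while a reader told to
skip zero blocks still finds the response. GNU tar has an analogous
--ignore-zeros option for reading such archives.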
> 
> During testing of the change, this behaviour was not detected, as the tests
> did not cover the scenario in which some chunks did not contain unexpired
> certificates.
> 
> Findings
> --------
> - The outage only impacted sites with TLS certificates issued by the GIAG2 CA
> as it was the only CA that met the required pre-conditions of the bug.
> - The bug that introduced this failure manifested itself as an empty
> container of OCSP responses. The root cause of the issue was an unexpected
> behavior of GNU tar relating to concatenating tar files.
> - The outage was observed by revocation service monitoring as “unknown
> certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP
> responder operations; they typically are the result of poorly configured
> clients. These events are monitored and a threshold does exist for an on-call
> escalation.
> - Due to a configuration error the designated Google team did not receive an
> escalation message.
> - External users did not use the contact details Google provided in the CPS.
> 
> Remediation Plan
> ----------------
> - A bug fix has been applied to prevent the same issue from happening again.
> - Test cases looking for a minimum number of OCSP responses in each tar were
> added to the test automation suites to catch similar issues in the future.
> - The monitoring system that was misconfigured was updated to use the
> correct address for escalations.
> - Both the Google Trust Services CPS (found on pki.goog) and the Google CPS
> (found on pki.google.com) have been updated to make it clear what email
> address is the most expedient path to reach the PKI team for non-security
> incidents.
> - The Google PKI repository page was updated to show contact details in the
> same way the Google Trust Services repository page already did, in the hope
> of helping users find a path of escalation.
> - The wizard that is returned for mails to the security email address has been
> updated to also include an explicit option for issues related to the “Google
> Certificate Authority” in the hopes of helping users who choose this path of
> escalation.
> - Existing procedures that are relied upon for periodic verification of
> effective escalation have been updated to include unknown certificate
> checking.
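
The "minimum number of OCSP responses in each tar" test mentioned in the
remediation list could look roughly like the sketch below. The report gives no
details of the real test suite, so the function names, the member naming
scheme, and the use of Python's tarfile module are all assumptions made for
illustration.

```python
import io
import tarfile


def count_responses(archive: bytes) -> int:
    """Count regular-file members in an uncompressed, in-memory tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tf:
        return sum(1 for member in tf.getmembers() if member.isfile())


def require_min_responses(archive: bytes, minimum: int) -> int:
    """Return the response count, or raise if the archive has too few."""
    found = count_responses(archive)
    if found < minimum:
        raise ValueError(
            f"expected at least {minimum} OCSP responses, found {found}"
        )
    return found


def build_archive(count: int) -> bytes:
    """Build a test archive with `count` placeholder responses (hypothetical)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for i in range(count):
            data = b"placeholder OCSP response"
            info = tarfile.TarInfo(name=f"response-{i:04d}.der")
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# A populated archive passes; an empty one is rejected before distribution.
populated_count = require_min_responses(build_archive(3), minimum=3)
try:
    require_min_responses(build_archive(0), minimum=1)
    empty_rejected = False
except ValueError:
    empty_rejected = True
```

A check like this runs on the final, concatenated archive, so it would also
catch the case where individual chunks are fine but the combined file is empty.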
> 
> _______________________________________________
> dev-security-policy mailing list
> dev-security-policy@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-security-policy

_______________________________________________
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy
