I wanted to follow up with our findings and a summary of this issue for the 
community. 

Below is a detailed account of what happened and how we resolved the issue. 
Hopefully this will explain what happened and help others avoid a similar 
issue.

Summary
-------
On January 19th at 08:40 UTC, a code push to improve OCSP generation for a 
subset of the Google operated Certificate Authorities was initiated. The change 
was related to the packaging of generated OCSP responses. The first time this 
change was invoked in production was January 19th at 16:40 UTC. 

NOTE: The publication of new revocation information to all geographies can take 
up to 6 hours to propagate. Additionally, clients and middle-boxes commonly 
implement caching behavior. This results in a large window where clients may 
have begun to observe the outage.

NOTE: Most modern web browsers “soft-fail” in response to OCSP server 
availability issues, masking outages. Firefox, however, supports an advanced 
option that allows users to opt-in to “hard-fail” behavior for revocation 
checking. An unknown percentage of Firefox users enable this setting. We 
believe most users who were impacted by the outage were these Firefox users.
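
The difference between the two behaviors can be sketched as a small decision 
function. This is an illustration only, not browser code: the OcspResult type 
and the allow_connection function are hypothetical names, and real clients 
consider more inputs (stapled responses, caches, timeouts).

```python
from enum import Enum


class OcspResult(Enum):
    GOOD = "good"                # responder says the certificate is not revoked
    REVOKED = "revoked"          # responder says the certificate is revoked
    UNAVAILABLE = "unavailable"  # no usable response (timeout, HTTP error, ...)


def allow_connection(result: OcspResult, hard_fail: bool) -> bool:
    """Return True if the client should proceed with the TLS connection."""
    if result is OcspResult.GOOD:
        return True
    if result is OcspResult.REVOKED:
        return False
    # No usable answer: soft-fail clients proceed, hard-fail clients refuse.
    return not hard_fail


for hard_fail in (False, True):
    verdict = allow_connection(OcspResult.UNAVAILABLE, hard_fail)
    print(f"hard_fail={hard_fail}: connection allowed = {verdict}")
```

Under this model, a responder that stops answering for a certificate is 
invisible to soft-fail clients and blocking for hard-fail clients.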

About 9 hours after the deployment of the change began (2018-01-20 01:36 UTC), 
a user on Twitter mentioned that they were having problems with their hard-fail 
OCSP checking configuration in Firefox when visiting Google properties. This 
tweet and the few that followed during the outage period were not noticed by 
any Google employees until after the incident’s post-mortem investigation had 
begun. 

About 1 day and 22 hours after the push was initiated (2018-01-21 15:07 UTC), a 
user posted a message to the mozilla.dev.security.policy mailing list 
mentioning that they too were having problems with their hard-fail 
configuration in Firefox when visiting Google properties.

About two days after the push was initiated, a Google employee discovered the 
post and opened a ticket (2018-01-21 16:10 UTC). This triggered the remediation 
procedures, which began in under an hour.

The issue was resolved about 2 days and 6 hours from the time it was introduced 
(2018-01-21 22:56 UTC). Once Google became aware of the issue, it took 1 hour 
and 55 minutes to resolve the issue, and an additional 4 hours and 51 minutes 
for the fix to be completely deployed.
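
The elapsed times quoted above can be checked with a few lines of date 
arithmetic. The sketch assumes the reference point is the first production 
invocation at 2018-01-19 16:40 UTC (rather than the 08:40 UTC code push), which 
is what the stated offsets imply; the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Assumed reference point: first production invocation of the change (UTC).
first_invoked = datetime.strptime("2018-01-19 16:40", FMT)

# Timestamps taken from the summary above (all UTC).
events = {
    "first tweet": "2018-01-20 01:36",
    "mailing list post": "2018-01-21 15:07",
    "ticket opened": "2018-01-21 16:10",
    "fix fully deployed": "2018-01-21 22:56",
}


def elapsed(ts: str) -> timedelta:
    """Time between the first production invocation and the given timestamp."""
    return datetime.strptime(ts, FMT) - first_invoked


for name, ts in events.items():
    print(f"{name}: {elapsed(ts)}")

# Time from ticket to full deployment should equal 1h55m (resolve) + 4h51m (deploy).
ticket_to_deployed = elapsed(events["fix fully deployed"]) - elapsed(events["ticket opened"])
assert ticket_to_deployed == timedelta(hours=1, minutes=55) + timedelta(hours=4, minutes=51)
```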

No customer reports regarding this issue were sent to the notification 
addresses listed in Google's CPSs or on the repository websites for the 
duration of the outage. This extended the duration of the outage. 

Background
----------
Google's OCSP Infrastructure works by generating OCSP responses in batches, 
with each batch being made up of the certificates issued by an individual CA.

In the case of GIAG2, this batch is produced in chunks of certificates issued 
in the last 370 days. For each chunk, the GIAG2 CA is asked to produce the 
corresponding OCSP responses, the results of which are placed into a separate 
.tar file.

The issuer of GIAG2 has chosen to issue new certificates to GIAG2 periodically; 
as a result, GIAG2 has multiple certificates. Two of these certificates no 
longer have unexpired certificates associated with them. As a result, and as 
expected, the CA does not produce responses for the corresponding periods.

All .tar files produced during this process are then concatenated with the 
--concatenate operation in GNU tar. This produces a single .tar file containing 
all of the OCSP responses for the given Certificate Authority. This .tar file 
is then distributed to our global CDN infrastructure for serving.

A change was made in how we batch these responses: instead of outputting many 
.tar files within a batch, a concatenation of all tar files was produced.

The change in question triggered an unexpected behaviour in GNU tar which then 
manifested as an empty tarball. These "empty" updates ended up being 
distributed to our global CDN, effectively dropping some responses, while 
continuing to serve responses for other CAs.
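
The exact GNU tar behaviour is not described here, so the following is only an 
illustration of why combining archives is fragile when some of them are empty. 
It uses Python's tarfile module rather than GNU tar, and relies on a general 
property of the tar format: every archive ends with an end-of-archive marker of 
zero-filled blocks, so a reader that sees the marker of an empty first archive 
may stop before reaching later members. The helper names and file names are 
hypothetical.

```python
import io
import tarfile


def make_chunk(files: dict[str, bytes]) -> bytes:
    """Build an uncompressed tar archive in memory from name -> content."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def member_names(archive: bytes, ignore_zeros: bool = False) -> list[str]:
    """List member names; ignore_zeros makes the reader skip zero blocks."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:",
                      ignore_zeros=ignore_zeros) as tar:
        return tar.getnames()


def merge_chunks(chunks: list[bytes]) -> bytes:
    """Merge archives member by member; empty chunks simply contribute nothing."""
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w") as merged:
        for chunk in chunks:
            with tarfile.open(fileobj=io.BytesIO(chunk), mode="r:") as src:
                for member in src:
                    merged.addfile(member, src.extractfile(member))
    return out.getvalue()


empty = make_chunk({})  # a period with no unexpired certificates
chunk_a = make_chunk({"serial-01.der": b"response-1"})
chunk_b = make_chunk({"serial-02.der": b"response-2"})

naive = empty + chunk_a + chunk_b  # plain byte concatenation
print(member_names(naive))                     # reader stops at the first marker
print(member_names(naive, ignore_zeros=True))  # reader skips zero blocks
print(member_names(merge_chunks([empty, chunk_a, chunk_b])))
```

Merging member by member avoids depending on how a given reader treats 
end-of-archive markers in the middle of a file.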

During testing of the change, this behaviour was not detected, as the tests did 
not cover the scenario in which some chunks did not contain unexpired 
certificates.

Findings
--------
- The outage only impacted sites with TLS certificates issued by the GIAG2 CA 
as it was the only CA that met the required pre-conditions of the bug. 
- The bug that introduced this failure manifested itself as an empty container 
of OCSP responses. The root cause of the issue was an unexpected behavior of 
GNU tar relating to concatenating tar files.
- The outage was observed by revocation service monitoring as “unknown 
certificate” (HTTP 404) errors. HTTP 404 errors are expected in OCSP responder 
operations; they typically are the result of poorly configured clients. These 
events are monitored and a threshold does exist for an on-call escalation.
- Due to a configuration error, the designated Google team did not receive an 
escalation message.
- External users did not use the contact details Google provided in the CPS.

Remediation Plan
----------------
- A bug fix has been applied to prevent the same issue from happening again.
- Test cases looking for a minimum number of OCSP responses in each tar were 
added to the test automation suites to catch similar issues in the future.
- The monitoring system that was misconfigured was updated to use the correct 
address for escalations.
- Both the Google Trust Services CPS (found on pki.goog) and the Google CPS 
(found on pki.google.com) have been updated to make it clear what email address 
is the most expedient path to reach the PKI team for non-security incidents.
- The Google PKI repository page was updated to show contact details in the 
same way the Google Trust Services repository page already did, in the hope of 
helping users find a path of escalation.
- The wizard that is returned for mails to the security email address has been 
updated to also include an explicit option for issues related to the “Google 
Certificate Authority” in the hopes of helping users who choose this path of 
escalation.
- Existing procedures that are relied upon for periodic verification of 
effective escalation have been updated to include unknown certificate checking.
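
A check of the kind described in the second item above (a minimum number of 
OCSP responses in each tar) might look like the following sketch. It is not the 
test Google added: the function names, the in-memory archives, and the minimum 
of 1 are assumptions chosen to keep the example self-contained.

```python
import io
import tarfile


def count_responses(archive: bytes) -> int:
    """Count non-empty regular files in an uncompressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return sum(1 for member in tar if member.isfile() and member.size > 0)


def check_minimum(archive: bytes, minimum: int) -> None:
    """Raise ValueError if the archive holds fewer responses than expected."""
    found = count_responses(archive)
    if found < minimum:
        raise ValueError(
            f"archive holds {found} responses, expected at least {minimum}"
        )


# Build an empty archive in memory to stand in for a bad batch.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w"):
    pass
empty_archive = buf.getvalue()

try:
    check_minimum(empty_archive, minimum=1)
except ValueError as err:
    print(f"refusing to publish: {err}")
```

Running such a check before distribution would stop an empty batch from 
replacing a good one, whatever the cause of the empty output.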

_______________________________________________
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy
