m.d.s.p community, Google Trust Services just filed
https://bugzilla.mozilla.org/show_bug.cgi?id=1630040 which contains the
same information as the report that follows.

From 2020-04-08 16:25 UTC to 2020-04-09 05:40 UTC, Google Trust Services'
EJBCA-based CAs (GIAG4, GIAG4ECC, GTSY1-4) served empty OCSP data which led
the OCSP responders to return unauthorized.

These CAs exist for issuance of custom certificate profiles and
certificates for test sites for inactive roots. Our primary CAs (GTS CA 1O1
and GTS CA 1D2) were unaffected. The problem self-corrected, but we have
added safeguards to prevent recurrence.

1. How your CA first became aware of the problem (e.g. via a problem report
submitted to your Problem Reporting Mechanism, a discussion in
mozilla.dev.security.policy, a Bugzilla bug, or internal self-audit), and
the time and date.

Monitoring detected the issue on 2020-04-08 at 16:35 UTC. The root cause
was identified within hours. The issue was remediated automatically by the
next cycle of response generation and push to the CDN, while debugging and
fixes were still ongoing.

2. A timeline of the actions your CA took in response. A timeline is a
date-and-time-stamped sequence of all relevant events. This may include
events before the incident was reported, such as when a particular
requirement became applicable, or a document changed, or a bug was
introduced, or an audit was done.

2020-04-08, 11:29 UTC - Scheduled system update begins
2020-04-08, 14:00 UTC - Incorrect OCSP archives are generated
2020-04-08, 15:03 UTC - Scheduled system update concludes
2020-04-08, 16:20 UTC - Incorrect OCSP responses pushed to CDN
2020-04-08, 16:35 UTC - First production monitoring alert fires
2020-04-08, 22:00 UTC - Correct OCSP archives are generated automatically
2020-04-09, 00:20 UTC - Correct OCSP responses pushed to CDN
2020-04-09, 05:40 UTC - Monitoring confirms all probes are passing

3. Whether your CA has stopped, or has not yet stopped, issuing
certificates with the problem. A statement that you have will be considered
a pledge to the community; a statement that you have not requires an
explanation.

The affected CAs are used only for infrequent, manual issuance of custom
certificates. The only certificate issued during this period was a manually
issued post-update test certificate used to validate the upgrade to resolve
the issue. The issue was also specific to refreshing OCSP responses, not to
certificate issuance.

4. A summary of the problematic certificates. For each problem: number of
certs, and the date the first and last certs with that problem were issued.

No certificates were issued during this period aside from a manually issued
post-update test certificate used to validate the upgrade to resolve the
issue. The test certificate was a valid and fully compliant issuance.

5. The complete certificate data for the problematic certificates. The
recommended way to provide this is to ensure each certificate is logged to
CT and then list the fingerprints or crt.sh IDs, either in the report or as
an attached spreadsheet, with one list per distinct problem.

No certificates were issued aside from the manually issued post-update test
certificate used to validate the upgrade.

6. Explanation about how and why the mistakes were made or bugs introduced,
and how they avoided detection until now.

Our process for creating OCSP responses and packaging them for serving is
designed to fail if any sub-command fails, using set -e. However, if a
function call is part of an AND or OR list (i.e., it uses the '&&' or '||'
control operators), set -e is suppressed inside the function.

The tool we use to fetch OCSP responses from EJBCA correctly returned a
non-zero exit code (due to no OCSP responses being generated because EJBCA
was not running), but because it was called inside a function with its own
error handling (using && syntax), the script continued without handling the
error properly and wrongly used empty tar.gz files with no responses in
them. The bug had existed for multiple years as a potential race condition
and we did not encounter it previously.
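
The suppression described above can be reproduced in a few lines of bash.
This is a minimal sketch, not the production script: package_responses is a
hypothetical stand-in for the packaging step, and `false` simulates the
fetch tool exiting non-zero.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the packaging step. `false` simulates the
# fetch tool returning a non-zero exit code.
package_responses() {
  false
  echo "archive written"
}

# Plain call: set -e is honored inside the function, so the subshell
# exits at `false` and the echo never runs.
plain_output=$(set -e; package_responses)
plain_status=$?

# Call on the left of '&&': set -e is ignored for every command inside
# the function, so execution continues past `false`.
list_output=$(set -e; package_responses && true)
list_status=$?

echo "plain: status=$plain_status output='$plain_output'"
echo "list:  status=$list_status output='$list_output'"
```

In the plain call the function stops at the failing command and produces no
output. In the '&&' call the function runs to completion and reports
success, which is the pattern that let the script continue with empty
archives.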

Quality tests are executed before publication to the CDN. However, those
tests accept empty responses as a valid condition because it is something
that can and does happen.

This condition did not repeat on the following update of the OCSP
responses, so that update resolved the issue. Our monitoring caught the
issue, enabling prompt root cause analysis and resolution.

7. List of steps your CA is taking to resolve the situation and ensure such
issuance will not be repeated in the future, accompanied with a timeline of
when your CA expects to accomplish these things.

No certificate issuance aside from a valid manually issued post update test
certificate to validate the upgrade took place during this period.

The logic error that led to incorrect OCSP responses being served has been
corrected, and the fix is checked in and in production. Additionally, checks
have been added to ensure that bad data cannot replace known good data.
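
As an illustration of that kind of check, here is a minimal bash sketch. It
is an assumption about what such a guard could look like, not the check that
was deployed: the function names are hypothetical, and "bad data" is assumed
here to mean an empty or unreadable archive replacing one that currently
holds responses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the tar.gz at $1 is readable
# and lists at least one member.
archive_has_entries() {
  local listing
  listing=$(tar -tzf "$1" 2>/dev/null) || return 1
  [ -n "$listing" ]
}

# Hypothetical guard: refuse to overwrite a non-empty current archive
# with a candidate that is empty or unreadable. An empty candidate may
# still replace an empty or missing current archive.
promote_archive() {
  local candidate=$1 current=$2
  if [ -e "$current" ] && archive_has_entries "$current" \
     && ! archive_has_entries "$candidate"; then
    echo "refusing to replace $current with empty archive $candidate" >&2
    return 1
  fi
  mv -f "$candidate" "$current"
}
```

Because the guard returns non-zero, a caller should invoke it as a plain
command rather than inside an '&&' or '||' list if it relies on set -e to
stop the run.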

We reviewed all existing monitoring of response generation and publishing
and found no gaps.

A review of similar code has also been conducted to ensure we do not have
other instances where similar logic could incorrectly suppress errors.

The only certificates under these CAs that are both unexpired and revoked
are used by our six demo sites.

Users or automation using these sites for testing may have interpreted the
unauthorized responses to mean these revoked demo certificates were to be
considered valid during the window in which bad data was served.

The issue was limited to OCSP handling and CRL data was correct during the
same period.

No additional improvements are outstanding at this time.

--
Andy Warner
Google Trust Services
