On Wed, Dec 5, 2018 at 7:53 AM Wojciech Trapczyński <wtrapczyn...@certum.pl>
wrote:

> Ryan, thank you for your comment. The answers to your questions below:
>

Again, thank you for filing a good post-mortem.

I want to call out a number of positive things here rather explicitly, so
that this can hopefully serve as an illustration for other CAs in the future:
* The timeline included timestamps, as requested and required, which help
provide a picture of how responsive the CA is
* It includes the details about the steps the CA actively took during the
investigation (e.g. within 1 hour, 50 minutes, the initial cause had been
identified)
* It demonstrates an approach that triages (10.11.2018 12:00), mitigates
(10.11.2018 18:00), and then further investigates (11.11.2018 07:30) the
holistic system. Short-term steps are taken (11.11.2018 19:30), followed by
longer term steps (19.11.2018)
* It provides rather detailed data about the problem, how the problem was
triggered, the scope of the impact, why it was possible, and what steps are
being taken.

That said, I can't say positive things without highlighting opportunities
for improvement:
* It appears you were aware of the issue beginning on 10.11.2018, but the
notification to the community was not until 03.12.2018 - that's a
significant gap. I see Wayne already raised it in
https://bugzilla.mozilla.org/show_bug.cgi?id=1511459#c1 and that has been
responded to in https://bugzilla.mozilla.org/show_bug.cgi?id=1511459#c2
* It appears, based on that bug and related discussion (
https://bugzilla.mozilla.org/show_bug.cgi?id=1511459#c2 ), that between
10.11.2018 01:05 (UTC±00:00) and 14.11.2018 07:35 (UTC±00:00) an invalid
CRL was being served. That seems relevant for the timeline, as it speaks to
the period of CRL non-compliance. In this regard, I think we're talking
about two different BR "violations" that share the same root cause
- a set of invalid certificates being published and a set of invalid CRLs
being published. Of these two, the latter is far more impactful than the
former, but it's unclear whether the report was being made for the former
(certificates) or the latter (CRLs).

Beyond that, a few selected remarks below.


> There are two things here: how we monitor our infrastructure and how our
> software operates.
>
> Our system for issuing and managing certificates and CRLs has a module
> responsible for monitoring any issue which may occur while generating a
> certificate or CRL. The main task of this module is to inform us that
> "something went wrong" during the process of issuing a certificate or CRL.
> In this case we got a notification that several CRLs had not been
> published. This monitoring did not inform us about the corrupted signature
> in one CRL. It only indicated that there were some problems with CRLs. To
> identify the source of the problem, human action was required.
>

Based on your timeline, it appears the issue was introduced at 10.11.2018
01:05 and not alerted on until 10.11.2018 10:10. Is that correct? If so,
can you speak to the reason for the delay between the issue and the alert,
and to what the target delay is with the improvements you're making?
Understanding that alerting is about finding a balance between signal and
noise, it does seem like a rather large gap. It may be that this gap
reflects 'on-call' or 'business hours' coverage, it may be due to a
threshold in the number of failures, it may have been some other cause,
etc. Understanding a bit more would help here.


> Additionally, we have the main monitoring system with thousands of tests
> of the whole infrastructure. For example, in the case of CRLs we have
> tests like checking the HTTP status code, checking the download time,
> checking the NextUpdate date, and others. After the incident we added
> tests which allow us to quickly detect CRLs published with an invalid
> signature (we are using a simple OpenSSL-based script).
>

So, this is an example of a good response. It includes a statement that
requires trust ("we have ... thousands of tests"), but then provides
examples that demonstrate an understanding and awareness of the potential
issues.
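For the sake of illustration, a check along those lines might look
something like the sketch below. The URL, file paths, and reliance on GNU
date are my own assumptions for the example, not details from the report:

  #!/bin/sh
  # Sketch of a periodic CRL health check: fetch, verify signature, check freshness.
  CRL_URL="http://crl.example.com/ca.crl"   # hypothetical distribution point
  ISSUER_CERT="issuing-ca.pem"              # certificate of the CRL issuer

  # 1. Fetch the CRL; curl -f turns any non-2xx HTTP status into a failure.
  curl -sSf -o /tmp/current.crl "$CRL_URL" || { echo "CRL fetch failed"; exit 1; }

  # 2. Verify the signature; with -CAfile, openssl reports "verify OK" on success.
  openssl crl -inform DER -in /tmp/current.crl -CAfile "$ISSUER_CERT" -noout 2>&1 \
    | grep -q "verify OK" || { echo "CRL signature invalid"; exit 1; }

  # 3. Confirm nextUpdate is still in the future (freshness), using GNU date.
  next=$(openssl crl -inform DER -in /tmp/current.crl -noout -nextupdate | cut -d= -f2)
  [ "$(date -d "$next" +%s)" -gt "$(date +%s)" ] || { echo "CRL is stale"; exit 1; }

  echo "CRL OK"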

Separate from the incident report, I think publishing or providing details
about these tests could be a huge benefit to the community, with an ideal
outcome of codifying them all as requirements that ALL CAs should perform.
This is where we go from "minimum required" to "best practice". It sounds
like y'all are operating at a level that seeks to capture the spirit and
intent, not just the letter, and that's exactly the kind of practice that's
ideal to codify as a requirement.


> As I described in the incident report, we have also improved the part of
> the signing module responsible for signature verification, because at the
> time of the failure it did not work properly.
>

This is an area where I think more detail could help. Understanding what
caused it to "not work properly" seems useful in understanding the issues
and how to mitigate them. For example, it could be that "it did not work
properly" because it was never configured to be enabled, or because a bug
was introduced and the code is not tested, or really any other sort of
explanation. Understanding why it didn't work and how it's been improved
helps everyone understand and, hopefully, operationalize best practices.
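To make that concrete, a pre-publication gate is the kind of control that
catches this class of failure before a bad CRL ever reaches the
distribution point. A minimal sketch, with purely hypothetical paths and
names rather than anything from the report, could be as simple as:

  #!/bin/sh
  # Hypothetical gate: refuse to publish a freshly generated CRL unless its
  # signature verifies against the issuing CA certificate.
  NEW_CRL="/var/ca/outbox/ca.crl"           # hypothetical output of the signing module
  ISSUER_CERT="/var/ca/issuing-ca.pem"      # hypothetical issuer certificate
  PUBLISH_DIR="/var/www/crl"                # hypothetical distribution point

  if openssl crl -inform DER -in "$NEW_CRL" -CAfile "$ISSUER_CERT" -noout 2>&1 \
       | grep -q "verify OK"; then
    cp "$NEW_CRL" "$PUBLISH_DIR/"
  else
    echo "Refusing to publish CRL with an invalid signature" >&2
    exit 1
  fi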