On 05.12.2018 21:26, Ryan Sleevi wrote:
On Wed, Dec 5, 2018 at 7:53 AM Wojciech Trapczyński <wtrapczyn...@certum.pl>
wrote:

Ryan, thank you for your comments. The answers to your questions are below:

Again, thank you for filing a good post-mortem.

I want to call out a number of positive things here rather explicitly, so
that it hopefully can serve as a future illustration from CAs:
* The timestamp included the times, as requested and required, which help
provide a picture as to how responsive the CA is
* It includes the details about the steps the CA actively took during the
investigation (e.g. within 1 hour, 50 minutes, the initial cause had been
identified)
* It demonstrates an approach that triages (10.11.2018 12:00), mitigates
(10.11.2018 18:00), and then further investigates (11.11.2018 07:30) the
holistic system. Short-term steps are taken (11.11.2018 19:30), followed by
longer term steps (19.11.2018)
* It provides rather detailed data about the problem, how the problem was
triggered, the scope of the impact, why it was possible, and what steps are
being taken.

That said, I can't say positive things without highlighting opportunities
for improvement:
* It appears you were aware of the issue beginning on 10.11.2018, but the
notification to the community was not until 03.12.2018 - that's a
significant gap. I see Wayne already raised it in
https://bugzilla.mozilla.org/show_bug.cgi?id=1511459#c1 and that has been
responded to in https://bugzilla.mozilla.org/show_bug.cgi?id=1511459#c2
* It appears, based on that bug and related discussion (
https://bugzilla.mozilla.org/show_bug.cgi?id=1511459#c2 ), that from
10.11.2018 01:05 (UTC±00:00) to 14.11.2018 07:35 (UTC±00:00) an invalid
CRL was being served. That seems relevant for the timeline, as it speaks to
the period of CRL non-compliance. In this regard, I think we're talking
about two different BR "violations" that share the same incident root cause
- a set of invalid certificates being published and a set of invalid CRLs
being published. Of these two, the latter is far more impactful than the
former, but it's unclear based on the report if the report was being made
for the former (certificates) rather than the latter (CRLs)

Beyond that, a few selected remarks below.


There are two things here: how we monitor our infrastructure and how our
software operates.

Our system for issuing and managing certificates and CRLs has a module
responsible for monitoring any issue which may occur while generating a
certificate or CRL. The main task of this module is to inform us that
"something went wrong" during the process of issuing a certificate or CRL.
In this case we got a notification that several CRLs had not been
published. This monitoring did not inform us about the corrupted signature
in one CRL; it only indicated that there were some problems with CRLs.
Identifying the source of the problem required human action.

Based on your timeline, it appears the issue was introduced at 10.11.2018
01:05 and not alerted on until 10.11.2018 10:10. Is that correct? If so,
can you speak to why the delay between the issue and notification, and what
the target delay is with the improvements you're making? Understanding that
alerting is finding a balance between signal and noise, it does seem like a
rather large gap. It may be that this gap is reflective of 'on-call' or
'business hours', it may be a threshold in the number of failures, it may
have been some other cause, etc. Understanding a bit more can help here.



Yes, that is correct. The monitoring system that we are using in our software for issuing and managing certificates and CRLs has no notification feature. Reviewing events from it is part of the procedure; in other words, detecting any issue in this monitoring requires human action. That is why we detected this issue with some delay.

Therefore, we have added tests to our main monitoring system, and we now receive a notification within 5 minutes of the event occurring.

Additionally, we have a main monitoring system with thousands of tests
covering the whole infrastructure. For example, in the case of CRLs we have
tests that check the HTTP status code, the download time, the NextUpdate
date, and others. After the incident we added tests which allow us to
quickly detect CRLs published with an invalid signature (we are using a
simple OpenSSL-based script).
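
For illustration, here is a minimal sketch of that kind of CRL check. The actual script is described only as OpenSSL based; this version is written in Python with the 'cryptography' package, and the CRL URL and issuer file name are placeholders rather than real values.

#!/usr/bin/env python3
# Illustrative CRL check: fetch a CRL, confirm it is still fresh, and
# confirm its signature verifies against the issuing CA's certificate.
import sys
import urllib.request
from datetime import datetime

from cryptography import x509

CRL_URL = "http://crl.example.com/ca.crl"   # placeholder distribution point
ISSUER_CERT = "issuer.pem"                  # placeholder issuing-CA certificate


def fetch_crl(url):
    raw = urllib.request.urlopen(url, timeout=30).read()
    try:
        return x509.load_der_x509_crl(raw)  # CRLs are usually served as DER
    except ValueError:
        return x509.load_pem_x509_crl(raw)


def main():
    crl = fetch_crl(CRL_URL)
    with open(ISSUER_CERT, "rb") as f:
        issuer = x509.load_pem_x509_certificate(f.read())

    ok = True
    # Freshness: NextUpdate must still be in the future.
    if crl.next_update is None or crl.next_update <= datetime.utcnow():
        print("FAIL: CRL has expired or carries no NextUpdate")
        ok = False
    # Integrity: the signature must verify with the issuer's public key.
    # This is the check that catches a corrupted signature.
    if not crl.is_signature_valid(issuer.public_key()):
        print("FAIL: CRL signature does not verify")
        ok = False

    print("OK" if ok else "NOT OK")
    return 0 if ok else 1


if __name__ == "__main__":
    sys.exit(main())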

So, this is an example of a good response. It includes a statement that
requires trust ("we have ... thousands of tests"), but then provides
examples that demonstrate an understanding and awareness of the potential
issues.

Separate from the incident report, I think publishing or providing details
about these tests could be a huge benefit to the community, with an ideal
outcome of codifying them all as requirements that ALL CAs should perform.
This is where we go from "minimum required" to "best practice", and it
sounds like y'all are operating at a level that seeks to capture the spirit
and intent, and not just the letter, and that's the kind of ideal
requirement to codify and capture.



We are using Zabbix – The Enterprise-Class Open Source Network Monitoring Solution (https://www.zabbix.com/).

For ordinary tests we are using functions built into Zabbix, for example:

- "Simple checks" for monitor things like ICMP ping, TCP/UDP service availability; - "Web scenarios" for monitor things like HTTP response status code, download speed, response time.

For the uncommon tests that every CA has to deal with, we are using our own scripts embedded in Zabbix. For all tests we have defined sets of triggers that fire the appropriate actions.
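
As an illustration of how such a custom check can be embedded, the sketch below shows a script that Zabbix could run via an agent UserParameter or as an external check. The item key, file paths and the reliance on the openssl command line are assumptions for the example, not the actual configuration.

#!/usr/bin/env python3
# Illustrative Zabbix-style check. Zabbix stores whatever the script
# prints, so the script emits 1 (signature verifies) or 0 (it does not)
# and a trigger can alert whenever the last value is 0. A hypothetical
# agent configuration line could look like:
#   UserParameter=crl.signature.valid[*],/usr/local/bin/check_crl.py "$1" "$2"
import subprocess
import sys


def crl_signature_ok(crl_path, issuer_path):
    # "openssl crl -CAfile <issuer> -noout" reports "verify OK" when the
    # CRL signature verifies against the issuing CA (DER input assumed here).
    result = subprocess.run(
        ["openssl", "crl", "-in", crl_path, "-inform", "DER",
         "-CAfile", issuer_path, "-noout"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and "verify OK" in (result.stdout + result.stderr)


if __name__ == "__main__":
    crl_file, issuer_file = sys.argv[1], sys.argv[2]
    print(1 if crl_signature_ok(crl_file, issuer_file) else 0)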

Of course, we are willing to share the details of our tests as part of creating best practices that all CAs should follow. I suspect that many CAs have similar tests in their infrastructure, and sharing them will be valuable for everyone.

As I described in the incident report, we have also improved the part of
the signing module responsible for signature verification, because at the
time of the failure it did not work properly.

This is an area where I think more detail could help. Understanding what
caused it to "not work properly" seems useful in understanding the issues
and how to mitigate. For example, it could be that "it did not work
properly" because "it was never configured to be enabled", it could be that
"it did not work properly" because "a bug was introduced and the code is
not tested", or.. really, any sort of explanation. Understanding why it
didn't work and how it's been improved helps everyone understand and,
hopefully, operationalize best practices.


The technical issue was an incorrect calculation of the hash of the object. Unfortunately, this incorrect hash calculation was also used during verification of the signature, so the corrupted signatures were not detected. As I described in the incident report, the tests of this software did not cover creating a signature over such a large CRL, and for that reason the bug avoided detection until now.
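
To make this failure mode concrete, here is a toy sketch in Python. It assumes the defect was a size-related truncation in the hash calculation (the report does not state the exact nature of the bug); the point is only that a verification step which reuses the same flawed calculation cannot see the corruption, while an independent calculation can.

import hashlib

MAX_BUF = 64 * 1024  # hypothetical buffer limit standing in for the "large CRL" case


def buggy_digest(data):
    # Hypothetical bug: only the first MAX_BUF bytes are hashed, so two
    # large objects that differ beyond that point get the same digest.
    return hashlib.sha256(data[:MAX_BUF]).digest()


def correct_digest(data):
    return hashlib.sha256(data).digest()


large_crl = b"\x30" * (2 * MAX_BUF)       # stand-in for a large DER-encoded CRL
corrupted = large_crl[:-1] + b"\x00"      # corruption beyond the buffer limit

# The "signature" is reduced to the digest to keep the sketch focused on hashing.
signed_digest = buggy_digest(large_crl)

# In-module verification that reuses the buggy digest: corruption is invisible.
print(buggy_digest(corrupted) == signed_digest)                # True  -> missed

# Independent verification with a correct digest: corruption is caught.
print(correct_digest(corrupted) == correct_digest(large_crl))  # False -> detected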

The first thing we fixed was the signing module itself. We made changes that allow us to correctly sign large objects and to verify their signatures in the correct way. Then, to eliminate the risk, we decided to add signature verification in another component of our system. This gives us confidence that even if the signing module fails once again, we will not repeat the same mistake, and all invalid certificates or CRLs will be blocked.
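
A minimal sketch of what such an independent pre-publication gate can look like, assuming the Python 'cryptography' package and illustrative names rather than the actual implementation:

from cryptography import x509


class PublicationBlocked(Exception):
    """Raised when an object fails the independent signature check."""


def gate_crl(der_crl, issuer_cert_pem):
    # Re-verify the CRL signature with a second, independent implementation
    # before the object is handed to the publishing component.
    crl = x509.load_der_x509_crl(der_crl)
    issuer = x509.load_pem_x509_certificate(issuer_cert_pem)
    if not crl.is_signature_valid(issuer.public_key()):
        raise PublicationBlocked("CRL signature failed independent verification")
    return der_crl  # only verified objects reach the CRL distribution point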
