Re: CA disclosure of revocations that exceed 5 days [Was: Re: Incident report D-TRUST: syntax error in one tls certificate]

Dimitris Zacharopoulos via dev-security-policy Fri, 30 Nov 2018 01:25:00 -0800


On 30/11/2018 1:49 AM, Ryan Sleevi wrote:

On Thu, Nov 29, 2018 at 4:03 PM Dimitris Zacharopoulos via dev-security-policy <dev-security-policy@lists.mozilla.org> wrote:
    I didn't want to hijack the thread so here's a new one.


    Times and circumstances change.


You have to demonstrate that.


It's self-proved :-)

    When I brought this up at the Server
    Certificate Working Group of the CA/B Forum
    (https://cabforum.org/pipermail/servercert-wg/2018-September/000165.html),
    there was no open disagreement from CAs.

Look at the discussion during Wayne’s ballot. Look at the discussion back when it was Jeremy’s ballot. The proposal was as simplified as could be - modeled after 9.16.3 of the BRs. It would have allowed for a longer period - NOT an unbounded period, which is grossly negligent for publicly trusted CAs.


Agreed.

    However, think about CAs that decide to extend the 5-days (at their
    own risk) because of extenuating circumstances. Doesn't this
    community want to know what these circumstances are and evaluate the
    gravity (or not) of the situation? The only way this could happen in
    a consistent way among CAs would be to require it in some kind of
    policy.

This already happens. This is a matter of the CA violating any contracts or policies of the root store it is in, and is already being handled by those root stores - e.g. misissuance reports. What you’re describing as a problem is already solved, as are the expectations for CAs - that violating requirements is a path to distrust.

The only “problem” you’re solving is giving CAs more time, and there is zero demonstrable evidence, to date, about that being necessary or good - and rich and ample evidence of it being bad.

I already mentioned that this is separate from the incident report (of the actual mis-issuance). We have repeatedly seen post-mortems that say that for some specific cases (not all of them), the revocation of certificates will require more time. Even the underscore revocation deadline creates problems for some large organizations as Jeremy pointed out. I understand the compatibility argument and CAs are doing their best to comply with the rules but you are advocating there should be no exceptions and you say that without having looked at specific evidence that would be provided by CAs asking for exceptions. You would rather have Relying Parties lose their internet services from one of the Fortune 500 companies. As a Relying Party myself, I would hate it if I couldn't connect to my favorite online e-shop or bank or webmail. So I'm still confused about which Relying Party we are trying to help/protect by requiring the immediate revocation of a Certificate that has 65 characters in the OU field.

I also see your point that "if we start making exceptions..." it's too risky. I'm just suggesting that there should be some tolerance for extended revocations (to help with collecting more information) which doesn't necessarily mean that we are dealing with a "bad" CA. I trust the Mozilla module owner's judgement to balance that. If the community believes that this problem is already solved, I'm happy with that :)

    > Phrased differently: You don't think large organizations are
    > currently capable, and believe the rest of the industry should
    > accommodate that.

    "Tolerate" would probably be the word I'd use instead of
    "accommodate".

I chose accommodate, because you’d like the entire world to take on systemic risk - and it is indeed systemic risk, to users especially - to benefit some large companies.

Why stop with revocation, though? Why not just let CAs define their own validation methods if they think they’re equivalent? After all, if we can trust CAs to make good judgements on revocation, why can’t we also trust them with validation? Some large companies struggle with our existing validation methods, why can’t we accommodate them?

That’s exactly what one of the arguments against restricting validation methods was.

As I said, I think this discussion will not accomplish anything productive without a structured analysis of the data. Not anecdata from one or two incidents, but holistic - because for every 1 real need, there may have been 9,999 unnecessary delays in revocation with real risk.

How do CAs provide this? For *all* revocations, provide meaningful data. I do not see there being any value to discussing further extensions until we have systemic transparency in place, and I do not see any good coming from trying to change at the same time as placing that systemic transparency in place, because there’s no way to measure the (negative) impact such change would have.

I don't see how data and evidence for "all revocations" somehow makes things better, unless I misunderstood your proposal. It's not a balanced request. It would be a huge effort for CAs to write risk assessment reports for each revocation. Why not focus on the rare cases which justify the extra effort from CAs to write a disclosure letter requesting more days for revocation? Why not add some rules on what's the minimum information that's expected for these cases? If you want this to be part of the incident report, that's fine.

The systemic transparency you are asking for, as I understand it, would be m.d.s.p. We already see incident reports being published here. CAs who seek more than 5 days for revoking affected certificates would disclose more details about the specifics of these revocations.

    >
    > Do you believe these organizations could respond within 5 days if
    > their internet connectivity was lost?

    I think there is different impact. Losing network connectivity would
    have "real" and large (i.e. all RPs) impact compared to installing a
    certificate with -say- 65 characters in the OU field which may cause
    very few problems to some RPs that want to use a certain web site.

So you do believe organizations are capable of making timely changes when necessary, and thus we aren’t discussing capabilities, but perceived necessity. And because some organizations have been misled as to the role of CAs, and thus don’t feel it’s necessary, don’t feel they should have to use that capability.

I’m not terribly sympathetic to that at all. As you mention, they can respond when all RPs are affected, so they can respond when their certificate is misissued and thus revoked.

    You describe it as a black/white issue. I understand your argument
    that other control areas will likely have issues but it always comes
    down to what impact and what damage these failed controls can
    produce. Layered controls and compensating controls in critical
    areas usually lower the risk of severe impact. The Internet is
    probably safe and will not break if for example a certificate with
    65-character OU is used on a public web site. It's not the same as a
    CA issuing SHA1 Certificates with collision risk.

It absolutely is, and we have seen this time and time again. The CAs most likely to argue the position you’re taking are the CAs that have had the most issues.

Do we agree, at least, that any CA violating the BRs or Root Policies puts the Internet ecosystem at risk?

It seems the core of your argument is how much risk should be acceptable, and the answer is none. Zero. The point of postmortems is to get us to a point where, as an industry, we’ve taken every available step to reduce and eliminate that risk, by learning from our collective mistakes. Lives and businesses are on the line - a single mistake can cost billions - and there’s no excuse for just shrugging and saying “well, yanno, there’s risk and there’s risk”.

CAs are evaluated using schemes based on Risk Management. There is no zero risk. It's like saying there is 100% security. You can add controls to minimize risk to acceptable levels. Even when mitigations are added, you have residual risk. However, layered controls and compensating controls help to avoid disasters. I just don't believe it's black or white and I think the module owners probably agree with that statement (https://groups.google.com/d/msg/mozilla.dev.security.policy/tbSkcGHg1kA/CkrM6taBAwAJ). If that was the case, every single BR violation or Root Policy violation would be treated as a trigger for a complete distrust.

Go read https://zakird.com/papers/zlint.pdf to see a systemic, thorough, analysis that supports what I described to you, and disagrees with your framing. We know what the warning signs are - and it’s continued framing of “low” risk that collectively presents “severe” risk.

I wasn't aware of that paper, it contains valuable information, thank you for sharing. Notice the abstract that says "We find that the number of errors has drastically reduced since 2012. In 2017, only 0.02% of certificates have errors". To me, this is a positive indicator that the ecosystem is continuously improving.




    >
    > Second, it presumes (incorrectly) that interoperability is not
    > something valuable. That is, if say the three existing, most
    > popular implementations all do not check whether or not it's longer
    > than 64 characters (for example), and a fourth implementation would
    > like to come along, they cannot read the relevant standards and
    > implement something interoperable. This is because
    > 'interoperability' is being redefined as 'ignoring' the standard -
    > which defeats the purposes of standards to begin with. These
    > choices - to permit deviations - create risks for the entire
    > ecosystem, because there's no longer interoperability. This is
    > equally captured in
    > https://tools.ietf.org/html/draft-iab-protocol-maintenance-01
    >
    > The premise to all of this is that "CAs shouldn't have to follow
    > rules, browsers should just enforce them," which is shocking and
    > unfortunate. It's like saying "It's OK to lie about whatever you
    > want, as long as you don't get caught" - no, that line of thinking
    > is just as problematic for morality as it is for technical
    > interoperability. CAs that routinely violate the standards create
    > risk, because they have full trust on the Internet. If the argument
    > is that the CA's actions (of accidentally or deliberately
    > introducing risk) are the problem, but that we shouldn't worry
    > about correcting the individual certificate, that entirely misses
    > the point that without correcting the certificate, there's zero
    > incentive to actually follow the standards, and as a result, that
    > creates risk for everyone. Revocation, if you will, is the "less
    > worse" alternative to complete distrust - it only affects that
    > single certificate, rather than every one of the certificates the
    > CA has issued. The alternative - not revoking - simply says that
    > it's better to look at distrust options, and that's more risk for
    > everyone.
    >

    I absolutely agree that interoperability is something valuable that
    should be pursued by the ecosystem. Browsers and the majority of CAs
    work in that direction. It's just the fact that if a browser strictly
    enforces a requirement from a standard (e.g. rejects a certificate
    that has an OU field with more than 64 characters), it makes a huge
    difference towards the goal for interoperability compared to a CA
    that just issues certificates with a maximum of 64 characters in the
    OU. If browsers enforced these rules, the difference would be so big
    that the problematic certificate would be immediately discovered by
    the Subscriber, who would complain to the CA and the Certificate
    would most likely be revoked immediately since it wouldn't be usable.
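
To make the 64-character example concrete, here is a minimal sketch of the kind of length check that either a CA's pre-issuance linter or a strict client could apply. The function name and sample values are hypothetical and are not taken from any real linter or browser; the only fact borrowed from the standard is the upper bound of 64 characters for organizationalUnitName in RFC 5280 (ub-organizational-unit-name).

```python
# Upper bound for organizationalUnitName from RFC 5280, Appendix A
# (ub-organizational-unit-name). Everything else here is illustrative.
UB_ORGANIZATIONAL_UNIT_NAME = 64


def ou_within_bound(ou: str) -> bool:
    """Return True if the OU value has between 1 and 64 characters."""
    return 1 <= len(ou) <= UB_ORGANIZATIONAL_UNIT_NAME


# Hypothetical sample values: one at the limit, one a single character over.
ok_ou = "A" * 64
bad_ou = "A" * 65

print(ou_within_bound(ok_ou))   # True
print(ou_within_bound(bad_ou))  # False
```

The check is the same whichever side runs it; the disagreement in this thread is about who should run it and what happens to certificates that already fail it.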

I literally provided you an explanation for why what you’re describing is problematic and unreasonable. Please do re-read it. In a new system, sure, that’d be great - but the existing system absolutely penalizes first movers.

Look at SC12 as an example. CAs would really like browsers to make that change, because then they can have their customers blame browsers for their misissuance. The customer is not going to say “Guess I should replace my cert”, but rather, blame the browser. The links I provided showed how CAs’ widespread disregard for the standards created real compatibility and security issues - and a browser just rejecting them doesn’t actually fix it, because the site says “well, works in other browsers, so the bug must be the browser’s, not mine.”

I have listened to this argument before but unfortunately it leads nowhere. How badly are we interested in interop to justify being "the bad guys" and how "disruptive" will our actions be for Relying Parties? It is a very difficult problem to solve but the ecosystem has made progress:

- disclosure of intermediate CA Certificates
- identifying and fixing problematic OCSP responders
- increased supervision to the issued certificates with CT and linters providing public information about mis-issuances
- browsers enforcing BR requirements with code (e.g. certificate validity duration)

With these controls in place, CAs are very much obligated to follow the rules or face the consequences. Browsers use telemetry to detect violations of the standards and create plans on addressing those issues. These plans usually include discussions in m.d.s.p. or the CA/B Forum in order for the CAs to participate and create the necessary rules -along with the browsers- to address these incompatibilities.


    What I meant to say in my original argument is that the "damage"
    created by a certificate that fails to strictly comply with RFC5280
    and the rest of the X.* standards, as long as popular browsers
    "allow it", is primarily an issue between a Subscriber (that
    maintains a web site), and the particular Relying Parties that want
    to establish a secure connection to that web site. That's not the
    entire Internet. This is why I compared it with "a situation where a
    site operator forgets to send the intermediate CA Certificate in the
    chain. These particular RPs will fail to get TLS working when they
    visit the Subscriber's web site".

It’s a perfect example of why your argument DOESN’T work. As Mozilla has shared in the CA/B Forum, people don’t fix their site - they blame the browser, and keep on with the brokenness. Firefox is the one having to change to “accommodate” that.

Or, they might blame the CA for providing them a "thing" that doesn't work with all major browsers :)

    Perhaps I have misunderstood your argument but when we are
    discussing revocation timelines, it looks a little extreme to say
    that a CA claiming "some important reasons" (I'm not saying if they
    are valid reasons or not) for delaying a certificate revocation has
    zero incentive to follow the standards.

It isn’t extreme, because even the incident reports from 2014/2015 show exactly this argument being made. Your arguments themselves continue to show that, by suggesting that “only” the site is impacted. And yet, if every site is doing it because “only” that site is impacted, you have the whole ecosystem doing it.

This myopic view of trying to assess per-Certificate is inherently non-scalable. You haven’t actually proposed any way to address that. What happens when a CA is doing 100 “exceptional” non-revocations? What about 10,000? We’ve seen examples of both discussed - so nothing is new here. Do we make CAs also pay penalty fees, so that the community can ensure there is adequate staffing to investigate and review this? If we do that, what’s to prevent CAs from just seeing that as buying indulgences?

This statement underestimates the reflexes of the Root programs. Requiring disclosure is meant as a first step toward understanding what's happening in reality and collecting some meaningful data by policy. Once Mozilla collects enough information to make a safe estimation, the policy can be updated to allow or forbid certain situations. If, for example, m.d.s.p. receives 10 or 20 revocation exception cases within a 12-month period and none of them is convincing to the community and module owners to justify the exception, the policy can be updated with clear rules about the risk of distrust if the revocation doesn't happen within 5 days. That would be a simple, clear rule. Does Mozilla have the information to make such an aggressive rule change today? Maybe.

Your whole proposal breaks down at scale. It’s like asking “What’s the harm if I start stealing candy bars - after all, it’s only a candy bar?” - without actually acknowledging the consequences of normalizing that behavior. It tries to frame the conversation as being about a $1 candy, which, while appealing, isn’t actually what is being discussed.

Maybe you’re blinded by optimism and faith in CAs. I think if you take a more realistic, grounded, and holistic view of the ecosystem - one that considers we were where you propose to go 8 years ago (and it was disastrous for the ecosystem), one that considers this is a shared commons, and one that acknowledges the misaligned incentives - you would realize we already know how and why this sort of suggestion doesn’t actually work in practice, because we have been there, done that.
    > Finally, CAs are terrible at assessing the risk to RPs. For
    > example, negative serial numbers were prolific prior to the
    > linters, and those have issues in as much as they are, for some
    > systems, irrevocable. This is because those systems implemented the
    > standards correctly - serials are positive INTEGERs - yet had to
    > account for the fact that CAs are improperly encoding them, such as
    > by "making" them positive (adding the leading zero). This leading
    > zero then doesn't get stripped off when looking up by Issuer &
    > Serial Number, because they're using the "spec-correct" serial
    > rather than the "issuer-broken" serial. That's an example where the
    > certificate "works", no report is filed, but the security and
    > ecosystem properties are fatally compromised. The alternatives for
    > such implementation are:
    > 1) Reject such certificates (but see above about market forces and
    >    interoperability)
    > 2) Correct both the certificate and the CRL/OCSP serial number
    >    (which then creates risk because you're not actually checking
    >    _any_ certificate's true serial)
    > 3) Allow negative serial numbers (which then makes it harder for
    >    others to do #1)
    >
    > As I said, CAs have been terrible at assessing risk to the
    > ecosystem for their decisions. The page at
    > https://wiki.mozilla.org/SecurityEngineering/mozpkix-testing#Things_for_CAs_to_Fix
    > shows how badly such interoperability harms improvements - for
    > example, all of these hacks that Mozilla had to add in order to
    > ship a more secure, more efficient certificate verifier.
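
The serial-number mismatch described in the quoted text above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code of any real verifier: the helper names, the two-byte sample serial and the set used as a revocation list are all hypothetical. It relies only on the fact that a DER INTEGER is big-endian two's complement, so content octets with the high bit set decode as a negative number unless a leading zero octet is added.

```python
def der_integer_value(content: bytes) -> int:
    """Interpret DER INTEGER content octets (big-endian two's complement)."""
    return int.from_bytes(content, "big", signed=True)


def make_positive(content: bytes) -> bytes:
    """Prepend 0x00 when the high bit is set, so the value reads as positive."""
    if content and content[0] & 0x80:
        return b"\x00" + content
    return content


# Hypothetical serial as an issuer might have mis-encoded it: high bit set.
issuer_broken = bytes.fromhex("8001")
spec_correct = make_positive(issuer_broken)

print(der_integer_value(issuer_broken))  # -32767
print(der_integer_value(spec_correct))   # 32769

# Hypothetical revocation list keyed by the serial value the issuer published.
revoked_serials = {der_integer_value(issuer_broken)}

# The verifier looks up the "corrected" serial and misses the revocation entry.
print(der_integer_value(spec_correct) in revoked_serials)  # False
```

In this model the certificate still validates, yet the lookup by the corrected serial never matches the entry the issuer published, which is the sense in which such a certificate becomes irrevocable for that system.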

    As I said earlier, times change. The bar is raised, this industry
    matures day-after-day, things are hopefully improving
    (security-wise).

You said that, without any systemic data, without any support. Having the same conversation tomorrow that we had today because, hey, “times change”, may even be true, but it isn’t productive in the least.

I already provided some facts that I believe assisted in the security improvement of the ecosystem. The paper you cited also agrees with that statement. It's an ongoing effort for continuous improvement.

I disagree that we’ve seen systemic improvements as a whole. There are a few CAs trying to do better, but the incident reporting of today clearly shows exactly what I’m saying - that the industry has not actually matured as you suggest. What has changed has largely been driven by those outside CAs - whether those who were wanting to become CAs (Amazon with certlint) or those analyzing CAs’ failures (ZLint).

If we truly care about the ecosystem, it doesn't really matter where the systemic improvements come from. CAs and Browsers have contributed to the Network Security Guidelines, the BRs (to improve and limit validation methods, add CAA and so much more). I agree we should expect every CA to develop tools or use existing ones to ensure they are complying with all rules. We occasionally see some exceptions and these are evaluated on a case-by-case basis. "Accidents" and mistakes do happen and as it has been discussed in the past, it's collective failures that pose the greatest risk and we have seen hard decisions being made to minimize or eliminate these risks.

    In conclusion, after repeatedly seeing CAs requesting or effectively
    taking more time to revoke certificates than the existing
    requirements allow, I believe that a policy rule that would require
    CAs to disclose revocation cases requiring more than 5 days to
    complete (i.e. revoke the certificate), provided that the CA submits
    risk analysis information after working with the affected
    Subscriber(s), is a reasonable way forward.

I think it is grossly negligent and irresponsible, and is only reasonable if one ignores the past two decades (such as by glibly saying “times change”). A proposal based on submitting risk analysis merely outsources the costs from the Subscriber onto this community and RPs in general - who could easily become consumed with reading thousands upon thousands of these. Such an act is incredibly hostile to meaningful trust in CAs and the ecosystem.

Far more compelling is to reduce the timeframe that CAs can “go rogue” not revoking, by reducing the overall certificate lifetime. By improving the rate at which certificates are replaced, the “hardship” you spoke to (though seemingly agree it’s not actually there) can be reduced. This can be done without introducing the need for costly, and subjective, risk assessments or “exceptions”.

In any event, I think it’s unproductive to try to bring this conversation up without concrete data. If multiple CAs committed to publishing all of their revocation data in a systemic way - reasons, hardships, etc (NOT just the exceptional cases) - and committed to making funds available to be used to rigorously analyze this (e.g. funding Mozilla to hire someone for this, funding peer reviewed papers) - it might be worth revisiting. Then we could have concrete data that could, for example, show that these “hardships” are one-in-a-million (certs), and more reflective of poor organization controls by CAs and Subscribers, rather than a systemic problem to address.

I already stated my reasoning for keeping the disclosure just for exceptions. Currently, the only systemic technical way of providing something about the revocation is the revocation reason, and that's limited by RFC5280.

I also protest against the "grossly negligent and irresponsible" part and I'm afraid statements like that alienate people from participating and proposing anything. Simply disagreeing would ultimately have the same effect in this conversation. You have already provided good arguments against my proposal for people to evaluate.



Dimitris.
_______________________________________________
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy
