On 30/11/2018 1:49 AM, Ryan Sleevi wrote:


On Thu, Nov 29, 2018 at 4:03 PM Dimitris Zacharopoulos via dev-security-policy <dev-security-policy@lists.mozilla.org> wrote:

    I didn't want to hijack the thread so here's a new one.


    Times and circumstances change.


You have to demonstrate that.

It's self-evident :-)


    When I brought this up at the Server
    Certificate Working Group of the CA/B Forum
    (https://cabforum.org/pipermail/servercert-wg/2018-September/000165.html),
    there was no open disagreement from CAs.

Look at the discussion during Wayne’s ballot. Look at the discussion back when it was Jeremy’s ballot. The proposal was as simplified as could be - modeled after 9.16.3 of the BRs. It would have allowed for a longer period - NOT an unbounded period, which is grossly negligent for publicly trusted CAs.

Agreed.


    However, think about CAs that
    decide to extend the 5 days (at their own risk) because of extenuating
    circumstances. Doesn't this community want to know what these
    circumstances are and evaluate the gravity (or not) of the situation?
    The only way this could happen in a consistent way among CAs would be to
    require it in some kind of policy.


This already happens. This is a matter of the CA violating any contracts or policies of the root store it is in, and is already being handled by those root stores - e.g. misissuance reports. What you’re describing as a problem is already solved, as are the expectations for CAs - that violating requirements is a path to distrust.

The only “problem” you’re solving is giving CAs more time, and there is zero demonstrable evidence, to date, about that being necessary or good - and rich and ample evidence of it being bad.

I already mentioned that this is separate from the incident report (of the actual mis-issuance). We have repeatedly seen post-mortems that say that for some specific cases (not all of them), the revocation of certificates will require more time. Even the underscore revocation deadline creates problems for some large organizations as Jeremy pointed out. I understand the compatibility argument and CAs are doing their best to comply with the rules but you are advocating there should be no exceptions and you say that without having looked at specific evidence that would be provided by CAs asking for exceptions. You would rather have Relying Parties lose access to internet services from one of the Fortune 500 companies. As a Relying Party myself, I would hate it if I couldn't connect to my favorite online e-shop or bank or webmail. So I'm still confused about which Relying Party we are trying to help/protect by requiring the immediate revocation of a Certificate that has 65 characters in the OU field.

I also see your point that "if we start making exceptions..." it's too risky. I'm just suggesting that there should be some tolerance for extended revocations (to help with collecting more information) which doesn't necessarily mean that we are dealing with a "bad" CA. I trust the Mozilla module owner's judgement to balance that. If the community believes that this problem is already solved, I'm happy with that :)


    > Phrased differently: You don't think large organizations are currently
    > capable, and believe the rest of the industry should accommodate that.

    "Tolerate" would probably be the word I'd use instead of "accommodate".


I chose accommodate, because you’d like the entire world to take on systemic risk - and it is indeed systemic risk, to users especially - to benefit some large companies.

Why stop with revocation, though? Why not just let CAs define their own validation methods if they think they’re equivalent? After all, if we can trust CAs to make good judgements on revocation, why can’t we also trust them with validation? Some large companies struggle with our existing validation methods, why can’t we accommodate them?

That’s exactly what one of the arguments against restricting validation methods was.

As I said, I think this discussion will not accomplish anything productive without a structured analysis of the data. Not anecdata from one or two incidents, but holistic - because for every 1 real need, there may have been 9,999 unnecessary delays in revocation with real risk.

How do CAs provide this? For *all* revocations, provide meaningful data. I do not see there being any value to discussing further extensions until we have systemic transparency in place, and I do not see any good coming from trying to change at the same time as placing that systemic transparency in place, because there’s no way to measure the (negative) impact such change would have.

I don't see how data and evidence for "all revocations" somehow makes things better, unless I misunderstood your proposal. It's not a balanced request. It would be a huge effort for CAs to write risk assessment reports for each revocation. Why not focus on the rare cases which justify the extra effort from CAs to write a disclosure letter requesting more days for revocation? Why not add some rules on what's the minimum information that's expected for these cases? If you want this to be part of the incident report, that's fine.

The systemic transparency you are asking for, as I understand it, would be provided through m.d.s.p. We already see incident reports being published here. CAs who seek more than 5 days for revoking affected certificates would disclose more details about the specifics of these revocations.


    >
    > Do you believe these organizations could respond within 5 days if
    > their internet connectivity was lost?

    I think there is different impact. Losing network connectivity would
    have "real" and large (i.e. all RPs) impact compared to installing a
    certificate with -say- 65 characters in the OU field which may cause
    very few problems to some RPs that want to use a certain web site.


So you do believe organizations are capable of making timely changes when necessary, and thus we aren’t discussing capabilities, but perceived necessity. And because some organizations have been misled as to the role of CAs, and thus don’t feel it’s necessary, don’t feel they should have to use that capability.

I’m not terribly sympathetic to that at all. As you mention, they can respond when all RPs are affected, so they can respond when their certificate is misissued and thus revoked.

    You describe it as a black/white issue. I understand your argument that
    other control areas will likely have issues but it always comes down to
    what impact and what damage these failed controls can produce. Layered
    controls and compensating controls in critical areas usually lower the
    risk of severe impact. The Internet is probably safe and will not break
    if for example a certificate with 65-character OU is used on a public
    web site. It's not the same as a CA issuing SHA1 Certificates with
    collision risk.


It absolutely is, and we have seen this time and time again. The CAs most likely to argue the position you’re taking are the CAs that have had the most issues.

Do we agree, at least, that any CA violating the BRs or Root Policies puts the Internet ecosystem at risk?

It seems the core of your argument is how much risk should be acceptable, and the answer is none. Zero. The point of postmortems is to get us to a point where, as an industry, we’ve taken every available step to reduce and eliminate that risk, by learning from our collective mistakes. Lives and businesses are on the line - a single mistake can cost billions - and there’s no excuse for just shrugging and saying “well, yanno, there’s risk and there’s risk”


CAs are evaluated using schemes based on Risk Management. There is no zero risk. It's like saying there is 100% security. You can add controls to minimize risk to acceptable levels. Even when mitigations are added, you have residual risk. However, layered controls and compensating controls help to avoid disasters. I just don't believe it's black or white and I think the module owners probably agree with that statement (https://groups.google.com/d/msg/mozilla.dev.security.policy/tbSkcGHg1kA/CkrM6taBAwAJ). If that was the case, every single BR violation or Root Policy violation would be treated as a trigger for a complete distrust.

Go read https://zakird.com/papers/zlint.pdf to see a systemic, thorough, analysis that supports what I described to you, and disagrees with your framing. We know what the warning signs are - and it’s continued framing of “low” risk that collectively presents “severe” risk.

I wasn't aware of that paper, it contains valuable information, thank you for sharing. Notice the abstract that says "We find that the number of errors has drastically reduced since 2012. In 2017, only 0.02% of certificates have errors". To me, this is a positive indicator that the ecosystem is continuously improving.




    >
    > Second, it presumes (incorrectly) that interoperability is not
    > something valuable. That is, if say the three existing, most popular
    > implementations all do not check whether or not it's longer than 64
    > characters (for example), and a fourth implementation would like to
    > come along, they cannot read the relevant standards and implement
    > something interoperable. This is because 'interoperability' is being
    > redefined as 'ignoring' the standard - which defeats the purposes of
    > standards to begin with. These choices - to permit deviations -
    > creates risks for the entire ecosystem, because there's no longer
    > interoperability. This is equally captured in
    > https://tools.ietf.org/html/draft-iab-protocol-maintenance-01
    >
    > The premise to all of this is that "CAs shouldn't have to follow
    > rules, browsers should just enforce them," which is shocking and
    > unfortunate. It's like saying "It's OK to lie about whatever you want,
    > as long as you don't get caught" - no, that line of thinking is just
    > as problematic for morality as it is for technical interoperability.
    > CAs that routinely violate the standards create risk, because they
    > have full trust on the Internet. If the argument is that the CA's
    > actions (of accidentally or deliberately introducing risk) is the
    > problem, but that we shouldn't worry about correcting the individual
    > certificate, that entirely misses the point that without correcting
    > the certificate, there's zero incentive to actually follow the
    > standards, and as a result, that creates risk for everyone.
    > Revocation, if you will, is the "less worse" alternative to complete
    > distrust - it only affects that single certificate, rather than every
    > one of the certificates the CA has issued. The alternative - not
    > revoking - simply says that it's better to look at distrust options,
    > and that's more risk for everyone.
    >

    I absolutely agree that interoperability is something valuable that
    should be pursued by the ecosystem. Browsers and the majority of CAs
    work in that direction. It's just the fact that if a browser strictly
    enforces a requirement from a standard (e.g. rejects a certificate that
    has an OU field with more than 64 characters), it makes a huge
    difference towards the goal for interoperability compared to a CA that
    just issues certificates with a maximum of 64 characters in the OU. If
    browsers enforced these rules, the difference would be so big that the
    problematic certificate would be immediately discovered by the
    Subscriber, who would complain to the CA and the Certificate would most
    likely be revoked immediately since it wouldn't be usable.
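
The 64-character limit discussed here is the `ub-organizational-unit-name` upper bound defined in RFC 5280. As a rough illustration only, the kind of check that either a CA-side linter or a strictly enforcing client would perform might look like the sketch below. The function names and the dict-shaped `subject` are hypothetical and not taken from any existing linter or browser; a real implementation would first parse the certificate's DER-encoded subject.

```python
from typing import Dict, List

# Upper bound from RFC 5280, Appendix A:
#   ub-organizational-unit-name INTEGER ::= 64
UB_ORGANIZATIONAL_UNIT_NAME = 64


def ou_within_bound(ou: str) -> bool:
    """Return True if the OU value has at most 64 characters."""
    return len(ou) <= UB_ORGANIZATIONAL_UNIT_NAME


def lint_subject(subject: Dict[str, List[str]]) -> List[str]:
    """Collect one finding per OU value that exceeds the upper bound.

    `subject` is a hypothetical, already-parsed mapping from attribute
    short names (e.g. "OU") to their string values.
    """
    findings = []
    for value in subject.get("OU", []):
        if not ou_within_bound(value):
            findings.append(
                f"OU has {len(value)} characters "
                f"(max {UB_ORGANIZATIONAL_UNIT_NAME})"
            )
    return findings
```

The check itself is the same wherever it runs. The disagreement in this thread is about who should run it (the CA before issuance, or the browser at validation time) and what should happen when it fails.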


I literally provided you an explanation for why what you’re describing is problematic and unreasonable. Please do re-read it. In a new system, sure, that’d be great - but the existing system absolutely penalizes first movers.

Look at SC12 as an example. CAs would really like browsers to make that change, because then they can have their customers blame browsers for their misissuance. The customer is not going to say “Guess I should replace my cert”, but rather, blame the browser. The links I provided showed how CAs’ widespread disregard for the standards created real compatibility and security issues - and a browser just rejecting them doesn’t actually fix it, because the site says “well, works in other browsers, so the bug must be the browsers, not mine.”

I have listened to this argument before but unfortunately it leads nowhere. How badly are we interested in interop to justify being "the bad guys" and how "disruptive" will our actions be for Relying Parties? It is a very difficult problem to solve but the ecosystem has made progress:
- disclosure of intermediate CA Certificates
- identifying and fixing problematic OCSP responders
- increased supervision to the issued certificates with CT and linters providing public information about mis-issuances
- browsers enforcing BR requirements with code (e.g. certificate validity duration)

With these controls in place, CAs are very much obligated to follow the rules or face the consequences. Browsers use telemetry to detect violations of the standards and create plans on addressing those issues. These plans usually include discussions in m.d.s.p. or the CA/B Forum in order for the CAs to participate and create the necessary rules -along with the browsers- to address these incompatibilities.


    What I meant to say in my original argument is that the "damage" created
    by a certificate that fails to strictly comply with RFC5280 and the rest
    of the X.* standards, as long as popular browsers "allow it", is
    primarily an issue between a Subscriber (that maintains a web site), and
    the particular Relying Parties that want to establish a secure
    connection to that web site. That's not the entire Internet. This is why
    I compared it with "a situation where a site operator forgets to send
    the intermediate CA Certificate in the chain. These particular RPs will
    fail to get TLS working when they visit the Subscriber's web site".


It’s a perfect example of why your argument DOESN’T work. As Mozilla has shared in the CA/B Forum, people don’t fix their site - they blame the browser, and keep on with the brokenness. Firefox is the one having to change to “accommodate” that.

Or, they might blame the CA for providing them a "thing" that doesn't work with all major browsers :)


    Perhaps I have misunderstood your argument but when we are discussing
    revocation timelines, it looks a little extreme to say that a CA
    claiming "some important reasons" (I'm not saying if they are valid
    reasons or not) for delaying a certificate revocation has
    zero incentive to follow the standards.


It isn’t extreme, because even the incident reports from 2014/2015 show exactly this argument being made. Your arguments themselves continue to show that, by suggesting that “only” the site is impacted. And yet, if every site is doing it because “only” that site is impacted, you have the whole ecosystem doing it.

This myopic view of trying to assess per-Certificate is inherently non-scalable. You haven’t actually proposed any way to address that. What happens when a CA is doing 100 “exceptional” non-revocations? What about 10,000? We’ve seen examples of both discussed - so nothing is new here. Do we make CAs also pay penalty fees, so that the community can ensure there is adequate staffing to investigate and review this? If we do that, what’s to prevent CAs from just seeing that as buying indulgences?

This statement underestimates the reflexes of the Root programs. The reason for requiring disclosure is meant as a first step for understanding what's happening in reality and collect some meaningful data by policy. Once Mozilla collects enough information to make a safe estimation, the policy can be updated to allow or forbid certain situations. If, for example, m.d.s.p. receives 10 or 20 revocation exception cases within a 12-month period and none of them is convincing to the community and module owners to justify the exception, the policy can be updated with clear rules about the risk of distrust if the revocation doesn't happen within 5 days. That would be a simple, clear rule. Does Mozilla have the information to make such an aggressive rule change today? Maybe.


Your whole proposal breaks down at scale. It’s like asking “What’s the harm if I start stealing candy bars - after all, it’s only a candy bar?” - without actually acknowledging the consequences of normalizing that behavior. It tries to frame the conversation as being about a $1 candy, which, while appealing, isn’t actually what is being discussed.

Maybe you’re blinded by optimism and faith in CAs. I think if you take a more realistic, grounded, and holistic view of the ecosystem - one that considers we were where you propose to go 8 years ago (and it was disastrous for the ecosystem), one that considers this is a shared commons, and one that acknowledges the misaligned incentives - you would realize we already know how and why this sort of suggestion doesn’t actually work in practice, because we have been there, done that.

    > Finally, CAs are terrible at assessing the risk to RPs. For example,
    > negative serial numbers were prolific prior to the linters, and those
    > have issues in as much as they are, for some systems, irrevocable.
    > This is because those systems implemented the standards correctly -
    > serials are positive INTEGERs - yet had to account for the fact that
    > CAs are improperly encoding them, such as by "making" them positive
    > (adding the leading zero). This leading zero then doesn't get stripped
    > off when looking up by Issuer & Serial Number, because they're using
    > the "spec-correct" serial rather than the "issuer-broken" serial.
    > That's an example where the certificate "works", no report is filed,
    > but the security and ecosystem properties are fatally compromised. The
    > alternatives for such implementation are:
    > 1) Reject such certificates (but see above about market forces and
    > interoperability)
    > 2) Correct both the certificate and the CRL/OCSP serial number (which
    > then creates risk because you're not actually checking _any_
    > certificates true serial)
    > 3) Allow negative serial numbers (which then makes it harder for
    > others to do #1)
    >
    > As I said, CAs have been terrible at assessing risk to the ecosystem
    > for their decisions. The page at
    > https://wiki.mozilla.org/SecurityEngineering/mozpkix-testing#Things_for_CAs_to_Fix
    > shows how bad such interoperability harms improvements - for example,
    > all of these hacks that Mozilla had to add in order to ship a more
    > secure, more efficient certificate verifier.
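
The serial-number mismatch described in the quoted text above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular verifier: the serial `80 01` is an invented example, and the revocation data is modelled as a plain set of raw serial bytes, which is a simplification of a real CRL or OCSP lookup.

```python
def der_integer_value(content: bytes) -> int:
    """Decode DER INTEGER content octets (big-endian two's complement)."""
    return int.from_bytes(content, "big", signed=True)


def der_integer_content(value: int) -> bytes:
    """Encode an int as the shortest two's-complement byte string,
    which is what DER requires for INTEGER content octets."""
    length = 1
    while True:
        try:
            return value.to_bytes(length, "big", signed=True)
        except OverflowError:
            length += 1


# Serial as (mis)encoded by a hypothetical CA: the high bit is set and
# there is no leading 0x00, so a correct DER decoder reads it as negative.
issuer_broken = bytes.fromhex("8001")

# A verifier that "makes" the serial positive reinterprets the same bytes
# as unsigned and re-encodes them, which adds the leading zero octet.
spec_correct = der_integer_content(
    int.from_bytes(issuer_broken, "big", signed=False)
)

# Assumed revocation data that lists the serial exactly as the CA encoded it.
revoked_serials = {issuer_broken}


def is_revoked(serial: bytes) -> bool:
    """Byte-wise lookup, as in a naive Issuer & Serial Number match."""
    return serial in revoked_serials
```

With these definitions, `der_integer_value(issuer_broken)` is negative, `spec_correct` is `00 80 01`, and `is_revoked(spec_correct)` is false even though the certificate's own serial is in the revoked set. That is the "works, but is irrevocable" outcome the quoted paragraph describes.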

    As I said earlier, times change. The bar is raised, this industry
    matures day-after-day, things are hopefully improving
    (security-wise).

You said that, without any systemic data, without any support. Having the same conversation tomorrow that we had today because, hey, “times change”, may even be true, but it isn’t productive in the least.

I already provided some facts that I believe assisted in the security improvement of the ecosystem. The paper you cited also agrees with that statement. It's an ongoing effort for continuous improvement.


I disagree that we’ve seen systemic improvements as a whole. There are a few CAs trying to do better, but the incident reporting of today clearly shows exactly what I’m saying - that the industry has not actually matured as you suggest. What has changed has largely been driven by those outside CAs - whether those who were wanting to become CAs (Amazon with certlint) or those analyzing CAs’ failures (ZLint).

If we truly care about the ecosystem, it doesn't really matter where the systemic improvements come from. CAs and Browsers have contributed to the Network Security Guidelines and the BRs (to improve and limit validation methods, add CAA and so much more). I agree we should expect every CA to develop tools or use existing ones to ensure they are complying with all rules. We occasionally see some exceptions and these are evaluated on a case-by-case basis. "Accidents" and mistakes do happen and, as has been discussed in the past, it's collective failures that pose the greatest risk and we have seen hard decisions being made to minimize or eliminate these risks.


    In conclusion, after repeatedly seeing CAs requesting or effectively
    taking more time to revoke certificates than the existing requirements allow,
    I believe that a policy rule that would require CAs to disclose
    revocation cases requiring more than 5 days to complete (i.e. revoke the
    certificate), provided that the CA submits risk analysis information
    after working with the affected Subscriber(s), is a reasonable way
    forward.


I think it is grossly negligent and irresponsible, and is only reasonable if one ignores the past two decades (such as by glibly saying “times change”). A proposal based on submitting risk analysis merely outsources the costs from the Subscriber onto this community and RPs in general - who could easily become consumed with reading thousands upon thousands of these. Such an act is incredibly hostile to meaningful trust in CAs and the ecosystem.

Far more compelling is to reduce the timeframe that CAs can “go rogue” not revoking, by reducing the overall certificate lifetime. By improving the rate at which certificates are replaced, the “hardship” you spoke to (though seemingly agree it’s not actually there) can be reduced. This can be done without introducing the need for costly, and subjective, risk assessments or “exceptions”.

In any event, I think it’s unproductive to try to bring this conversation up without concrete data. If multiple CAs committed to publishing all of their revocation data in a systemic way - reasons, hardships, etc (NOT just the exceptional cases) - and committed to making funds available to be used to rigorously analyze this (e.g. funding Mozilla to hire someone for this, funding peer reviewed papers) - it might be worth revisiting. Then we could have concrete data that could, for example, show that these “hardships” are one-in-a-million (certs), and more reflective of poor organization controls by CAs and Subscribers, rather than a systemic problem to address.

I already stated my reasoning for keeping the disclosure just for exceptions. Currently, the only systemic technical way of providing something about the revocation is the revocation reason, and that's limited by RFC5280.
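
For reference, the revocation reasons that RFC 5280 defines (the `CRLReason` enumeration in section 5.3.1) can be written out as below. The Python names are an illustrative transliteration and `describe` is a hypothetical helper; only the numeric codes and their meanings come from the RFC.

```python
from enum import IntEnum


class CRLReason(IntEnum):
    """Reason codes from RFC 5280, section 5.3.1."""
    UNSPECIFIED = 0
    KEY_COMPROMISE = 1
    CA_COMPROMISE = 2
    AFFILIATION_CHANGED = 3
    SUPERSEDED = 4
    CESSATION_OF_OPERATION = 5
    CERTIFICATE_HOLD = 6
    # value 7 is not used
    REMOVE_FROM_CRL = 8
    PRIVILEGE_WITHDRAWN = 9
    AA_COMPROMISE = 10


def describe(code: int) -> str:
    """Return the reason name for a code, or 'unknown' if undefined."""
    try:
        return CRLReason(code).name
    except ValueError:
        return "unknown"
```

None of these codes carries a free-text explanation, a hardship claim, or the time taken to revoke, so richer disclosure of the kind discussed in this thread would have to travel outside the CRL or OCSP response.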

I also protest against the "grossly negligent and irresponsible" part and I'm afraid statements like that alienate people from participating and proposing anything. Simply disagreeing would ultimately have the same effect in this conversation. You have already provided good arguments against my proposal for people to evaluate.


Dimitris.
_______________________________________________
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy