Sure, happy to provide more details! The fundamental issues here are the scale at which Let's Encrypt issues certificates, and the automated nature of clients' interactions with Let's Encrypt.
LE currently has 150M certificates active, all (as of March 1st) signed by the same issuer certificate, R3. In the event of a mass revocation, that means a CRL with 150M entries in it. At an average of 38 bytes per entry in a CRL, that means nearly 6GB worth of CRL. Passing around a single 6GB file isn't good for reliability (it's much better to fail-and-retry downloading one of a hundred 60MB files than to fail-and-retry a single 6GB file), so sharding seems like an operational necessity. Even without an LE-initiated mass revocation event, one of our large integrators (such as a hosting provider with millions of domains) could decide, for any reason, to revoke every single certificate we have issued to them. We need to be resilient to these kinds of events.

Once we've decided that sharding is necessary, the next question is "static or dynamic sharding?". It's easy to imagine a world in which we usually have only one or two CRL shards, but dynamically scale that number up to keep individual CRL sizes small if/when revocation rises sharply. There are a lot of "interesting" (read: difficult) engineering problems here, and we've decided not to go the dynamic route; but even if we did, it would obviously require being able to change the list of URLs in the JSON array on the fly.

For static sharding, we would need to constantly maintain a large set of small CRLs, such that even in the worst case no individual CRL would become too large. I see two main approaches: maintaining a fully static set of shards into which our certificates are bucketed, or maintaining rolling time-based shards (much like CT shards).

Maintaining a static set of shards has the primary advantage of "working like CRLs usually work". A given CRL has a scope (e.g. "all certs issued by R3 whose serial number is equal to 1 mod 500"), it has a nextUpdate, and a new CRL with the same scope will be re-issued at the same path before that nextUpdate is reached. However, it makes re-sharding difficult.
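To make the sizing math and the static mod-N bucketing concrete, here is a minimal sketch. The 150M-certificate count, 38-byte entry size, and 500-shard example are the figures from this post; the helper names and the 100-shard split are purely illustrative.

```python
# Rough worst-case CRL sizing, using the estimates from this post:
# 150M active certs, ~38 bytes per CRL entry.
ACTIVE_CERTS = 150_000_000
BYTES_PER_ENTRY = 38

def total_crl_bytes(entries: int, entry_size: int = BYTES_PER_ENTRY) -> int:
    """Size of a single CRL covering `entries` revoked certificates."""
    return entries * entry_size

def shard_size_bytes(entries: int, shards: int) -> float:
    """Average CRL size when the same revocations are split across shards."""
    return total_crl_bytes(entries) / shards

def static_shard(serial: int, num_shards: int = 500) -> int:
    """Static bucketing: a cert's shard is fixed by its serial number,
    matching a scope like "all certs issued by R3 whose serial == k mod 500"."""
    return serial % num_shards

# Worst case: every active certificate revoked at once.
print(total_crl_bytes(ACTIVE_CERTS) / 1e9)        # 5.7 (GB, in one monolithic CRL)
print(shard_size_bytes(ACTIVE_CERTS, 100) / 1e6)  # 57.0 (MB per shard, 100 shards)
```

The fail-and-retry argument falls out of the second number: re-downloading one 57MB shard after a network hiccup is far cheaper than restarting a 5.7GB transfer.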
If Let's Encrypt's issuance rises enough that we want to have 1000 shards instead of 500, we'll have to re-shard every cert, re-issue every CRL, and update the list of URLs in the JSON. And if we're updating the list, we should have standards around how that list is updated and how its history is stored, and then we'd prefer that those standards allow for rapid updates.

The alternative is to have rolling time-based shards. In this case, every X hours we would create a new CRL, and every certificate we issue over the next period would belong to that CRL. Similar to the above, these CRLs have nice scopes: "all certs issued by R3 between AA:BB and XX:YY". When every certificate in one of these time-based shards has expired, we can simply stop re-issuing it. This has the advantage of solving the re-sharding problem: if we want to make our CRLs smaller, we just increase the frequency at which we initialize a new one, and 90 days later we've fully switched over to the new size. It has the disadvantage, from your perspective, of requiring us to add a new URL to the JSON array every period (and we get to drop an old URL from the array every period as well).

So why would we want to put each CRL re-issuance at a new path, and update our JSON even more frequently? Because we have reason to believe that various root programs will soon seek CRL re-issuance on the order of every 6 hours, not every 7 days as currently required; we will have many shards; and overwriting files is a dangerous operation prone to many forms of failure. Our current plan is to surface CRLs at paths like `/crls/:issuerID/:shardID/:thisUpdate.der`, so that we never have to overwrite a file. Similarly, our JSON document can always be written to a new file, and the path in CCADB can point to a simple handler which always serves the most recent file.
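As a sketch of how rolling time-based shards and the static path scheme fit together: a certificate's shard is determined by its issuance window, and every re-issuance of a shard's CRL lands at a fresh `thisUpdate`-stamped path. The 6-hour period, the epoch, and the function names below are illustrative assumptions, not Boulder's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Illustrative: one new shard opens every 6 hours.
SHARD_PERIOD = timedelta(hours=6)
# Arbitrary epoch, chosen here only so shard IDs are stable integers.
EPOCH = datetime(2021, 1, 1, tzinfo=timezone.utc)

def shard_for(issued_at: datetime, period: timedelta = SHARD_PERIOD) -> int:
    """Rolling time-based sharding: all certs issued within the same
    `period`-long window belong to the same CRL shard, giving the shard a
    scope like "all certs issued by R3 between window start and window end"."""
    return int((issued_at - EPOCH) / period)

def crl_path(issuer_id: str, shard_id: int, this_update: datetime) -> str:
    """Static, never-overwritten path for each re-issuance of a shard's CRL,
    following the /crls/:issuerID/:shardID/:thisUpdate.der scheme."""
    stamp = this_update.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"/crls/{issuer_id}/{shard_id}/{stamp}.der"

issued = datetime(2021, 3, 1, 10, 30, tzinfo=timezone.utc)
shard = shard_for(issued)
# Re-issuing the same shard later produces a *new* path; the old file stays put.
print(crl_path("R3", shard, datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc)))
```

Note how re-sharding becomes trivial under this scheme: shrinking `SHARD_PERIOD` only affects certificates issued from that point forward, and once the last certificate in an old, larger shard expires, that shard simply stops being re-issued.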
Additionally, this means that anyone in possession of one of our JSON documents can fetch all the CRLs listed in it and get a *consistent* view of our revocation information as of that time.

I believe that there is an argument to be made here that this plan increases the auditability of the CRLs, rather than decreasing it. Root programs could require that any published JSON document be valid for a certain period of time, and that all CRLs within that document remain available for that period as well. Or even that historical versions of CRLs remain available until every certificate they cover has expired (which is what we intend to do anyway). Researchers can crawl our history of CRLs and examine revocation events in more detail than previously available.

Regardless, even without statically-pathed, timestamped CRLs, I believe that the merits of rolling time-based shards are sufficient to be a strong argument in favor of dynamic JSON documents.

I hope this helps and that I addressed your questions,
Aaron

On Thu, Feb 25, 2021 at 9:53 AM Ryan Sleevi <r...@sleevi.com> wrote:
>
> On Thu, Feb 25, 2021 at 12:33 PM Aaron Gable via dev-security-policy <
> dev-security-policy@lists.mozilla.org> wrote:
>
>> Obviously this plan may have changed due to other off-list
>> conversations, but I would like to express a strong preference for the
>> original plan. At the scale at which Let's Encrypt issues, it is likely
>> that our JSON array will contain on the order of 1000 CRL URLs, and
>> will add a new one (and age out an entirely-expired one) every 6 hours
>> or so. I am not aware of any existing automation which updates CCADB at
>> that frequency.
>>
>> Further, from a resiliency perspective, we would prefer that the CRLs
>> we generate live at fully static paths. Rather than overwriting CRLs
>> with new versions when they are re-issued prior to their nextUpdate
>> time, we would leave the old (soon-to-be-expired) CRL in place, offer
>> its replacement at an adjacent path, and update the JSON to point at
>> the replacement. This process would have us updating the JSON array on
>> the order of minutes, not hours.
>
> This seems like a very inefficient design choice, and runs contrary to
> how CRLs are deployed by, well, literally anyone using CRLs as
> specified, since the URL is fixed within the issued certificate.
>
> Could you share more about the design of why? Both for the choice to use
> sharded CRLs (since that is the essence of the first concern), and the
> motivation to use fixed URLs.
>
>> We believe that the earlier "URL to a JSON array..." approach makes
>> room for significantly simpler automation on behalf of CAs without
>> significant loss of auditability. I believe it may be helpful for the
>> CCADB field description (or any upcoming portion of the MRSP which
>> references it) to include specific requirements around the cache
>> lifetime of the JSON document and the CRLs referenced within it.
>
> Indirectly, you've highlighted exactly why the approach you propose
> loses auditability. Using the URL-based approach puts the onus on the
> consumer to try and detect and record changes, introduces greater
> operational risks that evade detection (e.g. stale caches on the CA's
> side for the content of that URL), and encourages or enables designs
> that put greater burden on consumers.
>
> I don't think this is suggested because of malice, but I do think it
> makes it significantly easier for malice to go undetected, and for
> accurate historic information to be hidden or made too complex to
> maintain.
>
> This is already a known and, as of recently, studied problem with CRLs
> [1]. Unquestionably, you are right for highlighting and emphasizing that
> this constrains and limits how CAs perform certain operations. You
> highlight it as a potential bug, but I'd personally been thinking about
> it as a potential feature. To figure out the disconnect, I'm hoping you
> could further expand on the "why" of the design factors for your
> proposed design.
>
> Additionally, it'd be useful to understand how you would suggest CCADB
> consumers maintain an accurate, CA-attested log of changes.
> Understanding such changes is an essential part of root program
> maintenance, and it does seem reasonable to expect CAs to need to adjust
> to provide that, rather than give up on the goal.
>
> [1] https://arxiv.org/abs/2102.04288

_______________________________________________
dev-security-policy mailing list
dev-security-policy@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security-policy