Hi all,

I don't normally participate in these mailing lists, and the last time I did, I found the lack of discussion discouraging: what little discussion did occur wasn't taken seriously and was laced with complacency. I'm stating up front that I don't have much hope for this message to be acted upon. That said, multiple people have strongly encouraged _someone_ to write to the mailing list and bring the concerns of multiple ACME client developers to your attention.

I speak for myself, but my views have been formed from a combination of personal experience developing ACME clients and discussion with other ACME client developers. So when I say "we" I do so loosely; sometimes it might just be me.

First, I want to say: overall we like the idea of proactive ACME clients being able to know whether a certificate needs to be replaced sooner than expected, and we're glad to see an attempt at a solution drafted for standardization. But some of us do not think (current draft) ARI is The Way.

Now that several ACME client authors have had the opportunity to implement the spec, we've noticed some issues: some are fundamental flaws in the concept of ARI, and some are problems in implementation. Initially these concerns were raised at the Let's Encrypt forums:

- https://community.letsencrypt.org/t/can-ari-conforming-clients-be-granted-exemptions-to-relevant-rate-limits/195600?u=mholt
- https://community.letsencrypt.org/t/thoughts-from-starting-to-play-with-ari/200276?u=mholt
- https://community.letsencrypt.org/t/ari-rate-limits/198720?u=mholt
- https://community.letsencrypt.org/t/ari-retry-after-header/195471?u=mholt

And the overwhelming response seems to be, "Meh, take it to the mailing list." (Except for one response by LE staff about rate limits, which was appreciated, at least.) So here we are.

Cutting to the chase: With respect to ARI, ACME servers and clients have conflicts of interest.

The ACME client's goal is to keep the site up (with renewed and unrevoked certificates); the optimal way to do this is to start renewing early and retry often.

The ACME server's goal is to keep the service up; the optimal way to do this is to suppress clients that overload its capacity.

Obviously, these two goals are in opposition to each other. Proactive clients can spike demand, which can cause service interruptions. But service interruptions make clients more paranoid, so they retry even more often until it works, and so on.

ARI narrows the timeframe in which a conforming client can retry failed renewals, which reduces reliability more as time goes on. Without ARI, this window is a reasonable ~60 days. With ARI, however, the window is reduced to just a few minutes, hours, or days. The less time until expiration, the less hope there is to renew the cert in time. As the draft currently stands, this is in the server's interest, but not the client's.

I can tell you, with the current draft, my ACME clients will use ARI as a signal to immediately try renewing a certificate, not for scheduling a renewal in the future. Here's why.

The ACME client's goal is to keep the site up (with renewed and unrevoked certificates). If everything always worked, we'd simply renew after about 99% of the certificate's lifetime. But obviously, that's not reality.

In the presence of failures/uncertainty, the optimal way to maximize uptime is to start renewing early and retry often. In fact, just constantly be renewing. This offers the maximum possible chances to successfully get a certificate. But obviously, that's not reality.

CAs rightly enforce rate limits, and service uptime is actually Pretty Good most of the time, so we can reduce network traffic, load on the CA, and pressure on CT infrastructure by waiting until about 2/3 into a certificate's lifespan before trying to renew. (With Let's Encrypt certificates this gives 30 days of runway.) This is a fair balance and works well in practice. But unfortunately, reality's not that simple.

There are two off-nominal events that are often mentioned as the motivation for ARI:

1) Revocation
2) Traffic smoothing around expected maintenance or heavy load

Both of these can interfere with our happy little status-quo. Revocation means we need to replace the certificate sooner than expected, and maintenance or congestion means we may need to renew the certificate later than expected.

Enter ARI.

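For reference, the "2/3 of the lifespan" rule described above is nothing fancy. Here's a rough sketch in Python, assuming a 90-day certificate; the function name and the dates are placeholders of mine, not taken from any particular client:

```python
from datetime import datetime, timedelta, timezone

def default_renewal_time(not_before, not_after, numerator=2, denominator=3):
    """Return the instant at which numerator/denominator of the lifetime has elapsed."""
    lifetime = not_after - not_before
    # Integer arithmetic on the timedelta avoids float rounding.
    return not_before + (lifetime * numerator) // denominator

# Hypothetical 90-day certificate.
not_before = datetime(2023, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

renew_at = default_renewal_time(not_before, not_after)
runway = not_after - renew_at  # time left to retry if renewals fail

print("start renewing at:", renew_at.isoformat())
print("runway:", runway)
```

With those inputs the client starts renewing 60 days in and has 30 days left to retry, which is the runway mentioned above.
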
ARI is the CA saying, "We suggest -- but do not require -- this specific timeframe within which to renew your certificate." There are some problems with this:

1) It is optional. No one will implement this. OK, some clients will -- but I can say with authority from years of experience that optional restrictions are not typically favored. Very little mainstream software follows best practices to a tee.

2) A narrower renewal timeframe makes clients less reliable. In theory it should make them *more* reliable since it smooths out traffic, thus improving CA availability. But this assumes that most clients actually implement and follow ARI. Since it's optional, I don't see that happening. Especially since most ACME clients are still running as static cron jobs like it's 2015...

I'm sure the ARI window doesn't really change in the nominal case, which is 99.9..9% of the time. In fact, Let's Encrypt's ARI seems to correspond with when my clients attempt renewals on their own anyway. (So in that sense, ARI is actually useless 99.9..9% of the time?)

But when a renewal window does change, what does that mean? Well, something is wrong. Either the certificate is being revoked, or the CA anticipates downtime or availability issues. Uh oh. That's bad news for a good little client which is trying its best to keep its sites (potentially tens of thousands of them) online.

If we wait until the (adjusted) window to start renewing, we run ourselves closer to the imminently-impending revocation or the expiration of the certificate, lowering our chances of a successful renewal. If this is a mass or CA-wide event, other clients have surely noticed too. Best to renew ASAP and give ourselves more chances for success. Worst-case scenario, we'll retry all the way into the designated window in which we expect to be able to get a certificate anyway. And we might have to do this for tens of thousands of certificates.

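To be concrete, "renew ASAP and keep retrying" doesn't have to mean hammering the CA. Here's a minimal sketch, assuming exponential backoff with full jitter; the function name, the 60-second base, and the 6-hour cap are placeholder values of mine, not anything from the draft or from a specific client:

```python
import random

def retry_delays(attempts=8, base=60.0, cap=6 * 3600.0, rng=None):
    """Return the seconds to wait before each retry, after an immediate first attempt fails.

    The ceiling doubles on every attempt up to `cap`, and each delay is drawn
    uniformly from [0, ceiling] so that many clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays

for i, delay in enumerate(retry_delays(), start=1):
    print(f"retry {i}: wait {delay:.0f}s")
```

The randomization is the important part: if a CA-wide event makes thousands of clients start retrying at once, their attempts spread out instead of arriving in synchronized waves.
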
Because ARI is optional, it only acts as an early warning for clients that wish for an advantage over other clients with the same goal when resources are scarce. In these conditions, it's first-come, first-served and clients compete to preserve uptime for all their sites. (I think clients can still do this respectfully with backoff and jitter.)

Note that this behavior is still in compliance with the draft ARI spec, which says:

    Conforming clients MUST attempt renewal at a time of their choosing based on the suggested renewal window.

It doesn't say the renewal MUST be attempted "within" the window, just "based on" the window. (A minor language change to the spec, by the way, will not change client behaviors. I think we need to take a different approach to ARI, read on.)

Anyway, a few more practical issues/questions:

1) Many CAs enforce rate limits. If clients are to honor ARI windows, we would need a guarantee that the first successful cert within the ARI window will be allowed regardless of relevant rate limits. Because ARI restricts a client's ability to spread out renewals when managing certificates in bulk with respect to rate limits, the rate limits must NOT be a blocker when honoring ARI.

2) If ARI were actually enforced, some concerns would be resolved... for example, we could have assurances that other ACME clients are doing the same, thus improving CA availability. It would essentially be the CA scheduling each individual certificate for each ACME client instance -- that's quite a powerful idea, as long as availability is guaranteed (which it's not).

3) ARI does not scale well. Some ACME clients manage 10K+ certificates, and in that case the client would have to check the ARI for at least 24 certificates per hour to get through them in a month. Deferring to the Retry-After header may result in insufficient throughput. The current expectation or convention is to check every certificate every 6-12 hours, or tens of thousands of checks per day.
One endpoint per certificate multiple times per day is quite saturating. This is a considerable burden for both ACME clients and servers. I would like to explore options that do not involve 2+ HTTP requests per certificate.

4) Crafting the URL is convoluted. As Peter Cooper described it, "The core issue is that the URL you need to construct is based on an OCSP structure identifying the certificate, which requires taking one's existing certificate and parsing out the serial number and issuer, and also taking the intermediate certificate that signed it and getting its public key too. So rather than just, like, using the fingerprint of the existing leaf or something similarly simple that a lot of tooling can already give you, one needs to really dig into both the leaf, and the intermediate, and hash various pieces thereof, and then take all that to build a new ASN.1 structure." Why are we striving for near-parity with an OCSP request?? This should be orthogonal to OCSP, right?

5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET request is not authenticated. Even if the information is not strictly sensitive, I can totally see some browsers or tools using ARI as a signal that a certificate is being revoked, and thus can no longer be trusted, and thus block a site before a server even sees that it needs to renew its cert. I could be incorrect, but can't the information needed to obtain ARI be scraped from CT logs? If so, I think a global ARI monitor/database is inevitable, and that has interesting implications that I'm not sure have been fully realized.

All in all, the current ARI spec feels a little rushed. I'm hoping Let's Encrypt's production deployment is meant to help gather feedback about ARI before finalizing it, rather than to solidify it. Can we revisit both its fundamentals and practical implications too?

I would like to explore some alternatives to the current draft.
I can think of two approaches that might address these concerns:

A) Instead of a totally separate flow to obtain ARI, simply utilize a Retry-After header in the flow of existing ACME responses. Upon finalizing an order, the ACME server can respond with a Retry-After header which acts as the current-draft Retry-After header for ARI responses. The client then attempts renewal at/after the Retry-After time, but with the OCSP CertID added to the NewOrder object; this indicates to the ACME server that the client is asking if now is a good time to renew the certificate indicated by the CertID. If it's not a good time, the ACME server can reply as such, with another Retry-After, and the client then waits and repeats, until the server actually issues the certificate. If the client needs the certificate immediately, simply omit the CertID from the NewOrder and the normal, "non-ARI" flow is assumed. This is backwards-compatible and requires no additional infrastructure or endpoints.

B) If we do need a separate flow for some reason, I would like to see a single endpoint containing a static JSON resource that describes all the active certificates that need early renewal, rather than one tediously-crafted URL per certificate. Certificates can be described by their NotBefore or NotAfter dates, serial numbers, or other relevant attributes. For example, if just a few certs with certain serials were misissued, those serials could be enumerated at this endpoint. Or if a mass revocation is happening, the timeframe of NotBefore dates could be listed, and ACME clients can simply check against the certs they manage with those dates, and replace them. You can represent millions of certificates in, like, 85 bytes this way. And it's way less work for clients and servers.

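To illustrate option B, here's a rough sketch of what a client-side check against such a resource might look like. The JSON layout and the field names ("serials", "notBeforeRanges", "from", "to") are made up by me purely for illustration, as are the serials and dates; nothing like this is specified anywhere:

```python
import json
from datetime import datetime

# Hypothetical document a CA might publish at a single static URL.
NOTICE_JSON = """
{
  "serials": ["03a1b2c3"],
  "notBeforeRanges": [
    {"from": "2023-06-01T00:00:00+00:00", "to": "2023-06-03T00:00:00+00:00"}
  ]
}
"""

def needs_early_renewal(serial_hex, not_before, notice):
    """Return True if a managed certificate is covered by the notice."""
    listed = {s.lower() for s in notice.get("serials", [])}
    if serial_hex.lower() in listed:
        return True
    for entry in notice.get("notBeforeRanges", []):
        start = datetime.fromisoformat(entry["from"])
        end = datetime.fromisoformat(entry["to"])
        if start <= not_before <= end:
            return True
    return False

notice = json.loads(NOTICE_JSON)

# (serial, NotBefore) pairs for the certificates this client manages.
managed = [
    ("03a1b2c3", datetime.fromisoformat("2023-05-01T00:00:00+00:00")),
    ("0badcafe", datetime.fromisoformat("2023-06-02T12:00:00+00:00")),
    ("0ddba11", datetime.fromisoformat("2023-07-01T00:00:00+00:00")),
]

to_renew = [serial for serial, nb in managed if needs_early_renewal(serial, nb, notice)]
print(to_renew)
```

The point is that one fetch of one small document covers every certificate the client manages, whether that's three or thirty thousand, and the client needs nothing beyond the serial and NotBefore it already has on hand.
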
And lastly, drop the "window" idea -- certificates described by this endpoint should be renewed ASAP: try to renew immediately, then back off and retry, for reasons described above (once we know the future is uncertain and/or revocation is imminent, current certs can't be trusted and/or clients must try to preserve their sites' uptime).

And finally, I want to bring attention to the longer-term prospects for ARI: it's quite possible that ARI will become irrelevant before it is widely adopted by most clients. This itself may discourage adoption. As stated above, ARI has two primary use cases: revocation and traffic smoothing. As we push for shorter certificate lifetimes, revocation should become irrelevant. And traffic smoothing will perhaps become a natural consequence as clients are renewing more frequently anyway.

We all know revocation and long-lived certificates are broken, so I'd rather WebPKI developers focus our energy on the ACTUAL goal: short-lived certificates. We should not be focusing our ecosystem resources on infrastructure that acts as a band-aid for a broken leg.

That said, I'm not opposed to the general idea of a renewal hint for clients in the meantime as long as it's simple, makes fundamental sense, and is actually effective. I think the issues described above are mostly solvable and now hopefully we can get there from here.

_______________________________________________
Acme mailing list
Acme@ietf.org
https://www.ietf.org/mailman/listinfo/acme