Thanks for this.  I found this feedback interesting and useful, and these 
considerations are definitely something we need to keep in mind as we try to 
get ACME adopted more widely.

These sorts of real-world considerations often get lost in the discussion of 
technical standards, which is unfortunate, because they are often the exact 
same considerations that prevent the technical standards from getting adopted 
as widely as they could be.

So perhaps you can be pleasantly surprised that your efforts are not in vain 😊  
We are here and we are listening.

-Tim

From: Acme <acme-boun...@ietf.org> On Behalf Of Matthew Holt
Sent: Friday, June 23, 2023 12:21 PM
To: acme@ietf.org
Subject: [Acme] Practical concerns of draft-ietf-acme-ari

Hi all,

I don't normally participate in these mailing lists, and the last time I did, I 
felt the lack of discussion was discouraging: what little discussion did 
occur wasn't taken seriously and was laced with complacency. Just stating up 
front that I don't have much hope for this message to be acted upon. That said, 
multiple people have strongly encouraged _someone_ to write to the mailing list 
and bring the concerns of multiple ACME client developers to your attention.

I speak for myself, but my views have been formed from a combination of 
personal experience developing ACME clients and discussion with other ACME 
client developers. So when I say "we" I do so loosely; sometimes it might just 
be me.

First, I want to say: overall we like the idea of proactive ACME clients being 
able to know whether a certificate needs to be replaced sooner than expected, 
and we're glad to see an attempt at a solution drafted for standardization. But 
some of us do not think (current draft) ARI is The Way.

Now that several ACME client authors have had the opportunity to implement the 
spec, we've noticed some issues, both with fundamental flaws in the concept of 
ARI and some in implementation. Initially these concerns were raised at the 
Let's Encrypt forums:

- https://community.letsencrypt.org/t/can-ari-conforming-clients-be-granted-exemptions-to-relevant-rate-limits/195600?u=mholt
- https://community.letsencrypt.org/t/thoughts-from-starting-to-play-with-ari/200276?u=mholt
- https://community.letsencrypt.org/t/ari-rate-limits/198720?u=mholt
- https://community.letsencrypt.org/t/ari-retry-after-header/195471?u=mholt

And the overwhelming response seems to be, "Meh, take it to the mailing list." 
(Except for one response by LE staff about rate limits, which was appreciated, 
at least.) So here we are.

Cutting to the chase:

With respect to ARI, ACME servers and clients have conflicts of interest. The 
ACME client's goal is to keep the site up (with renewed and unrevoked 
certificates); the optimal way to do this is to start renewing early and retry 
often. The ACME server's goal is to keep the service up; the optimal way to do 
this is to suppress clients that overload your capacity. Obviously, these two 
goals are in opposition to each other. Proactive clients can spike demand, 
which can cause service interruptions. But service interruptions make clients 
more paranoid, so they retry even more often until it works, and so on. ARI narrows 
the timeframe in which a conforming client can retry failed renewals, which 
reduces reliability more as time goes on. Without ARI, this window is a 
reasonable ~60 days. With ARI, however, the window is reduced to just a few 
minutes, hours, or days. The less time until expiration, the less hope there is 
to renew the cert in time. As the draft currently stands, this is in the 
server's interest, but not the client's.

I can tell you, with the current draft, my ACME clients will use ARI as a 
signal to immediately try renewing a certificate, not for scheduling a renewal 
in the future.

Here's why.

The ACME client's goal is to keep the site up (with renewed and unrevoked 
certificates). If everything always worked, we'd simply renew after about 99% 
of the certificate's lifetime.

But obviously, that's not reality. In the presence of failures/uncertainty, the 
optimal way to maximize uptime is to start renewing early and retry often. In 
fact, just constantly be renewing. This offers the maximum possible chances to 
successfully get a certificate.

But obviously, that's not reality. CAs rightly enforce rate limits, and service 
uptime is actually Pretty Good most of the time, so we can reduce network 
traffic, load on the CA, and pressure on CT infrastructure by waiting until 
about 2/3 into a certificate's lifespan before trying to renew. (With Let's 
Encrypt certificates this gives 30 days of runway.) This is a fair balance and 
works well in practice.
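
As an illustration of the 2/3 rule described above, here is a minimal Python 
sketch. The function name and the 90-day lifetime are assumptions made for the 
example, not taken from any particular client.

```python
from datetime import datetime, timedelta, timezone

def default_renewal_time(not_before, not_after, fraction=2 / 3):
    """Return the instant at which `fraction` of the lifetime has elapsed."""
    lifetime = not_after - not_before
    return not_before + lifetime * fraction

# Assumed example values: a certificate valid for 90 days.
not_before = datetime(2023, 6, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)

renew_at = default_renewal_time(not_before, not_after)
runway = not_after - renew_at  # time left to retry before expiration
print(renew_at.isoformat(), runway)
```

With a 90-day lifetime, renewal starts around day 60, which leaves about 30 
days in which to retry failed attempts.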

But unfortunately, reality's not that simple. There are two off-nominal events 
that are often mentioned as the motivation for ARI:

1) Revocation
2) Traffic smoothing around expected maintenance or heavy load

Both of these can interfere with our happy little status-quo. Revocation means 
we need to replace the certificate sooner than expected, and maintenance or 
congestion means we may need to renew the certificate later than expected.

Enter ARI. ARI is the CA saying, "We suggest -- but do not require -- this 
specific timeframe within which to renew your certificate."

There are some problems with this:

1) It is optional. No one will implement this. OK, some clients will -- but I 
can say with authority from years of experience that optional restrictions are 
not typically favored. Very little mainstream software follows best practices to 
a tee.

2) A narrower renewal timeframe makes clients less reliable. In theory it 
should make them *more* reliable since it smooths out traffic, thus improving 
CA availability. But this assumes that most clients actually implement and 
follow ARI. Since it's optional, I don't see that happening. Especially since 
most ACME clients are still running as static cron jobs like it's 2015...

I'm sure ARI doesn't really change in the nominal case, which is 99.9..9% of 
the time. In fact, Let's Encrypt's ARI seems to correspond with when my clients 
attempt renewals on their own anyway. (So in that sense, ARI is actually 
useless 99.9..9% of the time?)

But when a renewal window does change, what does that mean? Well, something is 
wrong. Either the certificate is being revoked, or the CA anticipates downtime 
or availability issues.

Uh oh. That's bad news for a good little client which is trying its best to 
keep its sites (potentially tens of thousands of them) online.

If we wait until the (adjusted) window to start renewing, we run ourselves 
closer to the imminently-impending revocation or the expiration of the 
certificate, lowering our chances of a successful renewal. If this is a mass or 
CA-wide event, other clients have surely noticed too. Best to renew ASAP and 
give ourselves more chances for success. Worst-case scenario, we'll retry all 
the way into the designated window in which we expect to be able to get a 
certificate anyway. And we might have to do this for tens of thousands of 
certificates.

Because ARI is optional, it only acts as an early warning for clients that wish 
for an advantage over other clients with the same goal when resources are 
scarce. In these conditions, it's first-come, first-served and clients compete to 
preserve uptime for all their sites. (I think clients can still do this 
respectfully with backoff and jitter.)
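
One common way to retry "respectfully" is capped exponential backoff with 
full jitter. The sketch below is only an illustration; the base delay, cap, 
and attempt count are assumed values, not recommendations from the draft or 
from any CA.

```python
import random

def backoff_delays(base=60.0, cap=6 * 3600.0, attempts=8, rng=random):
    """Yield retry delays in seconds: exponential growth, capped, with jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so that many clients
        # reacting to the same event do not retry in lockstep.
        yield rng.uniform(0, ceiling)

# Seeded generator so the example run is repeatable.
delays = list(backoff_delays(rng=random.Random(1)))
print([round(d) for d in delays])
```

The jitter spreads retries from many clients over time, and the cap keeps a 
long outage from pushing the next attempt arbitrarily far into the future.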

Note that this behavior is still in compliance with the draft ARI spec, which 
says:

    Conforming clients MUST attempt renewal at a time of their choosing
    based on the suggested renewal window.

It doesn't say the renewal MUST be attempted "within" the window, just "based 
on" the window. (A minor language change to the spec, by the way, will not 
change client behaviors. I think we need to take a different approach to ARI; 
read on.)

Anyway, a few more practical issues/questions:

1) Many CAs enforce rate limits. If clients are to honor ARI windows, we would 
need a guarantee that the first successful cert within the ARI window will be 
allowed regardless of relevant rate limits. Because ARI restricts a client's 
ability to spread out renewals when managing certificates in bulk with respect 
to rate limits, the rate limits must NOT be a blocker when honoring ARI.

2) If ARI were actually enforced, some concerns would be resolved... for 
example, we can have assurances that other ACME clients are doing the same, 
thus improving CA availability. It would essentially be the CA scheduling each 
individual certificate for each ACME client instance -- that's quite a powerful 
idea, as long as availability is guaranteed (which it's not).

3) ARI does not scale well. Some ACME clients manage 10K+ certificates, and in 
that case the client would have to check the ARI for at least 24 certificates 
per hour to get through them in a month. Deferring to the Retry-After header 
may result in insufficient throughput. The current expectation or convention is 
to check every certificate every 6-12 hours, or tens of thousands of checks per 
day. One endpoint per certificate multiple times per day is quite saturating. 
This is a considerable burden for both ACME clients and servers. I would like 
to explore options that do not involve 2+ HTTP requests per certificate.

4) Crafting the URL is convoluted. As Peter Cooper described it, "The core 
issue is that the URL you need to construct is based on an OCSP structure 
identifying the certificate, which requires taking one's existing certificate 
and parsing out the serial number and issuer, and also taking the intermediate 
certificate that signed it and getting its public key too. So rather than just, 
like, using the fingerprint of the existing leaf or something similarly simple 
that a lot of tooling can already give you, one needs to really dig into both 
the leaf, and the intermediate, and hash various pieces thereof, and then take 
all that to build a new ASN.1 structure." Why are we striving for near-parity 
with an OCSP request?? This should be orthogonal to OCSP, right?

5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET request 
is not authenticated. Even if the information is not strictly sensitive, I can 
totally see some browsers or tools using ARI as a signal that a certificate is 
being revoked, and thus can no longer be trusted, and thus block a site before 
a server even sees that it needs to renew its cert. I could be incorrect, but 
can't the information needed to obtain ARI be scraped from CT logs? If so, 
I think a global ARI monitor/database is inevitable, and that has interesting 
implications that I'm not sure have been fully realized.

All in all, the current ARI spec feels a little rushed. I'm hoping Let's 
Encrypt's production deployment is meant to help gather feedback about ARI 
before finalizing it, rather than to solidify it. Can we revisit both its 
fundamentals and practical implications too?

I would like to explore some alternatives to the current draft. I can think of 
two approaches that might address these concerns:

A) Instead of a totally separate flow to obtain ARI, simply utilize a 
Retry-After header in the flow of existing ACME responses. Upon finalizing an 
order, the ACME server can respond with a Retry-After header which acts as the 
current-draft Retry-After header for ARI responses. The client then attempts 
renewal at/after the Retry-After time, but with the OCSP CertID added to the 
NewOrder object; this indicates to the ACME server that the client is asking if 
now is a good time to renew the certificate indicated by the CertID. If it's 
not a good time, the ACME server can reply as such, with another Retry-After, 
and the client then waits and repeats, until the server actually issues the 
certificate. If the client needs the certificate immediately, simply omit the 
CertID from the NewOrder and the normal, "non-ARI" flow is assumed. This is 
backwards-compatible and requires no additional infrastructure or endpoints.

B) If we do need a separate flow for some reason, I would like to see a single 
endpoint containing a static JSON resource that describes all the active 
certificates that need early renewal, rather than one tediously-crafted URL per 
certificate. Certificates can be described by their NotBefore or NotAfter 
dates, serial numbers, or other relevant attributes. For example, if just a few 
certs with certain serials were misissued, those serials could be enumerated at 
this endpoint. Or if a mass revocation is happening, the timeframe of NotBefore 
dates could be listed, and ACME clients can simply check against the certs they 
manage with those dates, and replace them. You can represent millions of 
certificates in, like, 85 bytes this way. And it's way less work for clients 
and servers. And lastly, drop the "window" idea -- certificates described by 
this endpoint should be renewed ASAP: try to renew immediately, then back off 
and retry, for reasons described above (once we know the future is uncertain 
and/or revocation is imminent, current certs can't be trusted and/or clients 
must try to preserve their sites' uptime).
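
As a rough illustration of option B, the sketch below checks a managed 
certificate against a static JSON notice. The JSON layout and field names 
("serials", "notBefore") are invented for this example; nothing like them is 
defined in the draft.

```python
import json
from datetime import datetime, timezone

# Hypothetical notice: a few serials plus one NotBefore range.
NOTICE = json.loads("""
{
  "serials": ["03a1f5"],
  "notBefore": [["2023-06-01T00:00:00+00:00", "2023-06-03T00:00:00+00:00"]]
}
""")

def needs_early_renewal(serial_hex, not_before, notice):
    """Return True if the certificate is named by serial or NotBefore range."""
    listed = {s.lower() for s in notice.get("serials", [])}
    if serial_hex.lower() in listed:
        return True
    for start, end in notice.get("notBefore", []):
        if datetime.fromisoformat(start) <= not_before <= datetime.fromisoformat(end):
            return True
    return False

in_range = datetime(2023, 6, 2, tzinfo=timezone.utc)
out_of_range = datetime(2023, 5, 1, tzinfo=timezone.utc)
print(needs_early_renewal("ffee01", in_range, NOTICE))      # matched by range
print(needs_early_renewal("03A1F5", out_of_range, NOTICE))  # matched by serial
print(needs_early_renewal("ffee01", out_of_range, NOTICE))  # not listed
```

A client would fetch the notice once per polling interval and run this check 
locally for every certificate it manages, instead of issuing one request per 
certificate.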

And finally, I want to bring attention to the longer-term prospects for ARI: 
it's quite possible that ARI will become irrelevant before it is widely adopted 
by most clients. This itself may discourage adoption. As stated above, ARI has 
two primary use cases: revocation and traffic smoothing. As we push for shorter 
certificate lifetimes, revocation should become irrelevant. And traffic 
smoothing will perhaps become a natural consequence as clients are renewing 
more frequently anyway. We all know revocation and long-lived certificates are 
broken, so I'd rather WebPKI developers focus our energy on the ACTUAL goal: 
short-lived certificates. We should not be focusing our ecosystem resources on 
infrastructure that acts as a band-aid for a broken leg.

That said, I'm not opposed to the general idea of a renewal hint for clients in 
the meantime as long as it's simple, makes fundamental sense, and is actually 
effective. I think the issues described above are mostly solvable and now 
hopefully we can get there from here.

_______________________________________________
Acme mailing list
Acme@ietf.org
https://www.ietf.org/mailman/listinfo/acme