Hi all,

I don't normally participate in these mailing lists, and the last time I
did, I found the lack of discussion discouraging: what little discussion
did occur wasn't taken seriously and was laced with complacency. So, stating
up front, I don't have much hope that this message will be acted upon. That
said, multiple people have strongly encouraged _someone_ to write to the
mailing list and bring the concerns of multiple ACME client developers to
your attention.

I speak for myself, but my views have been formed from a combination of
personal experience developing ACME clients and discussion with other ACME
client developers. So when I say "we" I do so loosely; sometimes it might
just be me.

First, I want to say: overall we like the idea of proactive ACME clients
being able to know whether a certificate needs to be replaced sooner than
expected, and we're glad to see an attempt at a solution drafted for
standardization. But some of us do not think (current draft) ARI is The Way.

Now that several ACME client authors have had the opportunity to implement
the spec, we've noticed some issues, both with fundamental flaws in the
concept of ARI and some in implementation. Initially these concerns were
raised at the Let's Encrypt forums:

- https://community.letsencrypt.org/t/can-ari-conforming-clients-be-granted-exemptions-to-relevant-rate-limits/195600?u=mholt
- https://community.letsencrypt.org/t/thoughts-from-starting-to-play-with-ari/200276?u=mholt
- https://community.letsencrypt.org/t/ari-rate-limits/198720?u=mholt
- https://community.letsencrypt.org/t/ari-retry-after-header/195471?u=mholt

And the overwhelming response seems to be, "Meh, take it to the mailing
list." (Except for one response by LE staff about rate limits, which was
appreciated, at least.) So here we are.

Cutting to the chase:

With respect to ARI, ACME servers and clients have conflicts of interest.
The ACME client's goal is to keep the site up (with renewed and unrevoked
certificates); the optimal way to do this is to start renewing early and
retry often. The ACME server's goal is to keep the service up; the optimal
way to do this is to suppress clients that overload your capacity.
Obviously, these two goals are in opposition to each other. Proactive
clients can spike demand, which can cause service interruptions. But
service interruptions make clients more paranoid, so they retry even more
often until it works, and so on. ARI narrows the timeframe in which a conforming
client can retry failed renewals, which reduces reliability more as time
goes on. Without ARI, this window is a reasonable ~60 days. With ARI,
however, the window is reduced to just a few minutes, hours, or days. The
less time until expiration, the less hope there is to renew the cert in
time. As the draft currently stands, this is in the server's interest, but
not the client's.

I can tell you, with the current draft, my ACME clients will use ARI as a
signal to immediately try renewing a certificate, not for scheduling a
renewal in the future.

Here's why.

The ACME client's goal is to keep the site up (with renewed and unrevoked
certificates). If everything always worked, we'd simply renew after about
99% of the certificate's lifetime.

But obviously, that's not reality. In the presence of failures/uncertainty,
the optimal way to maximize uptime is to start renewing early and retry
often. In fact, just constantly be renewing. This offers the maximum
possible chances to successfully get a certificate.

But obviously, that's not reality. CAs rightly enforce rate limits, and
service uptime is actually Pretty Good most of the time, so we can reduce
network traffic, load on the CA, and pressure on CT infrastructure by
waiting until about 2/3 into a certificate's lifespan before trying to
renew. (With Let's Encrypt certificates this gives 30 days of runway.) This
is a fair balance and works well in practice.
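
A minimal Python sketch of that 2/3 rule. The function name
`default_renewal_time` is made up for illustration, and the 90-day lifetime
is just the Let's Encrypt case mentioned above:

```python
from datetime import datetime, timedelta, timezone

def default_renewal_time(not_before: datetime, not_after: datetime,
                         fraction: float = 2 / 3) -> datetime:
    """Return the moment `fraction` of the way through the validity period."""
    lifetime = not_after - not_before
    return not_before + lifetime * fraction

# Example: a 90-day certificate issued on 2024-01-01 (assumed dates).
not_before = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)
renew_at = default_renewal_time(not_before, not_after)
runway = not_after - renew_at  # time left for retries before expiry
```

With these inputs the renewal point lands 60 days in, leaving 30 days of
runway for retries.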

But unfortunately, reality's not that simple. There are two off-nominal
events that are often mentioned as the motivation for ARI:

1) Revocation
2) Traffic smoothing around expected maintenance or heavy load

Both of these can interfere with our happy little status-quo. Revocation
means we need to replace the certificate sooner than expected, and
maintenance or congestion means we may need to renew the certificate later
than expected.

Enter ARI. ARI is the CA saying, "We suggest -- but do not require -- this
specific timeframe within which to renew your certificate."

There are some problems with this:

1) It is optional. No one will implement this. OK, some clients will -- but
I can say with authority from years of experience that optional
restrictions are not typically favored. Very little mainstream software
follows best practices to a tee.

2) A narrower renewal timeframe makes clients less reliable. In theory it
should make them *more* reliable since it smooths out traffic, thus
improving CA availability. But this assumes that most clients actually
implement and follow ARI. Since it's optional, I don't see that happening.
Especially since most ACME clients are still running as static cron jobs
like it's 2015...

I'm sure ARI doesn't really change anything in the nominal case, which is
99.9..9% of the time. In fact, Let's Encrypt's ARI seems to correspond with when my
clients attempt renewals on their own anyway. (So in that sense, ARI is
actually useless 99.9..9% of the time?)

But when a renewal window does change, what does that mean? Well, something
is wrong. Either the certificate is being revoked, or the CA anticipates
downtime or availability issues.

Uh oh. That's bad news for a good little client which is trying its best to
keep its sites (potentially tens of thousands of them) online.

If we wait until the (adjusted) window to start renewing, we run ourselves
closer to the imminently-impending revocation or the expiration of the
certificate, lowering our chances of a successful renewal. If this is a
mass or CA-wide event, other clients have surely noticed too. Best to renew
ASAP and give ourselves more chances for success. Worst-case scenario,
we'll retry all the way into the designated window in which we expect to be
able to get a certificate anyway. And we might have to do this for 10s of
thousands of certificates.
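
The reasoning above amounts to a simple decision rule. Here is a minimal
Python sketch of it, assuming the client remembers the last suggested window
it saw for each certificate. The `Window` type, the `should_renew_now` name,
and the one-hour tolerance are all hypothetical and not taken from the draft:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class Window:
    """A suggested renewal window as last reported by the CA."""
    start: datetime
    end: datetime

def should_renew_now(now: datetime,
                     own_renewal_time: datetime,
                     previous: Optional[Window],
                     current: Window,
                     tolerance: timedelta = timedelta(hours=1)) -> bool:
    # Nominal case: the client's own schedule says it is time.
    if now >= own_renewal_time:
        return True
    # First observation: nothing to compare against, so nothing has changed.
    if previous is None:
        return False
    # The window moved (earlier or later): treat it as a warning that
    # something is wrong and start renewing right away instead of waiting.
    return (abs(current.start - previous.start) > tolerance
            or abs(current.end - previous.end) > tolerance)
```

The point of the sketch is that the window is only ever compared, never
waited for: a change in either direction triggers an immediate attempt.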

Because ARI is optional, it only acts as an early warning for clients that
wish for an advantage over other clients with the same goal when resources
are scarce. In these conditions, it's first-come, first-served and clients
compete to preserve uptime for all their sites. (I think clients can still
do this respectfully with backoff and jitter.)
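
One way to retry "respectfully" is capped exponential backoff with full
jitter. The sketch below is only an illustration; the 60-second base, 6-hour
cap, and the `backoff_delays` name are assumptions, not values from any ACME
client or spec:

```python
import random
from typing import Iterator, Optional

def backoff_delays(base: float = 60.0, cap: float = 6 * 3600.0,
                   attempts: int = 8,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay in seconds per retry attempt."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick uniformly in [0, ceiling] so that many clients
        # failing at the same moment do not all retry at the same moment.
        yield rng.uniform(0, ceiling)
```

A client would sleep for each yielded delay between failed renewal attempts.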

Note that this behavior is still in compliance with the draft ARI spec,
which says:

    Conforming clients MUST attempt renewal at a time of their choosing
    based on the suggested renewal window.

It doesn't say the renewal MUST be attempted "within" the window, just
"based on" the window. (A minor language change to the spec, by the way,
will not change client behaviors. I think we need to take a different
approach to ARI; read on.)

Anyway, a few more practical issues/questions:

1) Many CAs enforce rate limits. If clients are to honor ARI windows, we
would need a guarantee that the first successful cert within the ARI window
will be allowed regardless of relevant rate limits. Because ARI restricts a
client's ability to spread out renewals when managing certificates in bulk
with respect to rate limits, the rate limits must NOT be a blocker when
honoring ARI.

2) If ARI were actually enforced, some concerns would be resolved... for
example, we can have assurances that other ACME clients are doing the same,
thus improving CA availability. It would essentially be the CA scheduling
each individual certificate for each ACME client instance -- that's quite a
powerful idea, as long as availability is guaranteed (which it's not).

3) ARI does not scale well. Some ACME clients manage 10K+ certificates, and
in that case the client would have to check the ARI for at least 14
certificates per hour to get through them in a month. Deferring to the
Retry-After header may result in insufficient throughput. The current
expectation or convention is to check every certificate every 6-12 hours,
or tens of thousands of checks per day. One endpoint per certificate
multiple times per day is quite saturating. This is a considerable burden
for both ACME clients and servers. I would like to explore options that do
not involve 2+ HTTP requests per certificate.
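
A back-of-the-envelope check of the polling load, as a Python sketch. The
`ari_load` name and the 30-day month are assumptions made for the example:

```python
def ari_load(certs: int, days_to_cover: int = 30,
             check_interval_hours: float = 6.0):
    """Return (minimum checks per hour, checks per day) for a fleet of certs."""
    hours = days_to_cover * 24
    # Floor: checks per hour needed to touch every certificate once
    # within the covered period.
    min_per_hour = certs / hours
    # Load if every certificate is polled once per check interval.
    checks_per_day = certs * (24 / check_interval_hours)
    return min_per_hour, checks_per_day

floor_per_hour, daily_at_6h = ari_load(10_000)
_, daily_at_12h = ari_load(10_000, check_interval_hours=12.0)
```

For exactly 10,000 certificates this gives a floor of about 14 checks per
hour to touch each certificate once in 30 days, and 20,000 to 40,000 checks
per day at a 6-12 hour polling interval. Both figures grow linearly with the
number of certificates.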

4) Crafting the URL is convoluted. As Peter Cooper described it, "The core
issue is that the URL you need to construct is based on an OCSP structure
identifying the certificate, which requires taking one's existing
certificate and parsing out the serial number and issuer, and also taking
the intermediate certificate that signed it and getting its public key too.
So rather than just, like, using the fingerprint of the existing leaf or
something similarly simple that a lot of tooling can already give you, one
needs to really dig into both the leaf, and the intermediate, and hash
various pieces thereof, and then take all that to build a new ASN.1
structure." Why are we striving for near-parity with an OCSP request?? This
should be orthogonal to OCSP, right?
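
For contrast, the "fingerprint of the existing leaf" approach mentioned in
that quote needs nothing beyond a hash of the certificate's DER bytes. A
minimal Python sketch follows. Note that no fingerprint-keyed endpoint exists
in the draft; `renewal_info_url` and the URL layout are invented purely for
illustration:

```python
import base64
import hashlib

def leaf_fingerprint(der_cert: bytes) -> str:
    """Unpadded base64url SHA-256 fingerprint of a DER-encoded certificate."""
    digest = hashlib.sha256(der_cert).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")

def renewal_info_url(base_url: str, der_cert: bytes) -> str:
    # Hypothetical URL scheme: <base>/<fingerprint>. Only the leaf is needed;
    # there is no issuer lookup and no ASN.1 structure to assemble.
    return base_url.rstrip("/") + "/" + leaf_fingerprint(der_cert)
```

Everything here is standard-library hashing and encoding, which is the kind
of operation most existing tooling already exposes.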

5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET
request is not authenticated. Even if the information is not strictly
sensitive, I can totally see some browsers or tools using ARI as a signal
that a certificate is being revoked, and thus can no longer be trusted, and
thus block a site before a server even sees that it needs to renew its
cert. I could be wrong, but can't the information needed to obtain ARI be
scraped from CT logs? If so, I think a global ARI monitor/database is
inevitable, and that has interesting implications that I'm not sure have
been fully considered.

All in all, the current ARI spec feels a little rushed. I'm hoping Let's
Encrypt's production deployment is meant to help gather feedback about ARI
before finalizing it, rather than to solidify it. Can we revisit both its
fundamentals and its practical implications?

I would like to explore some alternatives to the current draft. I can think
of two approaches that might address these concerns:

A) Instead of a totally separate flow to obtain ARI, simply utilize a
Retry-After header in the flow of existing ACME responses. Upon finalizing
an order, the ACME server can respond with a Retry-After header which acts
as the current-draft Retry-After header for ARI responses. The client then
attempts renewal at/after the Retry-After time, but with the OCSP CertID
added to the NewOrder object; this indicates to the ACME server that the
client is asking if now is a good time to renew the certificate indicated
by the CertID. If it's not a good time, the ACME server can reply as such,
with another Retry-After, and the client then waits and repeats, until the
server actually issues the certificate. If the client needs the certificate
immediately, simply omit the CertID from the NewOrder and the normal,
"non-ARI" flow is assumed. This is backwards-compatible and requires no
additional infrastructure or endpoints.
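
A rough Python sketch of the client side of approach A. The `new_order`
callable stands in for a hypothetical transport that sends a NewOrder
(optionally carrying the CertID) and returns whether a certificate was issued
plus any Retry-After header. The `max_rounds` limit and the final fallback
are assumptions added for the example. Only the Retry-After parsing follows
an existing standard (delay-seconds or HTTP-date):

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Optional, Tuple

def parse_retry_after(value: str, now: datetime) -> datetime:
    """Retry-After is either a number of seconds or an HTTP-date."""
    value = value.strip()
    if value.isdigit():
        return now + timedelta(seconds=int(value))
    return parsedate_to_datetime(value)

# Hypothetical transport: takes a CertID (or None) and returns
# (issued, retry_after_header_or_None).
NewOrder = Callable[[Optional[str]], Tuple[bool, Optional[str]]]

def renew_with_hint(new_order: NewOrder, cert_id: str,
                    clock: Callable[[], datetime],
                    sleep_until: Callable[[datetime], None],
                    max_rounds: int = 10) -> bool:
    for _ in range(max_rounds):
        # Including the CertID asks: "is now a good time to replace this?"
        issued, retry_after = new_order(cert_id)
        if issued:
            return True
        if retry_after is None:
            break
        sleep_until(parse_retry_after(retry_after, clock()))
    # Assumed fallback: omit the CertID to request the normal, non-ARI flow.
    issued, _ = new_order(None)
    return issued
```

Passing `clock` and `sleep_until` in as parameters keeps the loop easy to
exercise without real waiting.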

B) If we do need a separate flow for some reason, I would like to see a
single endpoint containing a static JSON resource that describes all the
active certificates that need early renewal, rather than one
tediously-crafted URL per certificate. Certificates can be described by
their NotBefore or NotAfter dates, serial numbers, or other relevant
attributes. For example, if just a few certs with certain serials were
misissued, those serials could be enumerated at this endpoint. Or if a mass
revocation is happening, the timeframe of NotBefore dates could be listed,
and ACME clients can simply check against the certs they manage with those
dates, and replace them. You can represent millions of certificates in,
like, 85 bytes this way. And it's way less work for clients and servers.
And lastly, drop the "window" idea -- certificates described by this
endpoint should be renewed ASAP: try to renew immediately, then back off
and retry, for reasons described above (once we know the future is
uncertain and/or revocation is imminent, current certs can't be trusted
and/or clients must try to preserve their sites' uptime).
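
To make approach B concrete, here is a Python sketch of how a client might
match its certificates against such a resource. The JSON field names
(`serials`, `notBeforeRanges`, `from`, `to`) and the shape of the `cert`
dictionaries are entirely hypothetical; nothing like this is defined
anywhere yet:

```python
import json
from datetime import datetime

def parse_time(value: str) -> datetime:
    # Accept a trailing "Z" as UTC.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def needs_renewal(cert: dict, notice: dict) -> bool:
    """True if the cert is listed by serial or falls in a NotBefore range."""
    serials = {s.lower() for s in notice.get("serials", [])}
    if cert["serial"].lower() in serials:
        return True
    issued = parse_time(cert["notBefore"])
    return any(parse_time(r["from"]) <= issued <= parse_time(r["to"])
               for r in notice.get("notBeforeRanges", []))

# Example notice covering every certificate issued in a two-day span.
notice = json.loads(
    '{"notBeforeRanges":[{"from":"2024-01-10T00:00:00Z",'
    '"to":"2024-01-12T00:00:00Z"}]}'
)
managed = [
    {"serial": "03A1", "notBefore": "2024-01-11T08:30:00Z"},
    {"serial": "04B2", "notBefore": "2024-02-01T00:00:00Z"},
]
to_renew = [c["serial"] for c in managed if needs_renewal(c, notice)]
```

One fetch of the resource covers every certificate the client manages, and
the matching is a local loop with no per-certificate request.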

And finally, I want to bring attention to the longer-term prospects for
ARI: it's quite possible that ARI will become irrelevant before it is
widely adopted by most clients. This itself may discourage adoption. As
stated above, ARI has two primary use cases: revocation and traffic
smoothing. As we push for shorter certificate lifetimes, revocation should
become irrelevant. And traffic smoothing will perhaps become a natural
consequence as clients are renewing more frequently anyway. We all know
revocation and long-lived certificates are broken, so I'd rather WebPKI
developers focus our energy on the ACTUAL goal: short-lived certificates.
We should not be focusing our ecosystem resources on infrastructure that
acts as a band-aid for a broken leg.

That said, I'm not opposed to the general idea of a renewal hint for
clients in the meantime as long as it's simple, makes fundamental sense,
and is actually effective. I think the issues described above are mostly
solvable and now hopefully we can get there from here.
_______________________________________________
Acme mailing list
Acme@ietf.org
https://www.ietf.org/mailman/listinfo/acme
