Re: [Acme] Practical concerns of draft-ietf-acme-ari

Aaron Gable Wed, 19 Jul 2023 15:06:18 -0700

Hi Matt,

I agree with Tim: receiving practical feedback from implementers of the
draft standard is very useful. I'll put my thoughts, comments, and
questions in-line.

On Fri, Jun 23, 2023 at 9:21 AM Matthew Holt <m...@dyanim.com> wrote:

>
> With respect to ARI, ACME servers and clients have conflicts of interest.
> The ACME client's goal is to keep the site up (with renewed and unrevoked
> certificates); the optimal way to do this is to start renewing early and
> retry often. The ACME server's goal is to keep the service up; the optimal
> way to do this is to suppress clients that overload your capacity.
> Obviously, these two goals are in opposition with each other. Proactive
> clients can spike demand, which can cause service interruptions. But
> service interruptions make clients more paranoid to retry even more often
> until it works, and so on. ARI narrows the timeframe in which a conforming
> client can retry failed renewals, which reduces reliability more as time
> goes on. Without ARI, this window is a reasonable ~60 days. With ARI,
> however, the window is reduced to just a few minutes, hours, or days. The
> less time until expiration, the less hope there is to renew the cert in
> time. As the draft currently stands, this is in the server's interest, but
> not the client's.
>

I'm confused by the statement that "with ARI the window is reduced to just
a few minutes, hours, or days". The draft spec clearly states that the
client should renew during the window if it can, but that any time after
the window is also acceptable: "if the selected time is in the past,
attempt renewal immediately". The renewal window only becomes reduced to a
few minutes, hours, or days if the ACME server shifts the suggested renewal
window that far. Which, sure, is possible, but is clearly against the
server's best interest as well: if the ACME server can't provide continuity
of business to their Subscribers, then their Subscribers will go elsewhere
for certificates.
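
The selection rule quoted above can be sketched in a few lines of Python.
This is only an illustration: the function name and parameters are
hypothetical, and it assumes the client has already parsed the suggested
window into timezone-aware datetimes.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(window_start, window_end, now, rng=random):
    """Pick a uniformly random time in the suggested window.

    If the picked time is already in the past, return `now`, meaning
    "attempt renewal immediately".
    """
    span_seconds = (window_end - window_start).total_seconds()
    selected = window_start + timedelta(seconds=rng.uniform(0, span_seconds))
    return max(selected, now)


now = datetime(2023, 7, 19, tzinfo=timezone.utc)

# A window that has already passed does not forbid renewal; it means "now".
past = pick_renewal_time(now - timedelta(days=3), now - timedelta(days=1), now)

# A window in the future yields some time inside that window.
future = pick_renewal_time(now + timedelta(days=1), now + timedelta(days=3), now)
```

Note that nothing here stops a client from renewing after the window has
closed; the window only influences when the first attempt is made.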

Can this be improved? Absolutely, I'm certain of it. I'd love to hear
suggestions for ways that the server could suggest a renewal time without
running up against this tension between smoothing traffic and making
clients nervous. Unfortunately, I don't believe either of the suggestions
at the bottom of the message actually addresses this point (more on that
below).

> 1) It is optional. No one will implement this. OK, some clients will --
> but I can say with authority from years of experience that optional
> restrictions are not typically favored. Very little mainstream software
> follow best practices to a tee.
>

Yep, optional features are difficult to incentivize. I think there's one
obvious carrot to incentivize client adoption: "if you implement ARI, your
certs will be renewed *before* they're revoked in the next mass revocation
incident". Continuity of business can be a powerful motivator. Frankly,
Let's Encrypt is even considering bigger carrots, such as "your subscriber
account can only get short-lived certs if we've seen it request ARI
endpoints", or "your renewal requests bypass all rate limits if they're
made within the ARI suggested window". We don't know if we'll dangle either
of those carrots, but it's clear that there are ways to incentivize
adoption.

> 2) A narrower renewal timeframe makes clients less reliable. In theory it
> should make them *more* reliable since it smooths out traffic, thus
> improving CA availability. But this assumes that most clients actually
> implement and follow ARI. Since it's optional, I don't see that happening.
> Especially since most ACME clients are still running as static cron jobs
> like it's 2015...
>
> I'm sure ARI doesn't really change in the nominal case, which is 99.9..9%
> of the time. In fact, Let's Encrypt's ARI seems to correspond with when my
> clients attempt renewals on their own anyway. (So in that sense, ARI is
> actually useless 99.9..9% of the time?)
>
> But when a renewal window does change, what does that mean? Well,
> something is wrong. Either the certificate is being revoked, or the CA
> anticipates downtime or availability issues.
>

This is not true. By the spec, a change in the renewal window explicitly
means nothing. The situations you list are the motivations for writing the
spec in the first place, but they are not the only motivations for changing
the window in any given case. In fact, Let's Encrypt is currently
considering adding random jitter to the renewal window every time it is
requested, specifically to prevent interpretations like this, and to
naturally even out renewal spikes through Brownian motion.
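
As a rough sketch of what per-request jitter could look like on the server
side (this is not Let's Encrypt's implementation; the function name and the
`max_jitter` parameter are assumptions made for illustration):

```python
import random
from datetime import datetime, timedelta, timezone


def jittered_window(start, end, max_jitter, rng=random):
    """Shift a suggested window by a fresh random offset.

    The offset is drawn uniformly from [-max_jitter, +max_jitter], so the
    window is as likely to move earlier as later, and its length is kept.
    """
    bound = max_jitter.total_seconds()
    offset = timedelta(seconds=rng.uniform(-bound, bound))
    return start + offset, end + offset


start = datetime(2023, 8, 1, tzinfo=timezone.utc)
end = start + timedelta(days=2)

# Two requests for the same certificate will usually see different windows.
first = jittered_window(start, end, timedelta(hours=6))
second = jittered_window(start, end, timedelta(hours=6))
```

With a scheme like this, a client that sees the window move between two
polls cannot conclude anything about revocation or CA availability.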

> If we wait until the (adjusted) window to start renewing, we run ourselves
> closer to the imminently-impending revocation or the expiration of the
> certificate, lowering our chances of a successful renewal.
>

This assumes that the adjusted window will always be later in the lifetime
of the certificate than before. There is no reason to make this assumption.
A CA adjusting suggested windows in order to smooth out a load spike would
be wise to shift 50% of renewal windows *earlier*. Waiting to renew until a
time that is earlier than when you would have renewed anyway does not make
things riskier.

> 1) Many CAs enforce rate limits. If clients are to honor ARI windows, we
> would need a guarantee that the first successful cert within the ARI window
> will be allowed regardless of relevant rate limits. Because ARI restricts a
> client's ability to spread out renewals when managing certificates in bulk
> with respect to rate limits, the rate limits must NOT be a blocker when
> honoring ARI.
>

I like this idea. We hope and plan to implement this regardless, as I
suggested above when describing it as a carrot that we can dangle to
incentivize client adoption. However, I don't believe it is something that
can be reasonably specified in an IETF RFC: rate limits are not part of the
ACME protocol, they're an internal detail of ACME server implementations.
Happy to be proven wrong.

> 2) If ARI were actually enforced, some concerns would be resolved... for
> example, we can have assurances that other ACME clients are doing the same,
> thus improving CA availability. It would essentially be the CA scheduling
> each individual certificate for each ACME client instance -- that's quite a
> powerful idea, as long as availability is guaranteed (which it's not).
>

What do you mean by "enforced"? Deny newOrder requests that appear to be
renewals but fall outside the suggested window?

> 3) ARI does not scale well. Some ACME clients manage 10K+ certificates,
> and in that case the client would have to check the ARI for at least 24
> certificates per hour to get through them in a month. Deferring to the
> Retry-After header may result in insufficient throughput. The current
> expectation or convention is to check every certificate every 6-12 hours,
> or tens of thousands of checks per day. One endpoint per certificate
> multiple times per day is quite saturating. This is a considerable burden
> for both ACME clients and servers. I would like to explore options that do
> not involve 2+ HTTP requests per certificate.
>

Totally agreed, we don't love the heavy-polling nature of ARI as it stands
either. It's a lot of requests, and that's a large part of why we've
striven to keep the response size so small. The original version of this
was just a single timestamp. It's grown to two timestamps and an optional
URL thanks to community feedback, but I'd be happy to reduce the response
size again if we decide that prioritizing efficiency is more important than
prioritizing third-party certificate monitoring tools.

Unfortunately, I don't currently have a different approach that I love. The
24-hour revocation timeline enforced by the BRs for certain kinds of
revocations means that clients should be checking at least once every 24
hours, regardless of mechanism. I'll comment more on your specific
proposals to address this below.
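
To put rough numbers on the polling load discussed here, a small
back-of-the-envelope calculation follows. The function name is hypothetical
and the figures simply reuse the 10K certificates and the 6, 12, and 24 hour
intervals mentioned in this thread.

```python
def checks_per_day(cert_count, interval_hours):
    """Number of renewalInfo polls per day if every certificate is
    checked once per `interval_hours`."""
    return cert_count * 24 / interval_hours


every_6h = checks_per_day(10_000, 6)    # 40000.0
every_12h = checks_per_day(10_000, 12)  # 20000.0
every_24h = checks_per_day(10_000, 24)  # 10000.0
```

Even at the once-per-24-hours floor, a client managing 10K certificates
makes on the order of ten thousand requests per day.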

> 4) Crafting the URL is convoluted. As Peter Cooper described it, "The core
> issue is that the URL you need to construct is based on an OCSP structure
> identifying the certificate, which requires taking one's existing
> certificate and parsing out the serial number and issuer, and also taking
> the intermediate certificate that signed it and getting its public key too.
> So rather than just, like, using the fingerprint of the existing leaf or
> something similarly simple that a lot of tooling can already give you, one
> needs to really dig into both the leaf, and the intermediate, and hash
> various pieces thereof, and then take all that to build a new ASN.1
> structure." Why are we striving for near-parity with an OCSP request?? This
> should be orthogonal to OCSP, right?
>

This is great feedback. We picked this request format specifically because
we thought it would be easy. It's good to know that we were wrong, and to
investigate what other request formats would work better.

Allow me to provide a little bit of context for how we arrived at using the
OCSP CertID structure:

We need a way to uniquely identify the certificate in question. ACME has
one mechanism for doing so already: the URL provided by a finalized Order.
Personally, my ideal would be to say "the ARI url is the Certificate URL
concatenated with /ari". Unfortunately we can't do that, because there's
nothing to prevent the URL provided by an Order from having query
parameters, in which case appending a new path component would be
incorrect. So, we could follow ACME's example, and provide a second
"renewalInfo" URL in finalized Orders as well. Unfortunately, this means
that a) clients have to persist this URL in order to use it, and b) clients
which did not persist the URL (either ephemeral clients, or third-party
certificate monitoring clients) cannot construct the URL at all.
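
The query-parameter problem with simply appending "/ari" can be shown with
a short Python sketch. The URLs and the function name are made up for
illustration.

```python
from urllib.parse import urlsplit


def naive_ari_url(certificate_url):
    """Naively append a path component to the Certificate URL."""
    return certificate_url + "/ari"


# Works when the Certificate URL has no query string...
plain = urlsplit(naive_ari_url("https://ca.example/acme/cert/abc123"))

# ...but with a query string, "/ari" lands inside the query value and the
# path is left unchanged.
with_query = urlsplit(naive_ari_url("https://ca.example/acme/cert?id=abc123"))
```

In the second case the server would see a request for the original path
with a mangled `id` parameter, not a request for a new resource.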

So we need a way to uniquely identify a certificate which can be
constructed from the certificate itself. The serial seems like an obvious
candidate. However, serials are only required to be unique on a per-issuer
basis, and a single ACME server may issue from multiple issuer
certificates. It turns out that OCSP already has a solution for this:
combine the serial with a unique identifier of the issuer. And OCSP's
solution even comes with algorithm agility for how the unique identifier of
the issuer is computed! That's nice. So we took OCSP's request format,
stripped away the pieces not pertaining to identifying a single
certificate, et voila, the CertID.
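
For readers who have not built one, here is a minimal Python sketch of
assembling a CertID and turning it into a URL path segment, using only the
standard library. It rests on several assumptions: the issuer's DER-encoded
name and raw public key bytes have already been extracted (which is the
fiddly part Peter describes), SHA-256 is the hash algorithm, the
AlgorithmIdentifier carries no parameters, and the path segment is the
unpadded base64url encoding of the DER bytes. All function names are
hypothetical.

```python
import base64
import hashlib

# DER encoding of the OID 2.16.840.1.101.3.4.2.1 (id-sha256).
SHA256_OID = bytes.fromhex("0609608648016503040201")


def der_length(n):
    """DER length octets: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    raw = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(raw)]) + raw


def der_tlv(tag, content):
    """Wrap content in a tag-length-value triple."""
    return bytes([tag]) + der_length(len(content)) + content


def der_integer(value):
    """Non-negative INTEGER, with a leading zero octet when the high bit
    would otherwise be set."""
    return der_tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def cert_id_path(issuer_name_der, issuer_key_bytes, serial):
    """CertID ::= SEQUENCE { hashAlgorithm, issuerNameHash,
    issuerKeyHash, serialNumber }, base64url-encoded without padding."""
    algorithm = der_tlv(0x30, SHA256_OID)
    name_hash = der_tlv(0x04, hashlib.sha256(issuer_name_der).digest())
    key_hash = der_tlv(0x04, hashlib.sha256(issuer_key_bytes).digest())
    cert_id = der_tlv(0x30, algorithm + name_hash + key_hash + der_integer(serial))
    return base64.urlsafe_b64encode(cert_id).rstrip(b"=").decode("ascii")


# Placeholder inputs; real values come from the leaf and issuer certificates.
path = cert_id_path(b"issuer name DER", b"issuer key bytes", 0x3EA3456865441F1C)
```

The encoding itself is short; the work that tooling rarely exposes is
obtaining `issuer_name_der` and `issuer_key_bytes` in the first place.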

We believed this would be easy because many ACME clients are written in
languages or running in environments that already have access to robust
OCSP libraries. I wrote the first version of this
<https://github.com/letsencrypt/boulder/blob/73b72e8fa2d852a40753926c34f38313a7db083d/wfe2/wfe_test.go#L3517-L3538>
(constructing an OCSP request, parsing it, extracting the relevant
parameters, and serializing them into a CertID) in a few minutes. Again,
it's useful to know that we were wrong.

This leads to the question: what should we use to uniquely identify the
certificate instead? Certainly we could go with the "fingerprint" or
"thumbprint" (a sha256 hash of DER bytes or PEM encoding, depending on who
you ask, of the certificate) if people think that is sufficiently simple,
easy to specify, unique, and future-proof. We could also go with "just the
Serial", and force existing ACME servers to choose between either keeping
serials unique across all issuers they represent, or splitting the server
into multiple servers which each represent just a single issuer. Or we
could return to the "url in the Order object" approach we started with. I'm
curious what path forward people think is best.
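
For comparison, the "fingerprint" option might look like the following
Python sketch. It assumes the identifier would be defined as the SHA-256
hash of the certificate's DER bytes, which is exactly the kind of detail
that would need to be pinned down; the function name is hypothetical.

```python
import hashlib
import ssl


def der_fingerprint(cert):
    """SHA-256 fingerprint over DER bytes.

    Accepts DER bytes or a PEM string; PEM is converted to DER first so
    that both encodings of one certificate give the same identifier.
    """
    if isinstance(cert, str):
        cert = ssl.PEM_cert_to_DER_cert(cert)
    return hashlib.sha256(cert).hexdigest()


# Placeholder bytes standing in for a real DER-encoded certificate.
der = bytes.fromhex("3003020101")
pem = ssl.DER_cert_to_PEM_cert(der)

same_id = der_fingerprint(der) == der_fingerprint(pem)
# Hashing the PEM text directly gives a different value, which is why the
# spec would have to say which encoding is hashed.
pem_text_hash = hashlib.sha256(pem.encode("ascii")).hexdigest()
```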

> 5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET
> request is not authenticated. Even if the information is not strictly
> sensitive, I can totally see some browsers or tools using ARI as a signal
> that a certificate is being revoked, and thus can no longer be trusted, and
> thus block a site before a server even sees that it needs to renew its
> cert. I could be incorrect, but can't the information needed to obtain ARI
> can be scraped from CT logs? If so, I think a global ARI monitor/database
> is inevitable, and that has interesting implications that I don't know have
> been fully realized.
>

Yes, as mentioned above, this was a design goal as a result of community
feedback. See this early discussion
<https://mailarchive.ietf.org/arch/msg/acme/szDHa5z6qRiAtmeC2ohrePPoBjU/>
for context. Again, this is a design goal that I'd be willing to compromise
if there are sufficient reasons to do so, but I don't think that argument
has been fully articulated as of yet.

> All in all, the current ARI spec feels a little rushed. I'm hoping Let's
> Encrypt's production deployment is meant to help gather feedback about ARI
> before finalizing it, rather than to solidify it. Can we revisit both its
> fundamentals and practical implications too?
>

Yes, the IETF process is about "rough consensus and running code". We can't
finalize the spec until something is running. Let's Encrypt's deployment,
and our encouragement of client adoption, is so that we can receive
precisely this kind of feedback before the draft becomes an RFC.

> I would like to explore some alternatives to the current draft. I can
> think of two approaches that might address these concerns:
>
> A) Instead of a totally separate flow to obtain ARI, simply utilize a
> Retry-After header in the flow of existing ACME responses. Upon finalizing
> an order, the ACME server can respond with a Retry-After header which acts
> as the current-draft Retry-After header for ARI responses. The client then
> attempts renewal at/after the Retry-After time, but with the OCSP CertID
> added to the NewOrder object; this indicates to the ACME server that the
> client is asking if now is a good time to renew the certificate indicated
> by the CertID. If it's not a good time, the ACME server can reply as such,
> with another Retry-After, and the client then waits and repeats, until the
> server actually issues the certificate. If the client needs the certificate
> immediately, simply omit the CertID from the NewOrder and the normal,
> "non-ARI" flow is assumed. This is backwards-compatible and requires no
> additional infrastructure or endpoints.
>

I don't understand how this approach helps solve the issues you identified
above. In order to get up-to-date information, the same number of requests
still needs to be made; it's just that now they're newOrder requests
instead of renewalInfo requests. The unique identifier included in the
request is no easier to construct. The Retry-After timestamp changing might
still cause selfish clients to stop providing the CertID and renew right
now.

Now, I *am* a fan of adding a field to newOrder requests which uniquely
identifies the cert being replaced. If such a field is populated, the CA
would treat it the same as if the client had made a POST request to mark
the certificate as replaced (Section 4.2 of the current draft). This has
many nice effects, like letting the CA track renewals explicitly (instead
of attempting to identify them with heuristics), letting renewal requests
bypass rate limits, and more. I just don't think it elegantly replaces the
renewalInfo endpoint itself.
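
A sketch of what such a newOrder payload might look like, in Python. The
"identifiers" structure follows RFC 8555; the "replaces" field name and the
function name are purely hypothetical, since no such field is specified at
this point.

```python
def new_order_payload(dns_names, replaces_cert_id=None):
    """Build a newOrder payload, optionally naming the certificate that
    this order is intended to replace."""
    payload = {
        "identifiers": [{"type": "dns", "value": name} for name in dns_names]
    }
    if replaces_cert_id is not None:
        # Hypothetical field: a unique identifier of the replaced cert.
        payload["replaces"] = replaces_cert_id
    return payload


fresh = new_order_payload(["example.com"])
renewal = new_order_payload(["example.com"], replaces_cert_id="<cert-identifier>")
```

A client that needs a certificate immediately would simply omit the field
and get the normal flow.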

> B) If we do need a separate flow for some reason, I would like to see a
> single endpoint containing a static JSON resource that describes all the
> active certificates that need early renewal, rather than one
> tediously-crafted URL per certificate. Certificates can be described by
> their NotBefore or NotAfter dates, serial numbers, or other relevant
> attributes. For example, if just a few certs with certain serials were
> misissued, those serials could be enumerated at this endpoint. Or if a mass
> revocation is happening, the timeframe of NotBefore dates could be listed,
> and ACME clients can simply check against the certs they manage with those
> dates, and replace them. You can represent millions of certificates in,
> like, 85 bytes this way. And it's way less work for clients and servers.
> And lastly, drop the "window" idea -- certificates described by this
> endpoint should be renewed ASAP: try to renew immediately, then back off
> and retry, for reasons described above (once we know the future is
> uncertain and/or revocation is imminent, current certs can't be trusted
> and/or clients must try to preserve their sites' uptime).
>

On the one hand, I'm in complete agreement: it would be great to have a
"batch" endpoint that returns suggested windows for all certificates
associated with a given account, or matching some other criteria. On the
other hand, there's a reason that Let's Encrypt diverges from RFC8555 and
does not implement the "orders" field on account objects: endpoints which
serve unboundedly-large documents and require paging are difficult to
implement correctly on both the server and client side, and can quickly
lead to disruptive database queries.

> And finally, I want to bring attention to the longer-term prospects for
> ARI: it's quite possible that ARI will become irrelevant before it is
> widely adopted by most clients. This itself may discourage adoption. As
> stated above, ARI has two primary use cases: revocation and traffic
> smoothing. As we push for shorter certificate lifetimes, revocation should
> become irrelevant. And traffic smoothing will perhaps become a natural
> consequence as clients are renewing more frequently anyway. We all know
> revocation and long-lived certificates are broken, so I'd rather WebPKI
> developers focus our energy on the ACTUAL goal: short-lived certificates.
> We should not be focusing our ecosystem resources on infrastructure that
> acts as a band-aid for a broken leg.
>

This is an interesting point. ARI was first conceived
<https://bugzilla.mozilla.org/show_bug.cgi?id=1619179#c7> as a way to
improve business continuity across mass revocation events, and grew from
there. The idea that 10-day certs might be a reality, and that revocation
would be wholly optional for them, was almost unimaginable at that time.
But even today, the reality is that CAs such as Let's Encrypt will likely
have to support revocation for a very long time to come: migrating the
whole world to 10-day certs will not happen overnight. So I think that this
work is worthwhile, even if other solutions are also on the horizon.

Thanks,
Aaron

_______________________________________________
Acme mailing list
Acme@ietf.org
https://www.ietf.org/mailman/listinfo/acme
