Re: [TLS] Issues with buffered, ACKed KeyUpdates in DTLS 1.3

David Benjamin Sat, 27 Apr 2024 08:03:42 -0700

What should the next steps be here? Is this a bunch of errata, or something
else?


On Wed, Apr 17, 2024 at 10:08 AM David Benjamin <david...@chromium.org>
wrote:

> > Sender implementations should already be able to retransmit messages
> with older epochs due to the "duplicated" post-auth state machine
>
> The nice thing about option 7 is that, I think, the older-epoch
> retransmission problem becomes moot in updated senders. If the sender
> doesn't activate epoch N+1 until KeyUpdate *and prior messages* are ACKed,
> and if KeyUpdate is required to be the last handshake message in epoch N,
> then the previous epoch is guaranteed to be empty by the time you activate
> epoch N+1.
>
> On Wed, Apr 17, 2024, 09:27 Marco Oliverio <ma...@wolfssl.com> wrote:
>
>> Hi David,
>>
>> Thanks for pointing this out. I also favor solution 7, as it's the simpler
>> approach and it doesn't require too much effort to add to current
>> implementations.
>> Sender implementations should already be able to retransmit messages with
>> older epochs due to the "duplicated" post-auth state machine.
>>
>> Marco
>>
>> On Tue, Apr 16, 2024 at 3:48 PM David Benjamin <david...@chromium.org>
>> wrote:
>>
>>> Thanks, Hannes!
>>>
>>> Since it was buried in there (my understanding of the issue evolved as I
>>> described it), I currently favor option 7, i.e. the sender-only fix to the
>>> KeyUpdate criteria.
>>>
>>> At first I thought we should also change the receiver to mitigate
>>> unfixed senders, but this situation should be pretty rare (most senders
>>> will send NewSessionTicket well before they KeyUpdate), DTLS 1.3 isn't very
>>> widely deployed yet, and ultimately, it's on the sender implementation to
>>> make sure all states they can get into are coherent.
>>>
>>> If the sender crashed, that's unambiguously on the sender to fix. If the
>>> sender still correctly retransmits the missing messages, the connection
>>> will perform suboptimally for a blip but still recover.
>>>
>>> David
>>>
>>>
>>> On Tue, Apr 16, 2024, 05:19 Tschofenig, Hannes <
>>> hannes.tschofe...@siemens.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>>
>>>>
>>>> this is great feedback. Give me a few days to respond to this issue
>>>> with my suggestion for moving forward.
>>>>
>>>>
>>>>
>>>> Ciao
>>>>
>>>> Hannes
>>>>
>>>>
>>>>
>>>> *From:* TLS <tls-boun...@ietf.org> *On Behalf Of *David Benjamin
>>>> *Sent:* Saturday, April 13, 2024 7:59 PM
>>>> *To:* <tls@ietf.org> <tls@ietf.org>
>>>> *Cc:* Nick Harper <nhar...@chromium.org>
>>>> *Subject:* Re: [TLS] Issues with buffered, ACKed KeyUpdates in DTLS 1.3
>>>>
>>>>
>>>>
>>>> Another issue with DTLS 1.3's state machine duplication scheme:
>>>>
>>>>
>>>>
>>>> Section 8 says implementations must not send a new KeyUpdate until the
>>>> previous KeyUpdate is ACKed, but it says nothing about other
>>>> post-handshake messages. Suppose KeyUpdate(5) is in flight and the
>>>> implementation decides to send NewSessionTicket. (E.g. the application
>>>> called some "send NewSessionTicket" API.) The new epoch doesn't exist yet,
>>>> so naively one would start sending NewSessionTicket(6) in the current
>>>> epoch. Now the peer ACKs KeyUpdate(5), so we transition to the new epoch.
>>>> But retransmissions must retain their original epoch:
>>>>
>>>>
>>>>
>>>> > Implementations MUST send retransmissions of lost messages using the
>>>> same epoch and keying material as the original transmission.
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-3
>>>>
>>>>
>>>>
>>>> This means we must keep sending the NST at the old epoch. But the peer
>>>> may have no idea there's a message at that epoch due to packet loss!
>>>> Section 8 does ask the peer to keep the old epoch around for a spell, but
>>>> eventually the peer will discard the old epoch. If NST(6) didn't get
>>>> through before then, the entire post-handshake stream is now wedged!
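A minimal Python sketch of this wedge, under stated assumptions: the `Message` and `Receiver` names are hypothetical, a message is modeled only by its message_seq and the epoch of its original transmission, and "cannot decrypt" is reduced to "the receiver no longer holds that epoch's keys". It is an illustration of the failure mode, not of any real DTLS stack.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    seq: int    # message_seq
    epoch: int  # epoch of the original transmission; retransmissions reuse it
    name: str


@dataclass
class Receiver:
    epochs: set               # epochs whose receive keys are still installed
    next_receive_seq: int
    buffered: dict = field(default_factory=dict)
    processed: list = field(default_factory=list)

    def receive(self, msg: Message) -> str:
        if msg.epoch not in self.epochs:
            return "undecryptable"  # keys for this epoch were discarded
        if msg.seq < self.next_receive_seq:
            return "duplicate"
        self.buffered[msg.seq] = msg
        # Process messages strictly in message_seq order (Section 5.2).
        while self.next_receive_seq in self.buffered:
            self.processed.append(self.buffered.pop(self.next_receive_seq).name)
            self.next_receive_seq += 1
        return "accepted"


# Epoch N = 3. KeyUpdate(5) arrives; NewSessionTicket(6), also sent in epoch 3, is lost.
r = Receiver(epochs={3}, next_receive_seq=5)
r.receive(Message(5, 3, "KeyUpdate"))
r.epochs.add(4)      # processing the KeyUpdate installs epoch N+1
r.epochs.discard(3)  # ...and the old epoch is eventually dropped
# The NST retransmission must reuse epoch 3, so it can never be read again,
# and every later post-handshake message waits behind message_seq 6.
print(r.receive(Message(6, 3, "NewSessionTicket")))    # undecryptable
print(r.receive(Message(7, 4, "CertificateRequest")))  # accepted (but only buffered)
print(r.processed, r.next_receive_seq)                 # ['KeyUpdate'] 6
```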
>>>>
>>>>
>>>>
>>>> I think this means we need to amend Section 8 to forbid sending *any*
>>>> post-handshake message after KeyUpdate. That is, rather than saying you
>>>> cannot send a new KeyUpdate, a KeyUpdate terminates the post-handshake
>>>> stream at that epoch and all new post-handshake messages, be they KeyUpdate
>>>> or anything else, must be enqueued for the new epoch. This is a little
>>>> unfortunate because a TLS library which transparently KeyUpdates will then
>>>> inadvertently introduce hiccups where post-handshake messages triggered by
>>>> the application, like post-handshake auth, are blocked.
>>>>
>>>>
>>>>
>>>> That then suggests some more options for fixing the original problem.
>>>>
>>>>
>>>>
>>>> *7. Fix the sender's KeyUpdate criteria*
>>>>
>>>>
>>>>
>>>> We tell the sender to wait for all previous messages to be ACKed too.
>>>> Fix the first paragraph of section 8 to say:
>>>>
>>>>
>>>>
>>>> > As with other handshake messages with no built-in response,
>>>> KeyUpdates MUST be acknowledged. Acknowledgements are used to both control
>>>> retransmission and transition to the next epoch. Implementations MUST NOT
>>>> send records with the new keys until the KeyUpdate *and all preceding
>>>> messages* have been acknowledged. This facilitates epoch
>>>> reconstruction (Section 4.2.2) and avoids too many epochs in active use, by
>>>> ensuring the peer has processed the KeyUpdate and started receiving at the
>>>> new epoch.
>>>>
>>>> >
>>>>
>>>> > A KeyUpdate message terminates the post-handshake stream in an epoch.
>>>> After sending KeyUpdate in an epoch, implementations MUST NOT send any new
>>>> post-handshake messages in that epoch. Note that, if the implementation has
>>>> sent KeyUpdate but is waiting for an ACK, the next epoch is not yet active.
>>>> In this case, subsequent post-handshake messages may not be sent until
>>>> receiving the ACK.
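The proposed option 7 rule could be sketched on the sender side roughly as follows. This is a hypothetical model (class and method names are invented for illustration, and message_seq values follow the examples in this thread), not a description of any existing implementation: the sender activates epoch N+1 only once the KeyUpdate and every earlier message are ACKed, and defers any post-handshake message requested in the meantime.

```python
class PostHandshakeSender:
    """Hypothetical sender bookkeeping for the proposed option 7 rule."""

    def __init__(self, epoch: int, next_seq: int):
        self.epoch = epoch          # current sending epoch N
        self.next_seq = next_seq    # next message_seq to assign
        self.unacked = {}           # message_seq -> (name, epoch of original send)
        self.key_update_seq = None  # message_seq of the outstanding KeyUpdate, if any
        self.queued = []            # messages deferred to the next epoch

    def send(self, name: str):
        if self.key_update_seq is not None:
            # A KeyUpdate terminated the post-handshake stream in this epoch.
            self.queued.append(name)
            return None
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = (name, self.epoch)
        if name == "KeyUpdate":
            self.key_update_seq = seq
        return seq

    def on_ack(self, seqs):
        for seq in seqs:
            self.unacked.pop(seq, None)
        if self.key_update_seq is None:
            return
        if any(seq <= self.key_update_seq for seq in self.unacked):
            return  # KeyUpdate or an earlier message is still unACKed: stay in N
        # Nothing unACKed remains in epoch N, so no retransmission will ever
        # need the old sending keys once N+1 is active.
        self.epoch += 1
        self.key_update_seq = None
        pending, self.queued = self.queued, []
        for name in pending:
            self.send(name)


s = PostHandshakeSender(epoch=3, next_seq=5)
s.send("NewSessionTicket")  # message_seq 5, epoch 3
s.send("KeyUpdate")         # message_seq 6, epoch 3
s.send("NewSessionTicket")  # deferred: not sent in epoch 3
s.on_ack([6])               # KeyUpdate ACKed, but NewSessionTicket(5) is not
print(s.epoch)              # 3
s.on_ack([5])
print(s.epoch, s.unacked)   # 4 {7: ('NewSessionTicket', 4)}
```

The trade-off is the one the proposed text accepts: post-handshake messages requested while a KeyUpdate is outstanding wait a round trip, in exchange for epoch N+1 never being active while epoch N still has messages in flight.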
>>>>
>>>>
>>>>
>>>> And then on the receiver side, we leave things as-is. If the sender
>>>> implemented the old semantics AND had multiple post-handshake transactions
>>>> in parallel, it might update keys too early and then we get into the
>>>> situation described in (1). We then declare that, if this happens and the
>>>> sender gets confused as a result, that's the sender's fault. Hopefully this
>>>> is rare enough (did anyone even implement 5.8.4, or does everyone just
>>>> serialize their post-handshake transactions?) not to be a serious protocol
>>>> break? That risk aside, this option seems the most in spirit with the
>>>> current design to me.
>>>>
>>>>
>>>>
>>>> *8. Decouple post-handshake retransmissions from epochs*
>>>>
>>>>
>>>>
>>>> If we instead say that the same-epoch rule only applies to the
>>>> handshake, and not to post-handshake messages, I think option 5 (process
>>>> KeyUpdate out of order) might become viable? I'm not sure. Either way, this
>>>> seems like a significant protocol break, so I don't think this is an option
>>>> until some hypothetical DTLS 1.4.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Apr 12, 2024 at 6:59 PM David Benjamin <david...@chromium.org>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> This is going to be a bit long. In short, DTLS 1.3 KeyUpdates seem to
>>>> conflate the peer *receiving* the KeyUpdate with the peer *processing* the
>>>> KeyUpdate, in ways that appear to break some assumptions made by the
>>>> protocol design.
>>>>
>>>>
>>>>
>>>> *When to switch keys in KeyUpdate*
>>>>
>>>>
>>>>
>>>> So, first, DTLS 1.3, unlike TLS 1.3, applies the KeyUpdate on the ACK,
>>>> not when the KeyUpdate is sent. This makes sense because KeyUpdate records
>>>> are not intrinsically ordered with app data records sent after them:
>>>>
>>>>
>>>>
>>>> > As with other handshake messages with no built-in response,
>>>> KeyUpdates MUST be acknowledged. In order to facilitate epoch
>>>> reconstruction (Section 4.2.2), implementations MUST NOT send records with
>>>> the new keys or send a new KeyUpdate until the previous KeyUpdate has been
>>>> acknowledged (this avoids having too many epochs in active use).
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-8-1
>>>>
>>>>
>>>>
>>>> Now, the parenthetical says this is to avoid having too many epochs in
>>>> active use, but it appears that there are stronger assumptions on this:
>>>>
>>>>
>>>>
>>>> > After the handshake is complete, if the epoch bits do not match those
>>>> from the current epoch, implementations SHOULD use the most recent *past*
>>>> epoch which has matching bits, and then reconstruct the
>>>> sequence number for that epoch as described above.
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.2-3
>>>>
>>>> (emphasis mine)
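A minimal sketch of that rule, assuming the receiver only sees the two low-order epoch bits carried in the DTLS 1.3 unified header; the function name is hypothetical. It shows why, under the current rules, a record from epoch N+1 can only ever be mapped to some past epoch (here N-3), which is the misreading discussed under option 1.

```python
def reconstruct_epoch(epoch_bits: int, current_epoch: int) -> int:
    """Post-handshake epoch reconstruction, sketched from RFC 9147, Section 4.2.2.

    epoch_bits: the 2-bit epoch field from the unified header (epoch & 0x3).
    Returns current_epoch if the bits match, otherwise the most recent *past*
    epoch with matching low bits. Never returns a future epoch.
    """
    back = (current_epoch - epoch_bits) & 0x3  # how many epochs back the match is
    epoch = current_epoch - back
    if epoch < 0:
        raise ValueError("no epoch with matching bits")
    return epoch


current = 5  # receiver is on epoch N = 5
print(reconstruct_epoch(4 & 0x3, current))  # 4: an older record, N-1
print(reconstruct_epoch(6 & 0x3, current))  # 2: a record from N+1 is misread as N-3
```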
>>>>
>>>>
>>>>
>>>> > After the handshake, implementations MUST use the highest available
>>>> sending epoch [to send ACKs]
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-7-7
>>>>
>>>>
>>>>
>>>> These two snippets imply the protocol wants the peer to definitely have
>>>> installed the new keys before you start using them. This makes sense
>>>> because sending stuff the peer can't decrypt is pretty silly. As an aside,
>>>> DTLS 1.3 retains this text from DTLS 1.2:
>>>>
>>>>
>>>>
>>>> > Conversely, it is possible for records that are protected with the
>>>> new epoch to be received prior to the completion of a handshake. For
>>>> instance, the server may send its Finished message and then start
>>>> transmitting data. Implementations MAY either buffer or discard such
>>>> records, though when DTLS is used over reliable transports (e.g., SCTP
>>>> [RFC4960]), they SHOULD be buffered and processed once the handshake
>>>> completes.
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-2
>>>>
>>>>
>>>> The text from DTLS 1.2 talks about *a* handshake, which presumably
>>>> refers to rekeying via renegotiation. But in DTLS 1.3, the epoch
>>>> reconstruction rule and the KeyUpdate rule mean this is only possible
>>>> during the handshake, when you see epoch 4 and expect epoch 0-3. The steady
>>>> state rekeying mechanism never hits this case. (This is a reasonable change
>>>> because there's no sense in unnecessarily introducing blips where the
>>>> connection is less tolerant of reordering.)
>>>>
>>>>
>>>>
>>>> *Buffered handshake messages*
>>>>
>>>>
>>>>
>>>> Okay, so KeyUpdates want to wait for the recipient to install keys,
>>>> except we don't seem to actually achieve this! Section 5.2 says:
>>>>
>>>>
>>>>
>>>> > DTLS implementations maintain (at least notionally) a
>>>> next_receive_seq counter. This counter is initially set to zero. When a
>>>> handshake message is received, if its message_seq value matches
>>>> next_receive_seq, next_receive_seq is incremented and the message is
>>>> processed. If the sequence number is less than next_receive_seq, the
>>>> message MUST be discarded. If the sequence number is greater than
>>>> next_receive_seq, the implementation SHOULD queue the message but MAY
>>>> discard it. (This is a simple space/bandwidth trade-off).
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-7
>>>>
>>>>
>>>>
>>>> I assume this is intended to apply to post-handshake messages too. (See
>>>> below for a discussion of the alternative.) But that means that, when you
>>>> receive a KeyUpdate, you might not immediately process it. Suppose
>>>> next_receive_seq is 5, and the peer sends NewSessionTicket(5),
>>>> NewSessionTicket(6), and KeyUpdate(7). 5 is lost, but 6 and 7 come in,
>>>> perhaps even in the same record, which means that you're forced to ACK both
>>>> or neither. But suppose the implementation is willing to buffer 3 messages
>>>> ahead, so it ACKs the 6+7 record, by the rules in section 7, which permits
>>>> ACKing fragments that were buffered and not yet processed.
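The scenario above can be reproduced with a small simulation. This is a hypothetical model (the class name, method names, and the `max_ahead` buffer limit are illustrative assumptions, with message contents reduced to names): it applies the Section 5.2 queueing rule and ACKs a record whenever nothing in it was discarded, as Section 7 permits.

```python
class HandshakeReceiver:
    """Hypothetical receiver applying the Section 5.2 next_receive_seq rule."""

    def __init__(self, next_receive_seq: int, max_ahead: int):
        self.next_receive_seq = next_receive_seq
        self.max_ahead = max_ahead  # buffer at most this many messages ahead
        self.queue = {}             # message_seq -> name (buffered, unprocessed)
        self.processed = []

    def receive_record(self, messages) -> bool:
        """Handle one record of (message_seq, name) pairs; return True if it is ACKed."""
        discarded = False
        for seq, name in messages:
            if seq < self.next_receive_seq:
                continue  # already processed: discarded, but not new, so still ACKable
            if seq - self.next_receive_seq > self.max_ahead:
                discarded = True  # too far ahead: MAY discard
                continue
            self.queue[seq] = name  # SHOULD queue
        while self.next_receive_seq in self.queue:
            self.processed.append(self.queue.pop(self.next_receive_seq))
            self.next_receive_seq += 1
        # Section 7 allows ACKing buffered-but-unprocessed messages, so the record
        # is ACKed as long as nothing in it was dropped.
        return not discarded


r = HandshakeReceiver(next_receive_seq=5, max_ahead=3)
# NewSessionTicket(5) is lost; 6 and 7 arrive together in one record.
acked = r.receive_record([(6, "NewSessionTicket"), (7, "KeyUpdate")])
print(acked, r.processed, sorted(r.queue))  # True [] [6, 7]
# The peer now sees KeyUpdate(7) ACKed and switches to epoch N+1, yet this
# receiver has not processed it and has no N+1 keys until 5 is retransmitted.
r.receive_record([(5, "NewSessionTicket")])
print(r.processed)  # ['NewSessionTicket', 'NewSessionTicket', 'KeyUpdate']
```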
>>>>
>>>>
>>>>
>>>> That means the peer will switch keys, and all subsequent records
>>>> from them will come from epoch N+1. But we, the receiver, are not ready
>>>> for N+1 yet, so we contradict everything above. We also contradict this
>>>> parenthetical in section 8:
>>>>
>>>>
>>>>
>>>> > Due to loss and/or reordering, DTLS 1.3 implementations may receive a
>>>> record with an older epoch than the current one (the requirements above
>>>> preclude receiving a newer record).
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-8-2
>>>>
>>>>
>>>>
>>>> I assume then that this was not actually what was intended.
>>>>
>>>>
>>>>
>>>> *Options (and non-options)*
>>>>
>>>>
>>>>
>>>> Assuming I'm reading this right, we seem to have made a mess of things.
>>>> The sender could avoid this by only allowing one active post-handshake
>>>> transaction at a time and serializing them, at the cost of taking a
>>>> round-trip for each. But the receiver needs to account for all possible
>>>> senders, so that doesn't help. Some options that come to mind:
>>>>
>>>>
>>>>
>>>> *1. Accept that the sender updates its keys too early*
>>>>
>>>>
>>>>
>>>> Apart from contradicting most of the specification text, the protocol
>>>> doesn't *break* per se if you just allow the peer to switch keys early
>>>> in this buffered KeyUpdate case. We *merely* contradict all of the
>>>> explanatory text and introduce a bunch of cases that the specification
>>>> suggests are impossible. :-) Also the connection quality is poor.
>>>>
>>>>
>>>>
>>>> The sender will use epoch N+1 at a point when the peer is on N. But
>>>> epoch reconstruction will misread it as N-3 instead of N+1, and either way
>>>> you won't have the keys to decrypt it yet! The connection is interrupted
>>>> (and with all packets discarded because epoch reconstruction fails!) until
>>>> the peer retransmits 5 and you catch up. Until then, not only will you not
>>>> receive application data, but you also won't receive ACKs. This also adds a
>>>> subtle corner case on the sender side: the sender cannot discard the old
>>>> sending keys because it still has unACKed messages from the previous epoch
>>>> to retransmit, but this is not called out in section 8. Section 8 only
>>>> discusses the receiver needing to retain the old epoch.
>>>>
>>>>
>>>> This seems not great. Also it contradicts much of the text in the spec,
>>>> including section 8 explicitly saying this case cannot happen.
>>>>
>>>>
>>>>
>>>> *2. Never ACK buffered KeyUpdates*
>>>>
>>>>
>>>>
>>>> We can say that KeyUpdates are special and, unless you're willing to
>>>> process them immediately, you must not ACK the records containing them.
>>>> This means you might under-ACK and the peer might over-retransmit, but
>>>> seems not fatal. This also seems a little hairy to implement if you want to
>>>> avoid under-ACKing unnecessarily. You might have message
>>>> NewSessionTicket(6) buffered and then receive a record with
>>>> NewSessionTicket(5) and KeyUpdate(7). That record may appear unACKable, but
>>>> it's fine because you'll immediately process 5 then 6 then 7... unless your
>>>> NewSessionTicket processing is asynchronous, in which case it might not be?
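Option 2's receiver-side decision could look roughly like the following, assuming the receiver knows which message_seq values it has already buffered; the function name and argument shapes are hypothetical. It covers both examples in the paragraph above and makes the synchronous-processing assumption explicit.

```python
def may_ack_record(record, next_receive_seq: int, buffered: set) -> bool:
    """Option 2 sketch: may a record containing these messages be ACKed?

    record: list of (message_seq, msg_type) pairs carried in the record.
    buffered: message_seq values already queued from earlier records.
    Assumes processing is synchronous, so every message that becomes
    contiguous is processed immediately (not true if, say, NewSessionTicket
    handling is asynchronous).
    """
    available = set(buffered) | {seq for seq, _ in record}
    # Find the first gap: everything below it would be processed right away.
    first_gap = next_receive_seq
    while first_gap in available:
        first_gap += 1
    # A KeyUpdate at or past the gap would only be buffered, so don't ACK.
    return not any(t == "KeyUpdate" and seq >= first_gap for seq, t in record)


# NST(5) lost, NST(6) and KeyUpdate(7) arrive together: KeyUpdate stays buffered.
print(may_ack_record([(6, "NewSessionTicket"), (7, "KeyUpdate")], 5, set()))  # False
# NST(6) already buffered; a record with NST(5) and KeyUpdate(7) fills the gap.
print(may_ack_record([(5, "NewSessionTicket"), (7, "KeyUpdate")], 5, {6}))  # True
```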
>>>>
>>>>
>>>>
>>>> Despite all that mess, this seems the most viable option?
>>>>
>>>>
>>>>
>>>> *3. Declare this situation a sender error*
>>>>
>>>>
>>>>
>>>> We could say this is not allowed and senders MUST NOT send KeyUpdate if
>>>> there are any outstanding post-handshake messages. And then the receiver
>>>> should fail with unexpected_message if it ever receives KeyUpdate at a
>>>> future message_seq. But as the RFC is already published, I don't know if
>>>> this is compatible with existing implementations.
>>>>
>>>>
>>>>
>>>> *4. Explicit KeyUpdateAck message*
>>>>
>>>>
>>>>
>>>> We could have made a KeyUpdateAck message to signal that you've
>>>> processed a KeyUpdate, not just received it. But that's a protocol change
>>>> and the RFC is stamped, so it's too late now.
>>>>
>>>>
>>>>
>>>> *5. Process KeyUpdate out of order*
>>>>
>>>>
>>>>
>>>> We could say that the receiver doesn't buffer KeyUpdate. It just goes
>>>> ahead and processes it immediately to install epoch N+1. This seems like it
>>>> would address the issue but opens more cans of worms. Now the receiver
>>>> needs to keep the old epoch around not just for packet reordering, but also
>>>> to pick up the retransmissions of the missing handshake messages. Also, by
>>>> activating the new epoch, the receiver now allows the sender to KeyUpdate
>>>> again, and again, and again. But, several epochs later, the holes in the
>>>> message stream may remain unfilled, so we still need the old keys. Without
>>>> further protocol rules, a sender could force the receiver to keep keys
>>>> arbitrarily many epochs back. All this is, at best, a difficult case that
>>>> is unlikely to be well-tested and, at worst, could get the implementation
>>>> into some broken state where it then misbehaves badly.
>>>>
>>>>
>>>>
>>>> *6. Post-handshake transactions aren't ordered at all*
>>>>
>>>>
>>>>
>>>> It could be that my assumption above was wrong and the next_receive_seq
>>>> discussion in 5.2 only applies to the handshake. After all, section 5.8.4
>>>> discusses how every post-handshake transaction duplicates the "state
>>>> machine". Except it only says to duplicate the 5.8.1 state machine, and
>>>> it's ambiguous whether that includes the message_seq logic.
>>>>
>>>>
>>>>
>>>> However, going this direction seems to very quickly make a mess. If
>>>> each post-handshake transaction handles message_seq independently, you
>>>> cannot distinguish a retransmission from a new transaction. That seems
>>>> quite bad, so presumably the intent was to use message_seq to distinguish
>>>> those. (I.e. the intent can't have been to duplicate the message_seq
>>>> state.) Indeed, we have:
>>>>
>>>>
>>>>
>>>> > However, in DTLS 1.3 the message_seq is not reset, to allow
>>>> distinguishing a retransmission from a previously sent post-handshake
>>>> message from a newly sent post-handshake message.
>>>>
>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-6
>>>>
>>>>
>>>>
>>>> But if we distinguish with message_seq AND process transactions out of
>>>> order, now receivers need to keep track of fairly complex state in case
>>>> they process messages 5, 7, 9, 11, 13, 15, 17, ... but then only get the
>>>> even ones later. And we'd need to define some kind of sliding window for
>>>> what happens if you receive message_seq 9000 all of a sudden. And we import
>>>> all the cross-epoch problems in option 5 above. None of that is in the
>>>> text, so I assume this was not the intended reading, and I don't think we
>>>> want to go that direction. :-)
>>>>
>>>>
>>>> *Digression: ACK fate-sharing and flow control*
>>>>
>>>>
>>>>
>>>> All this alludes to another quirk that isn't a problem, but is a little
>>>> non-obvious and warrants some discussion in the spec. Multiple handshake
>>>> fragments may be packed into the same record, but ACKs apply to the whole
>>>> record. If you receive a fragment for a message sequence too far into the
>>>> future, you are permitted to discard the fragment. But if you discard
>>>> *any* fragment, you cannot ACK the record, *even if there were
>>>> fragments which you did process*. During the handshake, an
>>>> implementation could avoid needing to make this decision by knowing the
>>>> maximum size of a handshake flight. After the handshake, there is no
>>>> inherent limit on how many NewSessionTickets the peer may choose to send in
>>>> a row, and no flow control.
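A minimal sketch of this fate-sharing, assuming a receiver that caps how far past next_receive_seq it buffers. The `max_ahead` limit, the `Fragment` type, and the `handle_record` function are hypothetical illustrations, not taken from the RFC; fragment offsets and lengths are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    message_seq: int
    data: bytes


def handle_record(fragments, next_receive_seq: int, max_ahead: int, buffer: dict) -> bool:
    """Keep what fits the buffer limit; return True only if the record may be ACKed.

    buffer maps message_seq -> list of retained fragments. max_ahead is an
    ad-hoc post-handshake limit, since there is no flow control for these.
    """
    dropped_any = False
    for frag in fragments:
        if frag.message_seq < next_receive_seq:
            continue  # stale retransmission of an already-processed message
        if frag.message_seq - next_receive_seq > max_ahead:
            dropped_any = True  # too far into the future: discarded
            continue
        buffer.setdefault(frag.message_seq, []).append(frag)
    # ACKs cover whole records: one discarded fragment makes the record
    # unACKable even though its other fragments were kept.
    return not dropped_any


buf = {}
ok = handle_record([Fragment(5, b"ticket-a"), Fragment(40, b"ticket-b")], 5, 8, buf)
print(ok, sorted(buf))  # False [5]
```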
>>>>
>>>>
>>>>
>>>> QUIC ran into a similar issue here and said an implementation can
>>>> choose an ad-hoc limit, after which it can choose to either wedge the
>>>> post-handshake stream or return an error.
>>>>
>>>> https://github.com/quicwg/base-drafts/issues/1834
>>>> https://github.com/quicwg/base-drafts/pull/2524
>>>>
>>>>
>>>>
>>>> I suspect the most practical outcome for DTLS (arguably already
>>>> supported by the existing text, but not very obviously) is to instead say
>>>> the receiver just refuses to ACK stuff and, okay, maybe in some weird edge
>>>> cases the receiver under-ACKs and then the sender over-retransmits until
>>>> things settle down. ACKs are a bit more tightly integrated with QUIC, by
>>>> contrast, so refusing to ACK a packet due to one bad frame is less of an
>>>> option there. Still, I think this would have been worth calling out in the text.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> So... did I read all this right? Did we indeed make a mess of this, or
>>>> did I miss something?
>>>>
>>>>
>>>>
>>>> David
>>>>
>>>> _______________________________________________
>>> TLS mailing list
>>> TLS@ietf.org
>>> https://www.ietf.org/mailman/listinfo/tls
>>>
>>
