To follow up here, I've filed https://www.rfc-editor.org/errata/eid8047 with "option 7" from the thread.
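
For anyone skimming the thread, the sender-side rule in option 7 can be sketched roughly as below. This is only an illustrative model, not code from any DTLS stack: the `PostHandshakeSender` class, its fields, and the starting epoch value are all hypothetical, and retransmission timers and record packing are left out entirely.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class PostHandshakeSender:
    """Hypothetical model of the option 7 sender rule (not a real library API)."""
    epoch: int = 3                       # example starting epoch (assumption)
    next_seq: int = 0                    # message_seq keeps counting across epochs
    unacked: Set[int] = field(default_factory=set)
    key_update_seq: Optional[int] = None  # message_seq of the KeyUpdate in flight
    queued: List[str] = field(default_factory=list)

    def send(self, name: str) -> Optional[Tuple[int, int, str]]:
        # A KeyUpdate terminates the post-handshake stream in this epoch, so
        # anything requested afterwards waits for the next epoch.
        if self.key_update_seq is not None:
            self.queued.append(name)
            return None
        seq = self.next_seq
        self.next_seq += 1
        self.unacked.add(seq)
        if name == "KeyUpdate":
            self.key_update_seq = seq
        return (self.epoch, seq, name)

    def on_ack(self, seq: int) -> List[Tuple[int, int, str]]:
        self.unacked.discard(seq)
        # Activate the next epoch only once the KeyUpdate AND every preceding
        # message have been ACKed, so the old epoch has nothing left to resend.
        if self.key_update_seq is None or self.unacked:
            return []
        self.epoch += 1
        self.key_update_seq = None
        pending, self.queued = self.queued, []
        sent = []
        for name in pending:
            out = self.send(name)
            if out is not None:
                sent.append(out)
        return sent
```

In this model, an ACK for the KeyUpdate alone does not move the sender to the new epoch; it only switches once nothing from the old epoch remains outstanding, which is why the "retransmit at an older epoch" case should not arise for updated senders.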

On Sat, Apr 27, 2024 at 8:11 AM Watson Ladd <watsonbl...@gmail.com> wrote:

> On Sat, Apr 27, 2024, 8:03 AM David Benjamin <david...@chromium.org>
> wrote:
>
>> What should the next steps be here? Is this a bunch of errata, or
>> something else?
>
> Errata at a minimum but this might be big enough for a small RFC
> describing the fix.
>
>> On Wed, Apr 17, 2024 at 10:08 AM David Benjamin <david...@chromium.org>
>> wrote:
>>
>>> > Sender implementations should already be able to retransmit
>>> > messages with older epochs due to the "duplicated" post-auth
>>> > state machine
>>>
>>> The nice thing about option 7 is that the older epochs retransmit
>>> problem becomes moot in updated senders, I think. If the sender
>>> doesn't activate epoch N+1 until KeyUpdate *and prior messages* are
>>> ACKed and if KeyUpdate is required to be the last handshake message
>>> in epoch N, then the previous epoch is guaranteed to be empty by the
>>> time you activate the next one.
>>>
>>> On Wed, Apr 17, 2024, 09:27 Marco Oliverio <ma...@wolfssl.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Thanks for pointing this out. I also favor solution 7 as it's the
>>>> simpler approach and it doesn't require too much effort to add in
>>>> current implementations.
>>>> Sender implementations should already be able to retransmit
>>>> messages with older epochs due to the "duplicated" post-auth state
>>>> machine.
>>>>
>>>> Marco
>>>>
>>>> On Tue, Apr 16, 2024 at 3:48 PM David Benjamin
>>>> <david...@chromium.org> wrote:
>>>>
>>>>> Thanks, Hannes!
>>>>>
>>>>> Since it was buried in there (my understanding of the issue
>>>>> evolved as I described it), I currently favor option 7. I.e. the
>>>>> sender-only fix to the KeyUpdate criteria.
>>>>>
>>>>> At first I thought we should also change the receiver to mitigate
>>>>> unfixed senders, but this situation should be pretty rare (most
>>>>> senders will send NewSessionTicket well before they KeyUpdate),
>>>>> DTLS 1.3 isn't very widely deployed yet, and ultimately, it's on
>>>>> the sender implementation to make sure all states they can get
>>>>> into are coherent.
>>>>>
>>>>> If the sender crashed, that's unambiguously on the sender to fix.
>>>>> If the sender still correctly retransmits the missing messages,
>>>>> the connection will perform suboptimally for a blip but still
>>>>> recover.
>>>>>
>>>>> David
>>>>>
>>>>> On Tue, Apr 16, 2024, 05:19 Tschofenig, Hannes
>>>>> <hannes.tschofe...@siemens.com> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> this is great feedback. Give me a few days to respond to this
>>>>>> issue with my suggestion for moving forward.
>>>>>>
>>>>>> Ciao
>>>>>>
>>>>>> Hannes
>>>>>>
>>>>>> *From:* TLS <tls-boun...@ietf.org> *On Behalf Of *David Benjamin
>>>>>> *Sent:* Saturday, April 13, 2024 7:59 PM
>>>>>> *To:* <tls@ietf.org> <tls@ietf.org>
>>>>>> *Cc:* Nick Harper <nhar...@chromium.org>
>>>>>> *Subject:* Re: [TLS] Issues with buffered, ACKed KeyUpdates in
>>>>>> DTLS 1.3
>>>>>>
>>>>>> Another issue with DTLS 1.3's state machine duplication scheme:
>>>>>>
>>>>>> Section 8 says implementations must not send a new KeyUpdate
>>>>>> until the previous KeyUpdate is ACKed, but it says nothing about
>>>>>> other post-handshake messages. Suppose KeyUpdate(5) is in flight
>>>>>> and the implementation decides to send NewSessionTicket. (E.g.
>>>>>> the application called some "send NewSessionTicket" API.) The new
>>>>>> epoch doesn't exist yet, so naively one would start sending
>>>>>> NewSessionTicket(6) in the current epoch. Now the peer ACKs
>>>>>> KeyUpdate(5), so we transition to the new epoch.
>>>>>> But retransmissions must retain their original epoch:
>>>>>>
>>>>>> > Implementations MUST send retransmissions of lost messages
>>>>>> > using the same epoch and keying material as the original
>>>>>> > transmission.
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-3
>>>>>>
>>>>>> This means we must keep sending the NST at the old epoch. But the
>>>>>> peer may have no idea there's a message at that epoch due to
>>>>>> packet loss! Section 8 does ask the peer to keep the old epoch
>>>>>> around for a spell, but eventually the peer will discard the old
>>>>>> epoch. If NST(6) didn't get through before then, the entire
>>>>>> post-handshake stream is now wedged!
>>>>>>
>>>>>> I think this means we need to amend Section 8 to forbid sending
>>>>>> *any* post-handshake message after KeyUpdate. That is, rather
>>>>>> than saying you cannot send a new KeyUpdate, a KeyUpdate
>>>>>> terminates the post-handshake stream at that epoch and all new
>>>>>> post-handshake messages, be they KeyUpdate or anything else, must
>>>>>> be enqueued for the new epoch. This is a little unfortunate
>>>>>> because a TLS library which transparently KeyUpdates will then
>>>>>> inadvertently introduce hiccups where post-handshake messages
>>>>>> triggered by the application, like post-handshake auth, are
>>>>>> blocked.
>>>>>>
>>>>>> That then suggests some more options for fixing the original
>>>>>> problem.
>>>>>>
>>>>>> *7. Fix the sender's KeyUpdate criteria*
>>>>>>
>>>>>> We tell the sender to wait for all previous messages to be ACKed
>>>>>> too. Fix the first paragraph of section 8 to say:
>>>>>>
>>>>>> > As with other handshake messages with no built-in response,
>>>>>> > KeyUpdates MUST be acknowledged. Acknowledgements are used to
>>>>>> > both control retransmission and transition to the next epoch.
>>>>>> > Implementations MUST NOT send records with the new keys until
>>>>>> > the KeyUpdate *and all preceding messages* have been
>>>>>> > acknowledged. This facilitates epoch reconstruction (Section
>>>>>> > 4.2.2) and avoids too many epochs in active use, by ensuring
>>>>>> > the peer has processed the KeyUpdate and started receiving at
>>>>>> > the new epoch.
>>>>>> >
>>>>>> > A KeyUpdate message terminates the post-handshake stream in an
>>>>>> > epoch. After sending KeyUpdate in an epoch, implementations
>>>>>> > MUST NOT send any new post-handshake messages in that epoch.
>>>>>> > Note that, if the implementation has sent KeyUpdate but is
>>>>>> > waiting for an ACK, the next epoch is not yet active. In this
>>>>>> > case, subsequent post-handshake messages may not be sent until
>>>>>> > receiving the ACK.
>>>>>>
>>>>>> And then on the receiver side, we leave things as-is. If the
>>>>>> sender implemented the old semantics AND had multiple
>>>>>> post-handshake transactions in parallel, it might update keys too
>>>>>> early and then we get into the situation described in (1). We
>>>>>> then declare that, if this happens, and the sender gets confused
>>>>>> as a result, that's the sender's fault. Hopefully this is rare
>>>>>> enough (did anyone even implement 5.8.4, or does everyone just
>>>>>> serialize their post-handshake transitions?) to not be a serious
>>>>>> protocol break? That risk aside, this option seems the most in
>>>>>> spirit with the current design to me.
>>>>>>
>>>>>> *8. Decouple post-handshake retransmissions from epochs*
>>>>>>
>>>>>> If we instead say that the same epoch rule only applies for the
>>>>>> handshake, and not post-handshake messages, I think option 5
>>>>>> (process KeyUpdate out of order) might become viable? I'm not
>>>>>> sure.
>>>>>> Either way, this seems like a significant protocol break, so I
>>>>>> don't think this is an option until some hypothetical DTLS 1.4.
>>>>>>
>>>>>> On Fri, Apr 12, 2024 at 6:59 PM David Benjamin
>>>>>> <david...@chromium.org> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This is going to be a bit long. In short, DTLS 1.3 KeyUpdates
>>>>>> seem to conflate the peer *receiving* the KeyUpdate with the peer
>>>>>> *processing* the KeyUpdate, in ways that appear to break some
>>>>>> assumptions made by the protocol design.
>>>>>>
>>>>>> *When to switch keys in KeyUpdate*
>>>>>>
>>>>>> So, first, DTLS 1.3, unlike TLS 1.3, applies the KeyUpdate on the
>>>>>> ACK, not when the KeyUpdate is sent. This makes sense because
>>>>>> KeyUpdate records are not intrinsically ordered with app data
>>>>>> records sent after them:
>>>>>>
>>>>>> > As with other handshake messages with no built-in response,
>>>>>> > KeyUpdates MUST be acknowledged. In order to facilitate epoch
>>>>>> > reconstruction (Section 4.2.2), implementations MUST NOT send
>>>>>> > records with the new keys or send a new KeyUpdate until the
>>>>>> > previous KeyUpdate has been acknowledged (this avoids having
>>>>>> > too many epochs in active use).
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-8-1
>>>>>>
>>>>>> Now, the parenthetical says this is to avoid having too many
>>>>>> epochs in active use, but it appears that there are stronger
>>>>>> assumptions on this:
>>>>>>
>>>>>> > After the handshake is complete, if the epoch bits do not match
>>>>>> > those from the current epoch, implementations SHOULD use the
>>>>>> > most recent *past* epoch which has matching bits, and then
>>>>>> > reconstruct the sequence number for that epoch as described
>>>>>> > above.
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.2-3
>>>>>>
>>>>>> (emphasis mine)
>>>>>>
>>>>>> > After the handshake, implementations MUST use the highest
>>>>>> > available sending epoch [to send ACKs]
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-7-7
>>>>>>
>>>>>> These two snippets imply the protocol wants the peer to
>>>>>> definitely have installed the new keys before you start using
>>>>>> them. This makes sense because sending stuff the peer can't
>>>>>> decrypt is pretty silly. As an aside, DTLS 1.3 retains this text
>>>>>> from DTLS 1.2:
>>>>>>
>>>>>> > Conversely, it is possible for records that are protected with
>>>>>> > the new epoch to be received prior to the completion of a
>>>>>> > handshake. For instance, the server may send its Finished
>>>>>> > message and then start transmitting data. Implementations MAY
>>>>>> > either buffer or discard such records, though when DTLS is used
>>>>>> > over reliable transports (e.g., SCTP [RFC4960]), they SHOULD be
>>>>>> > buffered and processed once the handshake completes.
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-4.2.1-2
>>>>>>
>>>>>> The text from DTLS 1.2 talks about *a* handshake, which
>>>>>> presumably refers to rekeying via renegotiation. But in DTLS 1.3,
>>>>>> the epoch reconstruction rule and the KeyUpdate rule mean this is
>>>>>> only possible during the handshake, when you see epoch 4 and
>>>>>> expect epoch 0-3. The steady state rekeying mechanism never hits
>>>>>> this case. (This is a reasonable change because there's no sense
>>>>>> in unnecessarily introducing blips where the connection is less
>>>>>> tolerant of reordering.)
>>>>>>
>>>>>> *Buffered handshake messages*
>>>>>>
>>>>>> Okay, so KeyUpdates want to wait for the recipient to install
>>>>>> keys, except we don't seem to actually achieve this!
>>>>>> Section 5.2 says:
>>>>>>
>>>>>> > DTLS implementations maintain (at least notionally) a
>>>>>> > next_receive_seq counter. This counter is initially set to
>>>>>> > zero. When a handshake message is received, if its message_seq
>>>>>> > value matches next_receive_seq, next_receive_seq is incremented
>>>>>> > and the message is processed. If the sequence number is less
>>>>>> > than next_receive_seq, the message MUST be discarded. If the
>>>>>> > sequence number is greater than next_receive_seq, the
>>>>>> > implementation SHOULD queue the message but MAY discard it.
>>>>>> > (This is a simple space/bandwidth trade-off).
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-7
>>>>>>
>>>>>> I assume this is intended to apply to post-handshake messages
>>>>>> too. (See below for a discussion of the alternative.) But that
>>>>>> means that, when you receive a KeyUpdate, you might not
>>>>>> immediately process it. Suppose next_receive_seq is 5, and the
>>>>>> peer sends NewSessionTicket(5), NewSessionTicket(6), and
>>>>>> KeyUpdate(7). 5 is lost, but 6 and 7 come in, perhaps even in the
>>>>>> same record which means that you're forced to ACK both or
>>>>>> neither. But suppose the implementation is willing to buffer 3
>>>>>> messages ahead, so it ACKs the 6+7 record, by the rules in
>>>>>> section 7, which permits ACKing fragments that were buffered and
>>>>>> not yet processed.
>>>>>>
>>>>>> That means the peer will switch keys and now all subsequent
>>>>>> records from them will come from epoch N+1. But the sender is not
>>>>>> ready for N+1 yet, so we contradict everything above. We also
>>>>>> contradict this parenthetical in section 8:
>>>>>>
>>>>>> > Due to loss and/or reordering, DTLS 1.3 implementations may
>>>>>> > receive a record with an older epoch than the current one (the
>>>>>> > requirements above preclude receiving a newer record).
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-8-2
>>>>>>
>>>>>> I assume then that this was not actually what was intended.
>>>>>>
>>>>>> *Options (and non-options)*
>>>>>>
>>>>>> Assuming I'm reading this right, we seem to have made a mess of
>>>>>> things. The sender could avoid this by only allowing one active
>>>>>> post-handshake transaction at a time and serializing them, at the
>>>>>> cost of taking a round-trip for each. But the receiver needs to
>>>>>> account for all possible senders, so that doesn't help. Some
>>>>>> options that come to mind:
>>>>>>
>>>>>> *1. Accept that the sender updates its keys too early*
>>>>>>
>>>>>> Apart from contradicting most of the specification text, the
>>>>>> protocol doesn't *break* per se if you just allow the peer to
>>>>>> switch keys early in this buffered KeyUpdate case. We *merely*
>>>>>> contradict all of the explanatory text and introduce a bunch of
>>>>>> cases that the specification suggests are impossible. :-) Also
>>>>>> the connection quality is poor.
>>>>>>
>>>>>> The sender will use epoch N+1 at a point when the peer is on N.
>>>>>> But epoch reconstruction will misread it as N-3 instead of N+1,
>>>>>> and either way you won't have the keys to decrypt it yet! The
>>>>>> connection is interrupted (and with all packets discarded because
>>>>>> epoch reconstruction fails!) until the peer retransmits 5 and you
>>>>>> catch up. Until then, not only will you not receive application
>>>>>> data, but you also won't receive ACKs. This also adds a subtle
>>>>>> corner case on the sender side: the sender cannot discard the old
>>>>>> sending keys because it still has unACKed messages from the
>>>>>> previous epoch to retransmit, but this is not called out in
>>>>>> section 8. Section 8 only discusses the receiver needing to
>>>>>> retain the old epoch.
>>>>>>
>>>>>> This seems not great.
>>>>>> Also it contradicts much of the text in the spec, including
>>>>>> section 8 explicitly saying this case cannot happen.
>>>>>>
>>>>>> *2. Never ACK buffered KeyUpdates*
>>>>>>
>>>>>> We can say that KeyUpdates are special and, unless you're willing
>>>>>> to process them immediately, you must not ACK the records
>>>>>> containing them. This means you might under-ACK and the peer
>>>>>> might over-retransmit, but seems not fatal. This also seems a
>>>>>> little hairy to implement if you want to avoid under-ACKing
>>>>>> unnecessarily. You might have message NewSessionTicket(6)
>>>>>> buffered and then receive a record with NewSessionTicket(5) and
>>>>>> KeyUpdate(7). That record may appear unACKable, but it's fine
>>>>>> because you'll immediately process 5 then 6 then 7... unless your
>>>>>> NewSessionTicket process is asynchronous, in which case it might
>>>>>> not be?
>>>>>>
>>>>>> Despite all that mess, this seems the most viable option?
>>>>>>
>>>>>> *3. Declare this situation a sender error*
>>>>>>
>>>>>> We could say this is not allowed and senders MUST NOT send
>>>>>> KeyUpdate if there are any outstanding post-handshake messages.
>>>>>> And then the receiver should fail with unexpected_message if it
>>>>>> ever receives KeyUpdate at a future message_seq. But as the RFC
>>>>>> is already published, I don't know if this is compatible with
>>>>>> existing implementations.
>>>>>>
>>>>>> *4. Explicit KeyUpdateAck message*
>>>>>>
>>>>>> We could have made a KeyUpdateAck message to signal that you've
>>>>>> processed a KeyUpdate, not just sent it. But that's a protocol
>>>>>> change and the RFC is stamped, so it's too late now.
>>>>>>
>>>>>> *5. Process KeyUpdate out of order*
>>>>>>
>>>>>> We could say that the receiver doesn't buffer KeyUpdate. It just
>>>>>> goes ahead and processes it immediately to install epoch N+1.
>>>>>> This seems like it would address the issue but opens more cans of
>>>>>> worms. Now the receiver needs to keep the old epoch around not
>>>>>> just for packet reordering, but also to pick up the
>>>>>> retransmissions of the missing handshake messages. Also, by
>>>>>> activating the new epoch, the receiver now allows the sender to
>>>>>> KeyUpdate again, and again, and again. But, several epochs later,
>>>>>> the holes in the message stream may remain unfilled, so we still
>>>>>> need the old keys. Without further protocol rules, a sender could
>>>>>> force the receiver to keep keys arbitrarily many records back.
>>>>>> All this is, at best, a difficult case that is unlikely to be
>>>>>> well-tested, and at worst gets the implementation into some
>>>>>> broken state that then misbehaves badly.
>>>>>>
>>>>>> *6. Post-handshake transactions aren't ordered at all*
>>>>>>
>>>>>> It could be that my assumption above was wrong and the
>>>>>> next_receive_seq discussion in 5.2 only applies to the handshake.
>>>>>> After all, section 5.8.4 discusses how every post-handshake
>>>>>> transaction duplicates the "state machine". Except it only says
>>>>>> to duplicate the 5.8.1 state machine, and it's ambiguous whether
>>>>>> that includes the message_seq logic.
>>>>>>
>>>>>> However, going this direction seems to very quickly make a mess.
>>>>>> If each post-handshake transaction handles message_seq
>>>>>> independently, you cannot distinguish a retransmission from a new
>>>>>> transaction. That seems quite bad, so presumably the intent was
>>>>>> to use message_seq to distinguish those. (I.e. the intent can't
>>>>>> have been to duplicate the message_seq state.) Indeed, we have:
>>>>>>
>>>>>> > However, in DTLS 1.3 the message_seq is not reset, to allow
>>>>>> > distinguishing a retransmission from a previously sent
>>>>>> > post-handshake message from a newly sent post-handshake
>>>>>> > message.
>>>>>>
>>>>>> https://www.rfc-editor.org/rfc/rfc9147.html#section-5.2-6
>>>>>>
>>>>>> But if we distinguish with message_seq AND process transactions
>>>>>> out of order, now receivers need to keep track of fairly complex
>>>>>> state in case they process messages 5, 7, 9, 11, 13, 15, 17, ...
>>>>>> but then only get the even ones later. And we'd need to define
>>>>>> some kind of sliding window for what happens if you receive
>>>>>> message_seq 9000 all of a sudden. And we import all the
>>>>>> cross-epoch problems in option 5 above. None of that is in the
>>>>>> text, so I assume this was not the intended reading, and I don't
>>>>>> think we want to go that direction. :-)
>>>>>>
>>>>>> *Digression: ACK fate-sharing and flow control*
>>>>>>
>>>>>> All this alludes to another quirk that isn't a problem, but is a
>>>>>> little non-obvious and warrants some discussion in the spec.
>>>>>> Multiple handshake fragments may be packed into the same record,
>>>>>> but ACKs apply to the whole record. If you receive a fragment for
>>>>>> a message sequence too far into the future, you are permitted to
>>>>>> discard the fragment. But if you discard *any* fragment, you
>>>>>> cannot ACK the record, *even if there were fragments which you
>>>>>> did process*. During the handshake, an implementation could avoid
>>>>>> needing to make this decision by knowing the maximum size of a
>>>>>> handshake flight. After the handshake, there is no inherent limit
>>>>>> on how many NewSessionTickets the peer may choose to send in a
>>>>>> row, and no flow control.
>>>>>>
>>>>>> QUIC ran into a similar issue here and said an implementation can
>>>>>> choose an ad-hoc limit, after which it can choose to either wedge
>>>>>> the post-handshake stream or return an error.
>>>>>>
>>>>>> https://github.com/quicwg/base-drafts/issues/1834
>>>>>> https://github.com/quicwg/base-drafts/pull/2524
>>>>>>
>>>>>> I suspect the most practical outcome for DTLS (and arguably
>>>>>> already supported by the existing text, but not very obviously),
>>>>>> is to instead say the receiver just refuses to ACK stuff and,
>>>>>> okay, maybe in some weird edge cases the receiver under-ACKs and
>>>>>> then the sender over-retransmits, until things settle down.
>>>>>> Whereas ACKs are a bit more tightly integrated with QUIC, so
>>>>>> refusing to ACK a packet due to one bad frame is less of an
>>>>>> option. Still, I think this would have been worth calling out in
>>>>>> the text.
>>>>>>
>>>>>> So... did I read all this right? Did we indeed make a mess of
>>>>>> this, or did I miss something?
>>>>>>
>>>>>> David
>>>>>>
>>>>>> _______________________________________________
>>>>> TLS mailing list
>>>>> TLS@ietf.org
>>>>> https://www.ietf.org/mailman/listinfo/tls
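
On the "misread it as N-3 instead of N+1" point from option 1 in the quoted message, a small worked example may help. It assumes the record header carries only the low two bits of the epoch and that the receiver applies the "most recent past epoch which has matching bits" rule from section 4.2.2; the function name `reconstruct_epoch` is made up for illustration.

```python
def reconstruct_epoch(current_epoch: int, epoch_bits: int) -> int:
    """Pick the most recent epoch <= current_epoch whose low two bits match.

    Hypothetical helper: assumes a 2-bit epoch field and a receiver that
    never guesses an epoch newer than the one it currently has installed.
    """
    return current_epoch - ((current_epoch - epoch_bits) % 4)


# Receiver is still on epoch N = 7; the sender has already moved to N + 1 = 8.
n = 7
bits_on_wire = (n + 1) & 0x3
guessed = reconstruct_epoch(n, bits_on_wire)
```

With these numbers the receiver resolves the record to epoch 4, which is N-3, so it would try keys from an epoch it may already have discarded and never consider N+1.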