Peter Xu <pet...@redhat.com> writes: > On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote: >> Peter Xu <pet...@redhat.com> writes: >> >> > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote: >> >> > In any case we'd still need some kind of a compatibility behavior for >> >> > the TLS bit stream emitted by older QEMU versions (which is always >> >> > improperly terminated). >> >> > >> >> >> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra >> >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on >> >> the destination and it gets stuck waiting for the >> >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always >> >> closes the connection before dst reaches the extra recv(). >> >> >> >> I test migration both ways with 2 previous QEMU versions and the >> >> gnutls_bye() series passes all tests. I also put an assert at >> >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The >> >> MULTIFD_FLAG_EOS should behave the same. >> > >> > Which are the versions you tried? As only 9.1 and 9.2 has 637280aeb2, so I >> > wonder if the same issue would hit too with 9.0 or older. >> >> Good point. 9.0 indeed breaks. >> >> > >> > I'd confess I feel unreliable relying on the side effect of 637280aeb2, >> > because fundamentally it works based on the fact that multifd threads need >> > to be kicked out by the main load thread SYNC event on dest QEMU to avoid >> > the readv() from going wrong. >> > >> >> We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is >> a bug that 1d457daf86 fixed. >> >> > What I'm not sure here is, is it sheer luck that the main channel SYNC will >> > always arrive _before_ pre-mature terminations of the multifd channels? It >> > sounds like it could also happen when the multifd channels got its >> > pre-mature termination early, before the main thread got the SYNC. >> >> You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that >> puts the recv thread in the "won't see the termination" state and that >> is serialized: >> >> SEND RECV >> -------------------------+---------------------------- >> 1 multifd_send_sync_main() >> 2 pending_sync==true, >> 3 send thread sends SYNC recv thread gets SYNC >> 4 <some work> recv gets stuck. >> 5 multifd_send_shutdown() <time passes> >> 6 shutdown() multifd_recv_shutdown() >> recv_terminate_threads() >> recv exits without recv() >> >> In other words, RECV would need to see the shutdown (6) before the SYNC >> (3), which I don't think it possible. > > Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot > to push the per-channel SYNC. I got it the other way round. Yeah if data > is always ordered with shutdown() effect on recv then it seems in order. > >> >> > >> > Maybe we still need a compat property at the end.. >> >> This is actually similar to preempt_pre_7_2, what about: >> >> /* >> * This variable only makes sense when set on the machine that is >> * the destination of a multifd migration with TLS enabled. It >> * affects the behavior of the last send->recv iteration with >> * regards to termination of the TLS session. Defaults to true. >> * >> * When set: >> * >> * - the destination QEMU instance can expect to never get a >> * GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error >> * message: "The TLS connection was non-properly terminated". >> * >> * When clear: >> * >> * - the destination QEMU instance can expect to see a >> * GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel >> * whenever the last recv() call of that channel happens after >> * the source QEMU instance has already issued shutdown() on the >> * channel. This is affected by (at least) commits 637280aeb2 >> * and 1d457daf86. > > If we want to reference them after all, we could use another sentence to > describe the effects: > > * Commit 637280aeb2 (since 9.1) introduced a side effect to cause > * pre-mature termination not happen, while commit 1d457daf86 > * (since 10.0) can unexpectedly re-expose the pre-mature > * termination issue. >
I'll add this. >> * >> * NOTE: Regardless of the state of this option, a premature >> * termination of the TLS connection might happen due to error at >> * any moment prior to the last send->recv iteration. >> */ >> bool multifd_clean_tls_termination; >> >> And I think the more straight-forward implementation is to incorporate >> Maciej's premature_ok patches (in some form), otherwise that option will >> have to take effect on the QIOChannel which is a layering violation. > > If we take Dan's comment into account: > > https://lore.kernel.org/r/z6i86e-hzjalx...@redhat.com > > It means whenever multifd recv thread invokes the iochannel API it will use > multifd_clean_tls_termination to decide QIO_CHANNEL_READ_RELAXED_EOF flag > to pass in. I hope this is not layer violation, or I could miss something.. Yes, we need that. > > So if we're on the same page we need that knob, to make this series easier > we could make it two steps: > > - Step 1: introduce the parameter and QIO_CHANNEL_READ_RELAXED_EOF, set > it default to false. > > - Step 2: Your other RFC series to implement gnutls_bye(), at last make > it a compat property and switch default true. > > Then Maciej only needs step 1, it looks to me. I'm sending everything as a v2 in a moment. We can cherry-pick from there.