Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Peter Xu writes: > On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote: >> Peter Xu writes: >> >> > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote: >> >> > In any case we'd still need some kind of a compatibility behavior for >> >> > the TLS bit stream emitted by older QEMU versions (which is always >> >> > improperly terminated). >> >> > >> >> >> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra >> >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on >> >> the destination and it gets stuck waiting for the >> >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always >> >> closes the connection before dst reaches the extra recv(). >> >> >> >> I test migration both ways with 2 previous QEMU versions and the >> >> gnutls_bye() series passes all tests. I also put an assert at >> >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The >> >> MULTIFD_FLAG_EOS should behave the same. >> > >> > Which are the versions you tried? As only 9.1 and 9.2 has 637280aeb2, so I >> > wonder if the same issue would hit too with 9.0 or older. >> >> Good point. 9.0 indeed breaks. >> >> > >> > I'd confess I feel unreliable relying on the side effect of 637280aeb2, >> > because fundamentally it works based on the fact that multifd threads need >> > to be kicked out by the main load thread SYNC event on dest QEMU to avoid >> > the readv() from going wrong. >> > >> >> We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is >> a bug that 1d457daf86 fixed. >> >> > What I'm not sure here is, is it sheer luck that the main channel SYNC will >> > always arrive _before_ pre-mature terminations of the multifd channels? It >> > sounds like it could also happen when the multifd channels got its >> > pre-mature termination early, before the main thread got the SYNC. >> >> You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that >> puts the recv thread in the "won't see the termination" state and that >> is serialized: >> >>SENDRECV >>-+ >> 1 multifd_send_sync_main() >> 2 pending_sync==true, >> 3 send thread sends SYNC recv thread gets SYNC >> 4 recv gets stuck. >> 5 multifd_send_shutdown() >> 6 shutdown() multifd_recv_shutdown() >>recv_terminate_threads() >>recv exits without recv() >> >> In other words, RECV would need to see the shutdown (6) before the SYNC >> (3), which I don't think it possible. > > Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot > to push the per-channel SYNC. I got it the other way round. Yeah if data > is always ordered with shutdown() effect on recv then it seems in order. > >> >> > >> > Maybe we still need a compat property at the end.. >> >> This is actually similar to preempt_pre_7_2, what about: >> >> /* >> * This variable only makes sense when set on the machine that is >> * the destination of a multifd migration with TLS enabled. It >> * affects the behavior of the last send->recv iteration with >> * regards to termination of the TLS session. Defaults to true. >> * >> * When set: >> * >> * - the destination QEMU instance can expect to never get a >> * GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error >> * message: "The TLS connection was non-properly terminated". >> * >> * When clear: >> * >> * - the destination QEMU instance can expect to see a >> * GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel >> * whenever the last recv() call of that channel happens after >> * the source QEMU instance has already issued shutdown() on the >> * channel. This is affected by (at least) commits 637280aeb2 >> * and 1d457daf86. > > If we want to reference them after all, we could use another sentence to > describe the effects: > >* Commit 637280aeb2 (since 9.1) introduced a side effect to cause >* pre-mature termination not happen, while commit 1d457daf86 >* (since 10.0) can unexpectedly re-expose the pre-mature >* termination issue. > I'll add this. >> * >> * NOTE: Regardless of the state of this option, a premature >> * termination of the TLS connection might happen due to error at >> * any moment prior to the last send->recv iteration. >> */ >> bool multifd_clean_tls_termination; >> >> And I think the more straight-forward implementation is to incorporate >> Maciej's premature_ok patches (in some form), otherwise that option will >> have to take effect on the QIOChannel which is a layering violation. > > If we take Dan's comment into account: > > https://lore.kernel.org/r/[email protected] > > It means whenever multifd recv thread invoke
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote: > Peter Xu writes: > > > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote: > >> > In any case we'd still need some kind of a compatibility behavior for > >> > the TLS bit stream emitted by older QEMU versions (which is always > >> > improperly terminated). > >> > > >> > >> There is no compat issue. For <= 9.2, QEMU is still doing an extra > >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on > >> the destination and it gets stuck waiting for the > >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always > >> closes the connection before dst reaches the extra recv(). > >> > >> I test migration both ways with 2 previous QEMU versions and the > >> gnutls_bye() series passes all tests. I also put an assert at > >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The > >> MULTIFD_FLAG_EOS should behave the same. > > > > Which are the versions you tried? As only 9.1 and 9.2 has 637280aeb2, so I > > wonder if the same issue would hit too with 9.0 or older. > > Good point. 9.0 indeed breaks. > > > > > I'd confess I feel unreliable relying on the side effect of 637280aeb2, > > because fundamentally it works based on the fact that multifd threads need > > to be kicked out by the main load thread SYNC event on dest QEMU to avoid > > the readv() from going wrong. > > > > We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is > a bug that 1d457daf86 fixed. > > > What I'm not sure here is, is it sheer luck that the main channel SYNC will > > always arrive _before_ pre-mature terminations of the multifd channels? It > > sounds like it could also happen when the multifd channels got its > > pre-mature termination early, before the main thread got the SYNC. > > You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that > puts the recv thread in the "won't see the termination" state and that > is serialized: > >SENDRECV >-+ > 1 multifd_send_sync_main() > 2 pending_sync==true, > 3 send thread sends SYNC recv thread gets SYNC > 4 recv gets stuck. > 5 multifd_send_shutdown() > 6 shutdown() multifd_recv_shutdown() >recv_terminate_threads() >recv exits without recv() > > In other words, RECV would need to see the shutdown (6) before the SYNC > (3), which I don't think it possible. Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot to push the per-channel SYNC. I got it the other way round. Yeah if data is always ordered with shutdown() effect on recv then it seems in order. > > > > > Maybe we still need a compat property at the end.. > > This is actually similar to preempt_pre_7_2, what about: > > /* > * This variable only makes sense when set on the machine that is > * the destination of a multifd migration with TLS enabled. It > * affects the behavior of the last send->recv iteration with > * regards to termination of the TLS session. Defaults to true. > * > * When set: > * > * - the destination QEMU instance can expect to never get a > * GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error > * message: "The TLS connection was non-properly terminated". > * > * When clear: > * > * - the destination QEMU instance can expect to see a > * GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel > * whenever the last recv() call of that channel happens after > * the source QEMU instance has already issued shutdown() on the > * channel. This is affected by (at least) commits 637280aeb2 > * and 1d457daf86. If we want to reference them after all, we could use another sentence to describe the effects: * Commit 637280aeb2 (since 9.1) introduced a side effect to cause * pre-mature termination not happen, while commit 1d457daf86 * (since 10.0) can unexpectedly re-expose the pre-mature * termination issue. > * > * NOTE: Regardless of the state of this option, a premature > * termination of the TLS connection might happen due to error at > * any moment prior to the last send->recv iteration. > */ > bool multifd_clean_tls_termination; > > And I think the more straight-forward implementation is to incorporate > Maciej's premature_ok patches (in some form), otherwise that option will > have to take effect on the QIOChannel which is a layering violation. If we take Dan's comment into account: https://lore.kernel.org/r/[email protected] It means whenever multifd recv thread invokes the iochannel API it will use multifd_clean_tls_termination to decide QIO_CHANNEL_READ_RELAXED_EOF flag to pass in. I hope this is not layer violation, or I co
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Peter Xu writes: > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote: >> > In any case we'd still need some kind of a compatibility behavior for >> > the TLS bit stream emitted by older QEMU versions (which is always >> > improperly terminated). >> > >> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on >> the destination and it gets stuck waiting for the >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always >> closes the connection before dst reaches the extra recv(). >> >> I test migration both ways with 2 previous QEMU versions and the >> gnutls_bye() series passes all tests. I also put an assert at >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The >> MULTIFD_FLAG_EOS should behave the same. > > Which are the versions you tried? As only 9.1 and 9.2 has 637280aeb2, so I > wonder if the same issue would hit too with 9.0 or older. Good point. 9.0 indeed breaks. > > I'd confess I feel unreliable relying on the side effect of 637280aeb2, > because fundamentally it works based on the fact that multifd threads need > to be kicked out by the main load thread SYNC event on dest QEMU to avoid > the readv() from going wrong. > We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is a bug that 1d457daf86 fixed. > What I'm not sure here is, is it sheer luck that the main channel SYNC will > always arrive _before_ pre-mature terminations of the multifd channels? It > sounds like it could also happen when the multifd channels got its > pre-mature termination early, before the main thread got the SYNC. You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that puts the recv thread in the "won't see the termination" state and that is serialized: SENDRECV -+ 1 multifd_send_sync_main() 2 pending_sync==true, 3 send thread sends SYNC recv thread gets SYNC 4 recv gets stuck. 5 multifd_send_shutdown() 6 shutdown() multifd_recv_shutdown() recv_terminate_threads() recv exits without recv() In other words, RECV would need to see the shutdown (6) before the SYNC (3), which I don't think it possible. > > Maybe we still need a compat property at the end.. This is actually similar to preempt_pre_7_2, what about: /* * This variable only makes sense when set on the machine that is * the destination of a multifd migration with TLS enabled. It * affects the behavior of the last send->recv iteration with * regards to termination of the TLS session. Defaults to true. * * When set: * * - the destination QEMU instance can expect to never get a * GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error * message: "The TLS connection was non-properly terminated". * * When clear: * * - the destination QEMU instance can expect to see a * GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel * whenever the last recv() call of that channel happens after * the source QEMU instance has already issued shutdown() on the * channel. This is affected by (at least) commits 637280aeb2 * and 1d457daf86. * * NOTE: Regardless of the state of this option, a premature * termination of the TLS connection might happen due to error at * any moment prior to the last send->recv iteration. */ bool multifd_clean_tls_termination; And I think the more straight-forward implementation is to incorporate Maciej's premature_ok patches (in some form), otherwise that option will have to take effect on the QIOChannel which is a layering violation.
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote: > > In any case we'd still need some kind of a compatibility behavior for > > the TLS bit stream emitted by older QEMU versions (which is always > > improperly terminated). > > > > There is no compat issue. For <= 9.2, QEMU is still doing an extra > multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on > the destination and it gets stuck waiting for the > RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always > closes the connection before dst reaches the extra recv(). > > I test migration both ways with 2 previous QEMU versions and the > gnutls_bye() series passes all tests. I also put an assert at > tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The > MULTIFD_FLAG_EOS should behave the same. Which are the versions you tried? As only 9.1 and 9.2 has 637280aeb2, so I wonder if the same issue would hit too with 9.0 or older. I'd confess I feel unreliable relying on the side effect of 637280aeb2, because fundamentally it works based on the fact that multifd threads need to be kicked out by the main load thread SYNC event on dest QEMU to avoid the readv() from going wrong. What I'm not sure here is, is it sheer luck that the main channel SYNC will always arrive _before_ pre-mature terminations of the multifd channels? It sounds like it could also happen when the multifd channels got its pre-mature termination early, before the main thread got the SYNC. Maybe we still need a compat property at the end.. -- Peter Xu
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 6.02.2025 18:32, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: On 6.02.2025 16:20, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: (..) Even if non-TLS behaved the same, why would we choose to port a bug to the TLS implementation? We could of course decide at this point to punt the problem forward and when it shows up, we'd have to go implement gnutls_bye() to allow the distinction between good-EOF/bad-EOF or maybe add an extra sync at the end of migration to make sure the last recv() call is only started after the source has already shutdown() the channel. If we do some kind of a premature EOF detection then it should probably work for the non-TLS case too (since that's probably by far the most common use case). So adding some MULTIFD_FLAG_EOS would make the most sense and would work even with QIO_CHANNEL_READ_RELAXED_EOF being set. Indeed, if MULTIFD_FLAG_EOS is not seen, the recv thread could treat any EOF as an error. The question is whether we can add that without disrupting multifd synchronization too much. In any case we'd still need some kind of a compatibility behavior for the TLS bit stream emitted by older QEMU versions (which is always improperly terminated). There is no compat issue. For <= 9.2, QEMU is still doing an extra multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on the destination and it gets stuck waiting for the RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always closes the connection before dst reaches the extra recv(). I test migration both ways with 2 previous QEMU versions and the gnutls_bye() series passes all tests. I also put an assert at tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The MULTIFD_FLAG_EOS should behave the same. If you are confident that properly terminating the TLS session by adding gnutls_bye() is the way to go then I'm fine with this - I hope @Peter and @Daniel are too. It's now only matter of how soon you can have these gnutls_bye() patches posted/merged since if I drop the premature_ok stuff the updated series will depend on them for passing the TLS tests. Thanks, Maciej
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
"Maciej S. Szmigiero" writes: > On 6.02.2025 16:20, Fabiano Rosas wrote: >> "Maciej S. Szmigiero" writes: >> >>> On 6.02.2025 15:13, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: > On 5.02.2025 21:42, Fabiano Rosas wrote: >> Fabiano Rosas writes: >> >>> Daniel P. Berrangé writes: >>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: > On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: >> On 3.02.2025 23:56, Peter Xu wrote: >>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero > wrote: >> On 3.02.2025 19:20, Peter Xu wrote: >>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero >>> wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. >>> >>> If this is an existing bug, we could add a Fixes. >> >> It is an existing issue but only uncovered by this patch set. >> >> As far as I can see it was always there, so it would need some >> thought where to point that Fixes tag. > > If there's no way to trigger a real functional bug anyway, it's > also ok we > omit the Fixes. > >>> Two pure questions.. >>> >>> - What is the correct way to terminate the TLS session >>> without this flag? >> >> I guess one would need to call gnutls_bye() like in this GnuTLS >> example: >> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >> >>> - Why this is only needed by multifd sessions? >> >> What uncovered the issue was switching the load threads to using >> migrate_set_error() instead of their own result variable >> (load_threads_ret) which you had requested during the previous >> patch set version review: >> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >> >> Turns out that the multifd receive code always returned >> error in the TLS case, just nothing was previously checking for >> that error presence. > > What I was curious is whether this issue also exists for the main > migration > channel when with tls, especially when e.g. multifd not enabled > at all. As > I don't see anywhere that qemu uses gnutls_bye() for any tls > session. > > I think it's a good to find that we overlooked this before.. and > IMHO it's > always good we could fix this. > > Does it mean we need proper gnutls_bye() somewhere? > > If we need an explicit gnutls_bye(), then I wonder if that should > be done > on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explic
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 6.02.2025 16:20, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: On 6.02.2025 15:13, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: On 5.02.2025 21:42, Fabiano Rosas wrote: Fabiano Rosas writes: Daniel P. Berrangé writes: On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 23:56, Peter Xu wrote: On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. I had a closer look, I do feel like such pre-mature termination is caused by explicit shutdown()s of the iochannels, looks like that can cause issue even after everything is sent. Then I noticed indeed multifd sender iochannels will get explicit shutdown()s since commit 077fbb5942, while we don't do that for the main channel. Maybe that is a major difference. Now I wonder whether we should shutdown() the channel at all if migration succeeded, because looks like it can cause tls session to interrupt even if the shutdown() is done after sent everything, and if so it'll explain why you hit the issue with tls. Then I can't see anything else reading the channel until it is closed in migration_incoming_state_destroy(). So most likely the main migration channel will never read far enough to reach that GNUTLS_E_PREMATURE_TERMINATION error. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? So basically have this patch extended to calling qio_channel_tls_set_premature_eof_okay() also on the main migration channel? If above theory can stand, then eof-okay could be a workaround papering over the real problem that we shouldn't always shutdown().. Could you have a look at below patch and see whether it can fix the problem you hit too, in replace of these two patches (including the p
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
"Maciej S. Szmigiero" writes: > On 6.02.2025 15:13, Fabiano Rosas wrote: >> "Maciej S. Szmigiero" writes: >> >>> On 5.02.2025 21:42, Fabiano Rosas wrote: Fabiano Rosas writes: > Daniel P. Berrangé writes: > >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 23:56, Peter Xu wrote: > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >> On 3.02.2025 21:20, Peter Xu wrote: >>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero > wrote: >> From: "Maciej S. Szmigiero" >> >> Multifd send channels are terminated by calling >> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >> multifd_send_terminate_threads(), which in the TLS case >> essentially >> calls shutdown(SHUT_RDWR) on the underlying raw socket. >> >> Unfortunately, this does not terminate the TLS session properly >> and >> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION >> error. >> >> The only reason why this wasn't causing migration failures is >> because >> the current migration code apparently does not check for >> migration >> error being set after the end of the multifd receive process. >> >> However, this will change soon so the multifd receive code has >> to be >> prepared to not return an error on such premature TLS session >> EOF. >> Use the newly introduced QIOChannelTLS method for that. >> >> It's worth noting that even if the sender were to be changed to >> terminate >> the TLS connection properly the receive side still needs to >> remain >> compatible with older QEMU bit stream which does not do this. > > If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. >>> >>> If there's no way to trigger a real functional bug anyway, it's >>> also ok we >>> omit the Fixes. >>> > Two pure questions.. > >- What is the correct way to terminate the TLS session > without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >- Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. >>> >>> What I was curious is whether this issue also exists for the main >>> migration >>> channel when with tls, especially when e.g. multifd not enabled at >>> all. As >>> I don't see anywhere that qemu uses gnutls_bye() for any tls >>> session. >>> >>> I think it's a good to find that we overlooked this before.. and >>> IMHO it's >>> always good we could fix this. >>> >>> Does it mean we need proper gnutls_bye() somewhere? >>> >>> If we need an explicit gnutls_bye(), then I wonder if that should >>> be done >>> on the main channel as well. >> >> That's a good question and looking at the code >> qemu_loadvm_state_main() exits >> on receiving "QEMU_VM_EOF" section (that's different from receiving >> socket EOF) >> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with >> explicit size >> in qemu_loadvm_state() - so still not until channel EOF. > > I had a closer look, I do feel like such pre-mature termination is > caused > by explicit shutdown()s of the iochannels, looks like that can cause > issue > even after every
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 6.02.2025 15:13, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: On 5.02.2025 21:42, Fabiano Rosas wrote: Fabiano Rosas writes: Daniel P. Berrangé writes: On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 23:56, Peter Xu wrote: On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. I had a closer look, I do feel like such pre-mature termination is caused by explicit shutdown()s of the iochannels, looks like that can cause issue even after everything is sent. Then I noticed indeed multifd sender iochannels will get explicit shutdown()s since commit 077fbb5942, while we don't do that for the main channel. Maybe that is a major difference. Now I wonder whether we should shutdown() the channel at all if migration succeeded, because looks like it can cause tls session to interrupt even if the shutdown() is done after sent everything, and if so it'll explain why you hit the issue with tls. Then I can't see anything else reading the channel until it is closed in migration_incoming_state_destroy(). So most likely the main migration channel will never read far enough to reach that GNUTLS_E_PREMATURE_TERMINATION error. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? So basically have this patch extended to calling qio_channel_tls_set_premature_eof_okay() also on the main migration channel? If above theory can stand, then eof-okay could be a workaround papering over the real problem that we shouldn't always shutdown().. Could you have a look at below patch and see whether it can fix the problem you hit too, in replace of these two patches (including the previous iochannel change)? Unfortunately, the patch below does not fix the
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
"Maciej S. Szmigiero" writes: > On 5.02.2025 21:42, Fabiano Rosas wrote: >> Fabiano Rosas writes: >> >>> Daniel P. Berrangé writes: >>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: > On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: >> On 3.02.2025 23:56, Peter Xu wrote: >>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: >> On 3.02.2025 19:20, Peter Xu wrote: >>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. >>> >>> If this is an existing bug, we could add a Fixes. >> >> It is an existing issue but only uncovered by this patch set. >> >> As far as I can see it was always there, so it would need some >> thought where to point that Fixes tag. > > If there's no way to trigger a real functional bug anyway, it's also > ok we > omit the Fixes. > >>> Two pure questions.. >>> >>> - What is the correct way to terminate the TLS session >>> without this flag? >> >> I guess one would need to call gnutls_bye() like in this GnuTLS >> example: >> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >> >>> - Why this is only needed by multifd sessions? >> >> What uncovered the issue was switching the load threads to using >> migrate_set_error() instead of their own result variable >> (load_threads_ret) which you had requested during the previous >> patch set version review: >> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >> >> Turns out that the multifd receive code always returned >> error in the TLS case, just nothing was previously checking for >> that error presence. > > What I was curious is whether this issue also exists for the main > migration > channel when with tls, especially when e.g. multifd not enabled at > all. As > I don't see anywhere that qemu uses gnutls_bye() for any tls session. > > I think it's a good to find that we overlooked this before.. and IMHO > it's > always good we could fix this. > > Does it mean we need proper gnutls_bye() somewhere? > > If we need an explicit gnutls_bye(), then I wonder if that should be > done > on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. >>> >>> I had a closer look, I do feel like such pre-mature termination is >>> caused >>> by explicit shutdown()s of the iochannels, looks like that can cause >>> issue >>> even after everything is sent. Then I noticed indeed multifd sender >>> iochannels will get explicit shutdown()s since commit 077fbb5942, while >>> we >>> don't do that for the main channel. Maybe that is a major difference. >>> >>> Now I wonder whether we should shutdown() the channel at all if >>> migration >>> succeeded, because looks like it can cause tls session to i
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Peter Xu writes: > On Wed, Feb 05, 2025 at 05:42:37PM -0300, Fabiano Rosas wrote: >> Fabiano Rosas writes: >> >> > Daniel P. Berrangé writes: >> > >> >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: >> >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: >> >>> > On 3.02.2025 23:56, Peter Xu wrote: >> >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >> >>> > > > On 3.02.2025 21:20, Peter Xu wrote: >> >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero >> >>> > > > > wrote: >> >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote: >> >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. >> >>> > > > > > > Szmigiero wrote: >> >>> > > > > > > > From: "Maciej S. Szmigiero" >> >>> > > > > > > > >> >>> > > > > > > > Multifd send channels are terminated by calling >> >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >> >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case >> >>> > > > > > > > essentially >> >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. >> >>> > > > > > > > >> >>> > > > > > > > Unfortunately, this does not terminate the TLS session >> >>> > > > > > > > properly and >> >>> > > > > > > > the receive side sees this as a >> >>> > > > > > > > GNUTLS_E_PREMATURE_TERMINATION error. >> >>> > > > > > > > >> >>> > > > > > > > The only reason why this wasn't causing migration failures >> >>> > > > > > > > is because >> >>> > > > > > > > the current migration code apparently does not check for >> >>> > > > > > > > migration >> >>> > > > > > > > error being set after the end of the multifd receive >> >>> > > > > > > > process. >> >>> > > > > > > > >> >>> > > > > > > > However, this will change soon so the multifd receive code >> >>> > > > > > > > has to be >> >>> > > > > > > > prepared to not return an error on such premature TLS >> >>> > > > > > > > session EOF. >> >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that. >> >>> > > > > > > > >> >>> > > > > > > > It's worth noting that even if the sender were to be >> >>> > > > > > > > changed to terminate >> >>> > > > > > > > the TLS connection properly the receive side still needs >> >>> > > > > > > > to remain >> >>> > > > > > > > compatible with older QEMU bit stream which does not do >> >>> > > > > > > > this. >> >>> > > > > > > >> >>> > > > > > > If this is an existing bug, we could add a Fixes. >> >>> > > > > > >> >>> > > > > > It is an existing issue but only uncovered by this patch set. >> >>> > > > > > >> >>> > > > > > As far as I can see it was always there, so it would need some >> >>> > > > > > thought where to point that Fixes tag. >> >>> > > > > >> >>> > > > > If there's no way to trigger a real functional bug anyway, it's >> >>> > > > > also ok we >> >>> > > > > omit the Fixes. >> >>> > > > > >> >>> > > > > > > Two pure questions.. >> >>> > > > > > > >> >>> > > > > > > - What is the correct way to terminate the TLS session >> >>> > > > > > > without this flag? >> >>> > > > > > >> >>> > > > > > I guess one would need to call gnutls_bye() like in this >> >>> > > > > > GnuTLS example: >> >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >> >>> > > > > > >> >>> > > > > > > - Why this is only needed by multifd sessions? >> >>> > > > > > >> >>> > > > > > What uncovered the issue was switching the load threads to >> >>> > > > > > using >> >>> > > > > > migrate_set_error() instead of their own result variable >> >>> > > > > > (load_threads_ret) which you had requested during the previous >> >>> > > > > > patch set version review: >> >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >> >>> > > > > > >> >>> > > > > > Turns out that the multifd receive code always returned >> >>> > > > > > error in the TLS case, just nothing was previously checking for >> >>> > > > > > that error presence. >> >>> > > > > >> >>> > > > > What I was curious is whether this issue also exists for the >> >>> > > > > main migration >> >>> > > > > channel when with tls, especially when e.g. multifd not enabled >> >>> > > > > at all. As >> >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls >> >>> > > > > session. >> >>> > > > > >> >>> > > > > I think it's a good to find that we overlooked this before.. and >> >>> > > > > IMHO it's >> >>> > > > > always good we could fix this. >> >>> > > > > >> >>> > > > > Does it mean we need proper gnutls_bye() somewhere? >> >>> > > > > >> >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that >> >>> > > > > should be done >> >>> > > > > on the main channel as well. >> >>> > > > >> >>> > > > That's a good question and looking at the code >> >>> > > > qemu_loadvm_state_main() exits >> >>> > > > on receiving "QEMU_VM_EOF" section (that's different from >> >>> > > > receiving socke
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Wed, Feb 05, 2025 at 05:42:37PM -0300, Fabiano Rosas wrote: > Fabiano Rosas writes: > > > Daniel P. Berrangé writes: > > > >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: > >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: > >>> > On 3.02.2025 23:56, Peter Xu wrote: > >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: > >>> > > > On 3.02.2025 21:20, Peter Xu wrote: > >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero > >>> > > > > wrote: > >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote: > >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero > >>> > > > > > > wrote: > >>> > > > > > > > From: "Maciej S. Szmigiero" > >>> > > > > > > > > >>> > > > > > > > Multifd send channels are terminated by calling > >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case > >>> > > > > > > > essentially > >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > >>> > > > > > > > > >>> > > > > > > > Unfortunately, this does not terminate the TLS session > >>> > > > > > > > properly and > >>> > > > > > > > the receive side sees this as a > >>> > > > > > > > GNUTLS_E_PREMATURE_TERMINATION error. > >>> > > > > > > > > >>> > > > > > > > The only reason why this wasn't causing migration failures > >>> > > > > > > > is because > >>> > > > > > > > the current migration code apparently does not check for > >>> > > > > > > > migration > >>> > > > > > > > error being set after the end of the multifd receive > >>> > > > > > > > process. > >>> > > > > > > > > >>> > > > > > > > However, this will change soon so the multifd receive code > >>> > > > > > > > has to be > >>> > > > > > > > prepared to not return an error on such premature TLS > >>> > > > > > > > session EOF. > >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that. > >>> > > > > > > > > >>> > > > > > > > It's worth noting that even if the sender were to be > >>> > > > > > > > changed to terminate > >>> > > > > > > > the TLS connection properly the receive side still needs to > >>> > > > > > > > remain > >>> > > > > > > > compatible with older QEMU bit stream which does not do > >>> > > > > > > > this. > >>> > > > > > > > >>> > > > > > > If this is an existing bug, we could add a Fixes. > >>> > > > > > > >>> > > > > > It is an existing issue but only uncovered by this patch set. > >>> > > > > > > >>> > > > > > As far as I can see it was always there, so it would need some > >>> > > > > > thought where to point that Fixes tag. > >>> > > > > > >>> > > > > If there's no way to trigger a real functional bug anyway, it's > >>> > > > > also ok we > >>> > > > > omit the Fixes. > >>> > > > > > >>> > > > > > > Two pure questions.. > >>> > > > > > > > >>> > > > > > > - What is the correct way to terminate the TLS session > >>> > > > > > > without this flag? > >>> > > > > > > >>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS > >>> > > > > > example: > >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > >>> > > > > > > >>> > > > > > > - Why this is only needed by multifd sessions? > >>> > > > > > > >>> > > > > > What uncovered the issue was switching the load threads to using > >>> > > > > > migrate_set_error() instead of their own result variable > >>> > > > > > (load_threads_ret) which you had requested during the previous > >>> > > > > > patch set version review: > >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > >>> > > > > > > >>> > > > > > Turns out that the multifd receive code always returned > >>> > > > > > error in the TLS case, just nothing was previously checking for > >>> > > > > > that error presence. > >>> > > > > > >>> > > > > What I was curious is whether this issue also exists for the main > >>> > > > > migration > >>> > > > > channel when with tls, especially when e.g. multifd not enabled > >>> > > > > at all. As > >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls > >>> > > > > session. > >>> > > > > > >>> > > > > I think it's a good to find that we overlooked this before.. and > >>> > > > > IMHO it's > >>> > > > > always good we could fix this. > >>> > > > > > >>> > > > > Does it mean we need proper gnutls_bye() somewhere? > >>> > > > > > >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should > >>> > > > > be done > >>> > > > > on the main channel as well. > >>> > > > > >>> > > > That's a good question and looking at the code > >>> > > > qemu_loadvm_state_main() exits > >>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving > >>> > > > socket EOF) > >>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with > >>> > > > explicit size > >>> > > > in qemu_loadvm_s
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 5.02.2025 21:42, Fabiano Rosas wrote: Fabiano Rosas writes: Daniel P. Berrangé writes: On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 23:56, Peter Xu wrote: On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. I had a closer look, I do feel like such pre-mature termination is caused by explicit shutdown()s of the iochannels, looks like that can cause issue even after everything is sent. Then I noticed indeed multifd sender iochannels will get explicit shutdown()s since commit 077fbb5942, while we don't do that for the main channel. Maybe that is a major difference. Now I wonder whether we should shutdown() the channel at all if migration succeeded, because looks like it can cause tls session to interrupt even if the shutdown() is done after sent everything, and if so it'll explain why you hit the issue with tls. Then I can't see anything else reading the channel until it is closed in migration_incoming_state_destroy(). So most likely the main migration channel will never read far enough to reach that GNUTLS_E_PREMATURE_TERMINATION error. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? So basically have this patch extended to calling qio_channel_tls_set_premature_eof_okay() also on the main migration channel? If above theory can stand, then eof-okay could be a workaround papering over the real problem that we shouldn't always shutdown().. Could you have a look at below patch and see whether it can fix the problem you hit too, in replace of these two patches (including the previous iochannel change)? Unfortunately, the patch below does not fix the problem: qemu-system-x86_64: Cannot read from TLS channel: The TLS connectio
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Fabiano Rosas writes: > Daniel P. Berrangé writes: > >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: >>> > On 3.02.2025 23:56, Peter Xu wrote: >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >>> > > > On 3.02.2025 21:20, Peter Xu wrote: >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote: >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero >>> > > > > > > wrote: >>> > > > > > > > From: "Maciej S. Szmigiero" >>> > > > > > > > >>> > > > > > > > Multifd send channels are terminated by calling >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case >>> > > > > > > > essentially >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. >>> > > > > > > > >>> > > > > > > > Unfortunately, this does not terminate the TLS session >>> > > > > > > > properly and >>> > > > > > > > the receive side sees this as a >>> > > > > > > > GNUTLS_E_PREMATURE_TERMINATION error. >>> > > > > > > > >>> > > > > > > > The only reason why this wasn't causing migration failures is >>> > > > > > > > because >>> > > > > > > > the current migration code apparently does not check for >>> > > > > > > > migration >>> > > > > > > > error being set after the end of the multifd receive process. >>> > > > > > > > >>> > > > > > > > However, this will change soon so the multifd receive code >>> > > > > > > > has to be >>> > > > > > > > prepared to not return an error on such premature TLS session >>> > > > > > > > EOF. >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that. >>> > > > > > > > >>> > > > > > > > It's worth noting that even if the sender were to be changed >>> > > > > > > > to terminate >>> > > > > > > > the TLS connection properly the receive side still needs to >>> > > > > > > > remain >>> > > > > > > > compatible with older QEMU bit stream which does not do this. >>> > > > > > > >>> > > > > > > If this is an existing bug, we could add a Fixes. >>> > > > > > >>> > > > > > It is an existing issue but only uncovered by this patch set. >>> > > > > > >>> > > > > > As far as I can see it was always there, so it would need some >>> > > > > > thought where to point that Fixes tag. >>> > > > > >>> > > > > If there's no way to trigger a real functional bug anyway, it's >>> > > > > also ok we >>> > > > > omit the Fixes. >>> > > > > >>> > > > > > > Two pure questions.. >>> > > > > > > >>> > > > > > > - What is the correct way to terminate the TLS session >>> > > > > > > without this flag? >>> > > > > > >>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS >>> > > > > > example: >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >>> > > > > > >>> > > > > > > - Why this is only needed by multifd sessions? >>> > > > > > >>> > > > > > What uncovered the issue was switching the load threads to using >>> > > > > > migrate_set_error() instead of their own result variable >>> > > > > > (load_threads_ret) which you had requested during the previous >>> > > > > > patch set version review: >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >>> > > > > > >>> > > > > > Turns out that the multifd receive code always returned >>> > > > > > error in the TLS case, just nothing was previously checking for >>> > > > > > that error presence. >>> > > > > >>> > > > > What I was curious is whether this issue also exists for the main >>> > > > > migration >>> > > > > channel when with tls, especially when e.g. multifd not enabled at >>> > > > > all. As >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls >>> > > > > session. >>> > > > > >>> > > > > I think it's a good to find that we overlooked this before.. and >>> > > > > IMHO it's >>> > > > > always good we could fix this. >>> > > > > >>> > > > > Does it mean we need proper gnutls_bye() somewhere? >>> > > > > >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should >>> > > > > be done >>> > > > > on the main channel as well. >>> > > > >>> > > > That's a good question and looking at the code >>> > > > qemu_loadvm_state_main() exits >>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving >>> > > > socket EOF) >>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with >>> > > > explicit size >>> > > > in qemu_loadvm_state() - so still not until channel EOF. >>> > > >>> > > I had a closer look, I do feel like such pre-mature termination is >>> > > caused >>> > > by explicit shutdown()s of the iochannels, looks like that can cause >>> > > issue >>> > > even after everything is sent. Then I noticed indeed multifd sender >>> > > io
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Daniel P. Berrangé writes: > On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: >> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: >> > On 3.02.2025 23:56, Peter Xu wrote: >> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >> > > > On 3.02.2025 21:20, Peter Xu wrote: >> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: >> > > > > > On 3.02.2025 19:20, Peter Xu wrote: >> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero >> > > > > > > wrote: >> > > > > > > > From: "Maciej S. Szmigiero" >> > > > > > > > >> > > > > > > > Multifd send channels are terminated by calling >> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >> > > > > > > > multifd_send_terminate_threads(), which in the TLS case >> > > > > > > > essentially >> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. >> > > > > > > > >> > > > > > > > Unfortunately, this does not terminate the TLS session >> > > > > > > > properly and >> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION >> > > > > > > > error. >> > > > > > > > >> > > > > > > > The only reason why this wasn't causing migration failures is >> > > > > > > > because >> > > > > > > > the current migration code apparently does not check for >> > > > > > > > migration >> > > > > > > > error being set after the end of the multifd receive process. >> > > > > > > > >> > > > > > > > However, this will change soon so the multifd receive code has >> > > > > > > > to be >> > > > > > > > prepared to not return an error on such premature TLS session >> > > > > > > > EOF. >> > > > > > > > Use the newly introduced QIOChannelTLS method for that. >> > > > > > > > >> > > > > > > > It's worth noting that even if the sender were to be changed >> > > > > > > > to terminate >> > > > > > > > the TLS connection properly the receive side still needs to >> > > > > > > > remain >> > > > > > > > compatible with older QEMU bit stream which does not do this. >> > > > > > > >> > > > > > > If this is an existing bug, we could add a Fixes. >> > > > > > >> > > > > > It is an existing issue but only uncovered by this patch set. >> > > > > > >> > > > > > As far as I can see it was always there, so it would need some >> > > > > > thought where to point that Fixes tag. >> > > > > >> > > > > If there's no way to trigger a real functional bug anyway, it's also >> > > > > ok we >> > > > > omit the Fixes. >> > > > > >> > > > > > > Two pure questions.. >> > > > > > > >> > > > > > > - What is the correct way to terminate the TLS session >> > > > > > > without this flag? >> > > > > > >> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS >> > > > > > example: >> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >> > > > > > >> > > > > > > - Why this is only needed by multifd sessions? >> > > > > > >> > > > > > What uncovered the issue was switching the load threads to using >> > > > > > migrate_set_error() instead of their own result variable >> > > > > > (load_threads_ret) which you had requested during the previous >> > > > > > patch set version review: >> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >> > > > > > >> > > > > > Turns out that the multifd receive code always returned >> > > > > > error in the TLS case, just nothing was previously checking for >> > > > > > that error presence. >> > > > > >> > > > > What I was curious is whether this issue also exists for the main >> > > > > migration >> > > > > channel when with tls, especially when e.g. multifd not enabled at >> > > > > all. As >> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session. >> > > > > >> > > > > I think it's a good to find that we overlooked this before.. and >> > > > > IMHO it's >> > > > > always good we could fix this. >> > > > > >> > > > > Does it mean we need proper gnutls_bye() somewhere? >> > > > > >> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be >> > > > > done >> > > > > on the main channel as well. >> > > > >> > > > That's a good question and looking at the code >> > > > qemu_loadvm_state_main() exits >> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving >> > > > socket EOF) >> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with >> > > > explicit size >> > > > in qemu_loadvm_state() - so still not until channel EOF. >> > > >> > > I had a closer look, I do feel like such pre-mature termination is caused >> > > by explicit shutdown()s of the iochannels, looks like that can cause >> > > issue >> > > even after everything is sent. Then I noticed indeed multifd sender >> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while >> > > we >> > > don't do that for the main channel. Maybe that is a major dif
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 4.02.2025 19:25, Fabiano Rosas wrote: Peter Xu writes: On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote: On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? - Why this is only needed by multifd sessions? Graceful TLS termination (via gnutls_bye()) should only be important to security if the QEMU protocol in question does not know how much data it is expecting to recieve. ie it cannot otherwise distinguish between an expected EOF, and a premature EOF triggered by an attacker. If the migration protocol has sufficient info to know when a chanel is expected to see EOF, then we should stop trying to read from the TLS channel before seeing the underlying EOF. Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that migration will still fail corretly in the case of a malicious attack causing premature termination. If there's a risk that migration may succeed, but with incomplete data, then we would need the full gnutls_bye dance. IIUC that's not required for migration then, because migration should know exactly how much data to receive, and migration should need to verify that and fail if the received data didn't match the expectation along the way. We also have QEMU_VM_EOF as the end mark of stream. The migration overall can detect whether EOF should have been reached, but multifd threads cannot. If one multifd channel experiences an issue and sees a premature termination, but ignores it, then that's a hang in QEMU because nothing provided the syncs needed (p->sem_sync, most likely). I think allowing premature TLS termination simply makes the TLS case function the same as non-TLS case here if the source were to close the multifd channel(s) early. Aren't we just postponing a bug? Thanks, Maciej
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Peter Xu writes: > On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote: >> On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote: >> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: >> > > From: "Maciej S. Szmigiero" >> > > >> > > Multifd send channels are terminated by calling >> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >> > > multifd_send_terminate_threads(), which in the TLS case essentially >> > > calls shutdown(SHUT_RDWR) on the underlying raw socket. >> > > >> > > Unfortunately, this does not terminate the TLS session properly and >> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. >> > > >> > > The only reason why this wasn't causing migration failures is because >> > > the current migration code apparently does not check for migration >> > > error being set after the end of the multifd receive process. >> > > >> > > However, this will change soon so the multifd receive code has to be >> > > prepared to not return an error on such premature TLS session EOF. >> > > Use the newly introduced QIOChannelTLS method for that. >> > > >> > > It's worth noting that even if the sender were to be changed to terminate >> > > the TLS connection properly the receive side still needs to remain >> > > compatible with older QEMU bit stream which does not do this. >> > >> > If this is an existing bug, we could add a Fixes. >> > >> > Two pure questions.. >> > >> > - What is the correct way to terminate the TLS session without this flag? >> > >> > - Why this is only needed by multifd sessions? >> >> Graceful TLS termination (via gnutls_bye()) should only be important to >> security if the QEMU protocol in question does not know how much data it >> is expecting to recieve. ie it cannot otherwise distinguish between an >> expected EOF, and a premature EOF triggered by an attacker. >> >> If the migration protocol has sufficient info to know when a chanel is >> expected to see EOF, then we should stop trying to read from the TLS >> channel before seeing the underlying EOF. >> >> Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that >> migration will still fail corretly in the case of a malicious attack >> causing premature termination. >> >> If there's a risk that migration may succeed, but with incomplete data, >> then we would need the full gnutls_bye dance. > > IIUC that's not required for migration then, because migration should know > exactly how much data to receive, and migration should need to verify that > and fail if the received data didn't match the expectation along the way. > We also have QEMU_VM_EOF as the end mark of stream. The migration overall can detect whether EOF should have been reached, but multifd threads cannot. If one multifd channel experiences an issue and sees a premature termination, but ignores it, then that's a hang in QEMU because nothing provided the syncs needed (p->sem_sync, most likely). Aren't we just postponing a bug? > > Said that, are we sure any pre-mature termination will only happen after > all data read in the receive buffer that was sent? > > To ask in another way: what happens if the source QEMU sends everything and > shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest > data to read, and (2) a pre-mature terminatino of TLS session in a read() > syscall. Would (2) be reported even before (1), or the order guaranteed > that read of the residue data in (1) always happen before (2) (considering > dest QEMU can be slow sometime on consuming the network buffers)? > > Thanks,
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Tue, Feb 04, 2025 at 04:12:15PM +, Daniel P. Berrangé wrote: > On Tue, Feb 04, 2025 at 11:02:28AM -0500, Peter Xu wrote: > > On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote: > > > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote: > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > > > From: "Maciej S. Szmigiero" > > > > > > > > > > Multifd send channels are terminated by calling > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > > multifd_send_terminate_threads(), which in the TLS case essentially > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > > > Unfortunately, this does not terminate the TLS session properly and > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > > > > > The only reason why this wasn't causing migration failures is because > > > > > the current migration code apparently does not check for migration > > > > > error being set after the end of the multifd receive process. > > > > > > > > > > However, this will change soon so the multifd receive code has to be > > > > > prepared to not return an error on such premature TLS session EOF. > > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > > terminate > > > > > the TLS connection properly the receive side still needs to remain > > > > > compatible with older QEMU bit stream which does not do this. > > > > > > > > If this is an existing bug, we could add a Fixes. > > > > > > > > Two pure questions.. > > > > > > > > - What is the correct way to terminate the TLS session without this > > > > flag? > > > > > > > > - Why this is only needed by multifd sessions? > > > > > > Graceful TLS termination (via gnutls_bye()) should only be important to > > > security if the QEMU protocol in question does not know how much data it > > > is expecting to recieve. ie it cannot otherwise distinguish between an > > > expected EOF, and a premature EOF triggered by an attacker. > > > > > > If the migration protocol has sufficient info to know when a chanel is > > > expected to see EOF, then we should stop trying to read from the TLS > > > channel before seeing the underlying EOF. > > > > > > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that > > > migration will still fail corretly in the case of a malicious attack > > > causing premature termination. > > > > > > If there's a risk that migration may succeed, but with incomplete data, > > > then we would need the full gnutls_bye dance. > > > > IIUC that's not required for migration then, because migration should know > > exactly how much data to receive, and migration should need to verify that > > and fail if the received data didn't match the expectation along the way. > > We also have QEMU_VM_EOF as the end mark of stream. > > > > Said that, are we sure any pre-mature termination will only happen after > > all data read in the receive buffer that was sent? > > > > To ask in another way: what happens if the source QEMU sends everything and > > shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest > > data to read, and (2) a pre-mature terminatino of TLS session in a read() > > syscall. Would (2) be reported even before (1), or the order guaranteed > > that read of the residue data in (1) always happen before (2) (considering > > dest QEMU can be slow sometime on consuming the network buffers)? > > That's not logically possible. > > In both (1) and (2) you are issuing a read() call to the TLS channel. > > The first read call(s) consume all incoming data. Only once the underlying > TCP socket read() returns 0, would GNUTLS see that it hasn't got any > TLS "bye" packet, and thus return GNUTLS_E_PREMATURE_TERMINATION from > the layered TLS read(). IOW, if you see GNUTLS_E_PREMATURE_TERMINATION > you know you have already read all received data off the socket. That looks all OK then. In that case we could set all migration TLS sessions to ignore premature terminations. Thanks, -- Peter Xu
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote: > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote: > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > From: "Maciej S. Szmigiero" > > > > > > Multifd send channels are terminated by calling > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > multifd_send_terminate_threads(), which in the TLS case essentially > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > Unfortunately, this does not terminate the TLS session properly and > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > The only reason why this wasn't causing migration failures is because > > > the current migration code apparently does not check for migration > > > error being set after the end of the multifd receive process. > > > > > > However, this will change soon so the multifd receive code has to be > > > prepared to not return an error on such premature TLS session EOF. > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > It's worth noting that even if the sender were to be changed to terminate > > > the TLS connection properly the receive side still needs to remain > > > compatible with older QEMU bit stream which does not do this. > > > > If this is an existing bug, we could add a Fixes. > > > > Two pure questions.. > > > > - What is the correct way to terminate the TLS session without this flag? > > > > - Why this is only needed by multifd sessions? > > Graceful TLS termination (via gnutls_bye()) should only be important to > security if the QEMU protocol in question does not know how much data it > is expecting to recieve. ie it cannot otherwise distinguish between an > expected EOF, and a premature EOF triggered by an attacker. > > If the migration protocol has sufficient info to know when a chanel is > expected to see EOF, then we should stop trying to read from the TLS > channel before seeing the underlying EOF. > > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that > migration will still fail corretly in the case of a malicious attack > causing premature termination. > > If there's a risk that migration may succeed, but with incomplete data, > then we would need the full gnutls_bye dance. IIUC that's not required for migration then, because migration should know exactly how much data to receive, and migration should need to verify that and fail if the received data didn't match the expectation along the way. We also have QEMU_VM_EOF as the end mark of stream. Said that, are we sure any pre-mature termination will only happen after all data read in the receive buffer that was sent? To ask in another way: what happens if the source QEMU sends everything and shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest data to read, and (2) a pre-mature terminatino of TLS session in a read() syscall. Would (2) be reported even before (1), or the order guaranteed that read of the residue data in (1) always happen before (2) (considering dest QEMU can be slow sometime on consuming the network buffers)? Thanks, -- Peter Xu
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Tue, Feb 04, 2025 at 11:02:28AM -0500, Peter Xu wrote: > On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote: > > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote: > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > > From: "Maciej S. Szmigiero" > > > > > > > > Multifd send channels are terminated by calling > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > multifd_send_terminate_threads(), which in the TLS case essentially > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > Unfortunately, this does not terminate the TLS session properly and > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > > > The only reason why this wasn't causing migration failures is because > > > > the current migration code apparently does not check for migration > > > > error being set after the end of the multifd receive process. > > > > > > > > However, this will change soon so the multifd receive code has to be > > > > prepared to not return an error on such premature TLS session EOF. > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > terminate > > > > the TLS connection properly the receive side still needs to remain > > > > compatible with older QEMU bit stream which does not do this. > > > > > > If this is an existing bug, we could add a Fixes. > > > > > > Two pure questions.. > > > > > > - What is the correct way to terminate the TLS session without this > > > flag? > > > > > > - Why this is only needed by multifd sessions? > > > > Graceful TLS termination (via gnutls_bye()) should only be important to > > security if the QEMU protocol in question does not know how much data it > > is expecting to recieve. ie it cannot otherwise distinguish between an > > expected EOF, and a premature EOF triggered by an attacker. > > > > If the migration protocol has sufficient info to know when a chanel is > > expected to see EOF, then we should stop trying to read from the TLS > > channel before seeing the underlying EOF. > > > > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that > > migration will still fail corretly in the case of a malicious attack > > causing premature termination. > > > > If there's a risk that migration may succeed, but with incomplete data, > > then we would need the full gnutls_bye dance. > > IIUC that's not required for migration then, because migration should know > exactly how much data to receive, and migration should need to verify that > and fail if the received data didn't match the expectation along the way. > We also have QEMU_VM_EOF as the end mark of stream. > > Said that, are we sure any pre-mature termination will only happen after > all data read in the receive buffer that was sent? > > To ask in another way: what happens if the source QEMU sends everything and > shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest > data to read, and (2) a pre-mature terminatino of TLS session in a read() > syscall. Would (2) be reported even before (1), or the order guaranteed > that read of the residue data in (1) always happen before (2) (considering > dest QEMU can be slow sometime on consuming the network buffers)? That's not logically possible. In both (1) and (2) you are issuing a read() call to the TLS channel. The first read call(s) consume all incoming data. Only once the underlying TCP socket read() returns 0, would GNUTLS see that it hasn't got any TLS "bye" packet, and thus return GNUTLS_E_PREMATURE_TERMINATION from the layered TLS read(). IOW, if you see GNUTLS_E_PREMATURE_TERMINATION you know you have already read all received data off the socket. With regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote: > On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: > > On 3.02.2025 23:56, Peter Xu wrote: > > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: > > > > On 3.02.2025 21:20, Peter Xu wrote: > > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > > > > > > On 3.02.2025 19:20, Peter Xu wrote: > > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero > > > > > > > wrote: > > > > > > > > From: "Maciej S. Szmigiero" > > > > > > > > > > > > > > > > Multifd send channels are terminated by calling > > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > > > > > multifd_send_terminate_threads(), which in the TLS case > > > > > > > > essentially > > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > > > > > > > > > Unfortunately, this does not terminate the TLS session properly > > > > > > > > and > > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION > > > > > > > > error. > > > > > > > > > > > > > > > > The only reason why this wasn't causing migration failures is > > > > > > > > because > > > > > > > > the current migration code apparently does not check for > > > > > > > > migration > > > > > > > > error being set after the end of the multifd receive process. > > > > > > > > > > > > > > > > However, this will change soon so the multifd receive code has > > > > > > > > to be > > > > > > > > prepared to not return an error on such premature TLS session > > > > > > > > EOF. > > > > > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > > > > > terminate > > > > > > > > the TLS connection properly the receive side still needs to > > > > > > > > remain > > > > > > > > compatible with older QEMU bit stream which does not do this. > > > > > > > > > > > > > > If this is an existing bug, we could add a Fixes. > > > > > > > > > > > > It is an existing issue but only uncovered by this patch set. > > > > > > > > > > > > As far as I can see it was always there, so it would need some > > > > > > thought where to point that Fixes tag. > > > > > > > > > > If there's no way to trigger a real functional bug anyway, it's also > > > > > ok we > > > > > omit the Fixes. > > > > > > > > > > > > Two pure questions.. > > > > > > > > > > > > > > - What is the correct way to terminate the TLS session > > > > > > > without this flag? > > > > > > > > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS > > > > > > example: > > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > > > > > > > > > > > > > - Why this is only needed by multifd sessions? > > > > > > > > > > > > What uncovered the issue was switching the load threads to using > > > > > > migrate_set_error() instead of their own result variable > > > > > > (load_threads_ret) which you had requested during the previous > > > > > > patch set version review: > > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > > > > > > > > > > > Turns out that the multifd receive code always returned > > > > > > error in the TLS case, just nothing was previously checking for > > > > > > that error presence. > > > > > > > > > > What I was curious is whether this issue also exists for the main > > > > > migration > > > > > channel when with tls, especially when e.g. multifd not enabled at > > > > > all. As > > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session. > > > > > > > > > > I think it's a good to find that we overlooked this before.. and IMHO > > > > > it's > > > > > always good we could fix this. > > > > > > > > > > Does it mean we need proper gnutls_bye() somewhere? > > > > > > > > > > If we need an explicit gnutls_bye(), then I wonder if that should be > > > > > done > > > > > on the main channel as well. > > > > > > > > That's a good question and looking at the code qemu_loadvm_state_main() > > > > exits > > > > on receiving "QEMU_VM_EOF" section (that's different from receiving > > > > socket EOF) > > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with > > > > explicit size > > > > in qemu_loadvm_state() - so still not until channel EOF. > > > > > > I had a closer look, I do feel like such pre-mature termination is caused > > > by explicit shutdown()s of the iochannels, looks like that can cause issue > > > even after everything is sent. Then I noticed indeed multifd sender > > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we > > > don't do that for the main channel. Maybe that is a major difference. > > > > > > Now I wonder whether we should shutdown() the channel at all if migration > > > succeeded, because looks like it can cause tl
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote: > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > From: "Maciej S. Szmigiero" > > > > Multifd send channels are terminated by calling > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > multifd_send_terminate_threads(), which in the TLS case essentially > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > Unfortunately, this does not terminate the TLS session properly and > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > The only reason why this wasn't causing migration failures is because > > the current migration code apparently does not check for migration > > error being set after the end of the multifd receive process. > > > > However, this will change soon so the multifd receive code has to be > > prepared to not return an error on such premature TLS session EOF. > > Use the newly introduced QIOChannelTLS method for that. > > > > It's worth noting that even if the sender were to be changed to terminate > > the TLS connection properly the receive side still needs to remain > > compatible with older QEMU bit stream which does not do this. > > If this is an existing bug, we could add a Fixes. > > Two pure questions.. > > - What is the correct way to terminate the TLS session without this flag? > > - Why this is only needed by multifd sessions? Graceful TLS termination (via gnutls_bye()) should only be important to security if the QEMU protocol in question does not know how much data it is expecting to recieve. ie it cannot otherwise distinguish between an expected EOF, and a premature EOF triggered by an attacker. If the migration protocol has sufficient info to know when a chanel is expected to see EOF, then we should stop trying to read from the TLS channel before seeing the underlying EOF. Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that migration will still fail corretly in the case of a malicious attack causing premature termination. If there's a risk that migration may succeed, but with incomplete data, then we would need the full gnutls_bye dance. With regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote: > On 3.02.2025 23:56, Peter Xu wrote: > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: > > > On 3.02.2025 21:20, Peter Xu wrote: > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > > > > > On 3.02.2025 19:20, Peter Xu wrote: > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > > > > > From: "Maciej S. Szmigiero" > > > > > > > > > > > > > > Multifd send channels are terminated by calling > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > > > > multifd_send_terminate_threads(), which in the TLS case > > > > > > > essentially > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > > > > > > > Unfortunately, this does not terminate the TLS session properly > > > > > > > and > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION > > > > > > > error. > > > > > > > > > > > > > > The only reason why this wasn't causing migration failures is > > > > > > > because > > > > > > > the current migration code apparently does not check for migration > > > > > > > error being set after the end of the multifd receive process. > > > > > > > > > > > > > > However, this will change soon so the multifd receive code has to > > > > > > > be > > > > > > > prepared to not return an error on such premature TLS session EOF. > > > > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > > > > terminate > > > > > > > the TLS connection properly the receive side still needs to remain > > > > > > > compatible with older QEMU bit stream which does not do this. > > > > > > > > > > > > If this is an existing bug, we could add a Fixes. > > > > > > > > > > It is an existing issue but only uncovered by this patch set. > > > > > > > > > > As far as I can see it was always there, so it would need some > > > > > thought where to point that Fixes tag. > > > > > > > > If there's no way to trigger a real functional bug anyway, it's also ok > > > > we > > > > omit the Fixes. > > > > > > > > > > Two pure questions.. > > > > > > > > > > > > - What is the correct way to terminate the TLS session without > > > > > > this flag? > > > > > > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS > > > > > example: > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > > > > > > > > > > > - Why this is only needed by multifd sessions? > > > > > > > > > > What uncovered the issue was switching the load threads to using > > > > > migrate_set_error() instead of their own result variable > > > > > (load_threads_ret) which you had requested during the previous > > > > > patch set version review: > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > > > > > > > > > Turns out that the multifd receive code always returned > > > > > error in the TLS case, just nothing was previously checking for > > > > > that error presence. > > > > > > > > What I was curious is whether this issue also exists for the main > > > > migration > > > > channel when with tls, especially when e.g. multifd not enabled at all. > > > > As > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session. > > > > > > > > I think it's a good to find that we overlooked this before.. and IMHO > > > > it's > > > > always good we could fix this. > > > > > > > > Does it mean we need proper gnutls_bye() somewhere? > > > > > > > > If we need an explicit gnutls_bye(), then I wonder if that should be > > > > done > > > > on the main channel as well. > > > > > > That's a good question and looking at the code qemu_loadvm_state_main() > > > exits > > > on receiving "QEMU_VM_EOF" section (that's different from receiving > > > socket EOF) > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit > > > size > > > in qemu_loadvm_state() - so still not until channel EOF. > > > > I had a closer look, I do feel like such pre-mature termination is caused > > by explicit shutdown()s of the iochannels, looks like that can cause issue > > even after everything is sent. Then I noticed indeed multifd sender > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we > > don't do that for the main channel. Maybe that is a major difference. > > > > Now I wonder whether we should shutdown() the channel at all if migration > > succeeded, because looks like it can cause tls session to interrupt even if > > the shutdown() is done after sent everything, and if so it'll explain why > > you hit the issue with tls. > > > > > > > > Then I can't see anything else reading the channel until it is closed in > > > migration_incoming_state_destroy(). > > > > > > So most likely the main migration c
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 4.02.2025 16:00, Fabiano Rosas wrote: "Maciej S. Szmigiero" writes: On 3.02.2025 23:56, Peter Xu wrote: On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. I had a closer look, I do feel like such pre-mature termination is caused by explicit shutdown()s of the iochannels, looks like that can cause issue even after everything is sent. Then I noticed indeed multifd sender iochannels will get explicit shutdown()s since commit 077fbb5942, while we don't do that for the main channel. Maybe that is a major difference. Now I wonder whether we should shutdown() the channel at all if migration succeeded, because looks like it can cause tls session to interrupt even if the shutdown() is done after sent everything, and if so it'll explain why you hit the issue with tls. Then I can't see anything else reading the channel until it is closed in migration_incoming_state_destroy(). So most likely the main migration channel will never read far enough to reach that GNUTLS_E_PREMATURE_TERMINATION error. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? So basically have this patch extended to calling qio_channel_tls_set_premature_eof_okay() also on the main migration channel? If above theory can stand, then eof-okay could be a workaround papering over the real problem that we shouldn't always shutdown().. Could you have a look at below patch and see whether it can fix the problem you hit too, in replace of these two patches (including the previous iochannel change)? Unfortunately, the patch below does not fix the problem: qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated. qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated. I think that, even
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Mon, Feb 03, 2025 at 03:20:50PM -0500, Peter Xu wrote: > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > > On 3.02.2025 19:20, Peter Xu wrote: > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > > From: "Maciej S. Szmigiero" > > > > > > > > Multifd send channels are terminated by calling > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > multifd_send_terminate_threads(), which in the TLS case essentially > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > Unfortunately, this does not terminate the TLS session properly and > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > > > The only reason why this wasn't causing migration failures is because > > > > the current migration code apparently does not check for migration > > > > error being set after the end of the multifd receive process. > > > > > > > > However, this will change soon so the multifd receive code has to be > > > > prepared to not return an error on such premature TLS session EOF. > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > terminate > > > > the TLS connection properly the receive side still needs to remain > > > > compatible with older QEMU bit stream which does not do this. > > > > > > If this is an existing bug, we could add a Fixes. > > > > It is an existing issue but only uncovered by this patch set. > > > > As far as I can see it was always there, so it would need some > > thought where to point that Fixes tag. > > If there's no way to trigger a real functional bug anyway, it's also ok we > omit the Fixes. > > > > Two pure questions.. > > > > > >- What is the correct way to terminate the TLS session without this > > > flag? > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example: > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > > > > >- Why this is only needed by multifd sessions? > > > > What uncovered the issue was switching the load threads to using > > migrate_set_error() instead of their own result variable > > (load_threads_ret) which you had requested during the previous > > patch set version review: > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > > > Turns out that the multifd receive code always returned > > error in the TLS case, just nothing was previously checking for > > that error presence. > > What I was curious is whether this issue also exists for the main migration > channel when with tls, especially when e.g. multifd not enabled at all. As > I don't see anywhere that qemu uses gnutls_bye() for any tls session. We've been lazy and avoided using gnutls_bye because it adds a bunch more complexity to the shutdown sequence, and premature shutdown from an malicious attack would be expected to cause the QEMU protocol in the TLS channel to fail anyway. > Does it mean we need proper gnutls_bye() somewhere? Depends if the protocol we run over TLS can identify premature termination itself, or whether it relies on TLS to identify it. With regards, Daniel -- |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o-https://fstop138.berrange.com :| |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
"Maciej S. Szmigiero" writes: > On 3.02.2025 23:56, Peter Xu wrote: >> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >>> On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > On 3.02.2025 19:20, Peter Xu wrote: >> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: >>> From: "Maciej S. Szmigiero" >>> >>> Multifd send channels are terminated by calling >>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >>> multifd_send_terminate_threads(), which in the TLS case essentially >>> calls shutdown(SHUT_RDWR) on the underlying raw socket. >>> >>> Unfortunately, this does not terminate the TLS session properly and >>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. >>> >>> The only reason why this wasn't causing migration failures is because >>> the current migration code apparently does not check for migration >>> error being set after the end of the multifd receive process. >>> >>> However, this will change soon so the multifd receive code has to be >>> prepared to not return an error on such premature TLS session EOF. >>> Use the newly introduced QIOChannelTLS method for that. >>> >>> It's worth noting that even if the sender were to be changed to >>> terminate >>> the TLS connection properly the receive side still needs to remain >>> compatible with older QEMU bit stream which does not do this. >> >> If this is an existing bug, we could add a Fixes. > > It is an existing issue but only uncovered by this patch set. > > As far as I can see it was always there, so it would need some > thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. >> Two pure questions.. >> >> - What is the correct way to terminate the TLS session without this >> flag? > > I guess one would need to call gnutls_bye() like in this GnuTLS example: > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > >> - Why this is only needed by multifd sessions? > > What uncovered the issue was switching the load threads to using > migrate_set_error() instead of their own result variable > (load_threads_ret) which you had requested during the previous > patch set version review: > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > Turns out that the multifd receive code always returned > error in the TLS case, just nothing was previously checking for > that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. >>> >>> That's a good question and looking at the code qemu_loadvm_state_main() >>> exits >>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket >>> EOF) >>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit >>> size >>> in qemu_loadvm_state() - so still not until channel EOF. >> >> I had a closer look, I do feel like such pre-mature termination is caused >> by explicit shutdown()s of the iochannels, looks like that can cause issue >> even after everything is sent. Then I noticed indeed multifd sender >> iochannels will get explicit shutdown()s since commit 077fbb5942, while we >> don't do that for the main channel. Maybe that is a major difference. >> >> Now I wonder whether we should shutdown() the channel at all if migration >> succeeded, because looks like it can cause tls session to interrupt even if >> the shutdown() is done after sent everything, and if so it'll explain why >> you hit the issue with tls. >> >>> >>> Then I can't see anything else reading the channel until it is closed in >>> migration_incoming_state_destroy(). >>> >>> So most likely the main migration channel will never read far enough to >>> reach that GNUTLS_E_PREMATURE_TERMINATION error. >>> If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? >>> >>> So basically have this patch extended to calling >>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel? >> >> If above theory can stand, then eof-okay could be a workaround papering >> over the re
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 3.02.2025 23:56, Peter Xu wrote: On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. I had a closer look, I do feel like such pre-mature termination is caused by explicit shutdown()s of the iochannels, looks like that can cause issue even after everything is sent. Then I noticed indeed multifd sender iochannels will get explicit shutdown()s since commit 077fbb5942, while we don't do that for the main channel. Maybe that is a major difference. Now I wonder whether we should shutdown() the channel at all if migration succeeded, because looks like it can cause tls session to interrupt even if the shutdown() is done after sent everything, and if so it'll explain why you hit the issue with tls. Then I can't see anything else reading the channel until it is closed in migration_incoming_state_destroy(). So most likely the main migration channel will never read far enough to reach that GNUTLS_E_PREMATURE_TERMINATION error. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? So basically have this patch extended to calling qio_channel_tls_set_premature_eof_okay() also on the main migration channel? If above theory can stand, then eof-okay could be a workaround papering over the real problem that we shouldn't always shutdown().. Could you have a look at below patch and see whether it can fix the problem you hit too, in replace of these two patches (including the previous iochannel change)? Unfortunately, the patch below does not fix the problem: qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated. qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was non-properly terminated. I think that, even in the absence of shutdown(), if the sender does not call gnutls_bye() the T
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
Peter Xu writes: > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: >> On 3.02.2025 21:20, Peter Xu wrote: >> > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: >> > > On 3.02.2025 19:20, Peter Xu wrote: >> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: >> > > > > From: "Maciej S. Szmigiero" >> > > > > >> > > > > Multifd send channels are terminated by calling >> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in >> > > > > multifd_send_terminate_threads(), which in the TLS case essentially >> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. >> > > > > >> > > > > Unfortunately, this does not terminate the TLS session properly and >> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. >> > > > > >> > > > > The only reason why this wasn't causing migration failures is because >> > > > > the current migration code apparently does not check for migration >> > > > > error being set after the end of the multifd receive process. >> > > > > >> > > > > However, this will change soon so the multifd receive code has to be >> > > > > prepared to not return an error on such premature TLS session EOF. >> > > > > Use the newly introduced QIOChannelTLS method for that. >> > > > > >> > > > > It's worth noting that even if the sender were to be changed to >> > > > > terminate >> > > > > the TLS connection properly the receive side still needs to remain >> > > > > compatible with older QEMU bit stream which does not do this. >> > > > >> > > > If this is an existing bug, we could add a Fixes. >> > > >> > > It is an existing issue but only uncovered by this patch set. >> > > >> > > As far as I can see it was always there, so it would need some >> > > thought where to point that Fixes tag. >> > >> > If there's no way to trigger a real functional bug anyway, it's also ok we >> > omit the Fixes. >> > >> > > > Two pure questions.. >> > > > >> > > > - What is the correct way to terminate the TLS session without >> > > > this flag? >> > > >> > > I guess one would need to call gnutls_bye() like in this GnuTLS example: >> > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 >> > > >> > > > - Why this is only needed by multifd sessions? >> > > >> > > What uncovered the issue was switching the load threads to using >> > > migrate_set_error() instead of their own result variable >> > > (load_threads_ret) which you had requested during the previous >> > > patch set version review: >> > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ >> > > >> > > Turns out that the multifd receive code always returned >> > > error in the TLS case, just nothing was previously checking for >> > > that error presence. >> > >> > What I was curious is whether this issue also exists for the main migration >> > channel when with tls, especially when e.g. multifd not enabled at all. As >> > I don't see anywhere that qemu uses gnutls_bye() for any tls session. >> > >> > I think it's a good to find that we overlooked this before.. and IMHO it's >> > always good we could fix this. >> > >> > Does it mean we need proper gnutls_bye() somewhere? >> > >> > If we need an explicit gnutls_bye(), then I wonder if that should be done >> > on the main channel as well. >> >> That's a good question and looking at the code qemu_loadvm_state_main() exits >> on receiving "QEMU_VM_EOF" section (that's different from receiving socket >> EOF) >> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit >> size >> in qemu_loadvm_state() - so still not until channel EOF. > > I had a closer look, I do feel like such pre-mature termination is caused > by explicit shutdown()s of the iochannels, looks like that can cause issue > even after everything is sent. Then I noticed indeed multifd sender > iochannels will get explicit shutdown()s since commit 077fbb5942, while we > don't do that for the main channel. Maybe that is a major difference. > > Now I wonder whether we should shutdown() the channel at all if migration > succeeded, because looks like it can cause tls session to interrupt even if > the shutdown() is done after sent everything, and if so it'll explain why > you hit the issue with tls. > >> >> Then I can't see anything else reading the channel until it is closed in >> migration_incoming_state_destroy(). >> >> So most likely the main migration channel will never read far enough to >> reach that GNUTLS_E_PREMATURE_TERMINATION error. >> >> > If we don't need gnutls_bye(), then should we always ignore pre-mature >> > termination of tls no matter if it's multifd or non-multifd channel (or >> > even a tls session that is not migration-related)? >> >> So basically have this patch extended to calling >> qio_channel_tls_set_premature_eof_okay() also on the main migration channel? > > If above theory can stand, then eof-okay could be a wor
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote: > On 3.02.2025 21:20, Peter Xu wrote: > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > > > On 3.02.2025 19:20, Peter Xu wrote: > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > > > From: "Maciej S. Szmigiero" > > > > > > > > > > Multifd send channels are terminated by calling > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > > > multifd_send_terminate_threads(), which in the TLS case essentially > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > > > > > Unfortunately, this does not terminate the TLS session properly and > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > > > > > The only reason why this wasn't causing migration failures is because > > > > > the current migration code apparently does not check for migration > > > > > error being set after the end of the multifd receive process. > > > > > > > > > > However, this will change soon so the multifd receive code has to be > > > > > prepared to not return an error on such premature TLS session EOF. > > > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > > > > > It's worth noting that even if the sender were to be changed to > > > > > terminate > > > > > the TLS connection properly the receive side still needs to remain > > > > > compatible with older QEMU bit stream which does not do this. > > > > > > > > If this is an existing bug, we could add a Fixes. > > > > > > It is an existing issue but only uncovered by this patch set. > > > > > > As far as I can see it was always there, so it would need some > > > thought where to point that Fixes tag. > > > > If there's no way to trigger a real functional bug anyway, it's also ok we > > omit the Fixes. > > > > > > Two pure questions.. > > > > > > > > - What is the correct way to terminate the TLS session without this > > > > flag? > > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS example: > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > > > > > > > - Why this is only needed by multifd sessions? > > > > > > What uncovered the issue was switching the load threads to using > > > migrate_set_error() instead of their own result variable > > > (load_threads_ret) which you had requested during the previous > > > patch set version review: > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > > > > > Turns out that the multifd receive code always returned > > > error in the TLS case, just nothing was previously checking for > > > that error presence. > > > > What I was curious is whether this issue also exists for the main migration > > channel when with tls, especially when e.g. multifd not enabled at all. As > > I don't see anywhere that qemu uses gnutls_bye() for any tls session. > > > > I think it's a good to find that we overlooked this before.. and IMHO it's > > always good we could fix this. > > > > Does it mean we need proper gnutls_bye() somewhere? > > > > If we need an explicit gnutls_bye(), then I wonder if that should be done > > on the main channel as well. > > That's a good question and looking at the code qemu_loadvm_state_main() exits > on receiving "QEMU_VM_EOF" section (that's different from receiving socket > EOF) > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size > in qemu_loadvm_state() - so still not until channel EOF. I had a closer look, I do feel like such pre-mature termination is caused by explicit shutdown()s of the iochannels, looks like that can cause issue even after everything is sent. Then I noticed indeed multifd sender iochannels will get explicit shutdown()s since commit 077fbb5942, while we don't do that for the main channel. Maybe that is a major difference. Now I wonder whether we should shutdown() the channel at all if migration succeeded, because looks like it can cause tls session to interrupt even if the shutdown() is done after sent everything, and if so it'll explain why you hit the issue with tls. > > Then I can't see anything else reading the channel until it is closed in > migration_incoming_state_destroy(). > > So most likely the main migration channel will never read far enough to > reach that GNUTLS_E_PREMATURE_TERMINATION error. > > > If we don't need gnutls_bye(), then should we always ignore pre-mature > > termination of tls no matter if it's multifd or non-multifd channel (or > > even a tls session that is not migration-related)? > > So basically have this patch extended to calling > qio_channel_tls_set_premature_eof_okay() also on the main migration channel? If above theory can stand, then eof-okay could be a workaround papering over the real problem that we shouldn't always shutdown().. Could you have a look at below patch and see whether it ca
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 3.02.2025 21:20, Peter Xu wrote: On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. That's a good question and looking at the code qemu_loadvm_state_main() exits on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF) and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size in qemu_loadvm_state() - so still not until channel EOF. Then I can't see anything else reading the channel until it is closed in migration_incoming_state_destroy(). So most likely the main migration channel will never read far enough to reach that GNUTLS_E_PREMATURE_TERMINATION error. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? So basically have this patch extended to calling qio_channel_tls_set_premature_eof_okay() also on the main migration channel? Thanks, Thanks, Maciej
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote: > On 3.02.2025 19:20, Peter Xu wrote: > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > > > From: "Maciej S. Szmigiero" > > > > > > Multifd send channels are terminated by calling > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > > > multifd_send_terminate_threads(), which in the TLS case essentially > > > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > > > > > Unfortunately, this does not terminate the TLS session properly and > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > > > > > The only reason why this wasn't causing migration failures is because > > > the current migration code apparently does not check for migration > > > error being set after the end of the multifd receive process. > > > > > > However, this will change soon so the multifd receive code has to be > > > prepared to not return an error on such premature TLS session EOF. > > > Use the newly introduced QIOChannelTLS method for that. > > > > > > It's worth noting that even if the sender were to be changed to terminate > > > the TLS connection properly the receive side still needs to remain > > > compatible with older QEMU bit stream which does not do this. > > > > If this is an existing bug, we could add a Fixes. > > It is an existing issue but only uncovered by this patch set. > > As far as I can see it was always there, so it would need some > thought where to point that Fixes tag. If there's no way to trigger a real functional bug anyway, it's also ok we omit the Fixes. > > Two pure questions.. > > > >- What is the correct way to terminate the TLS session without this flag? > > I guess one would need to call gnutls_bye() like in this GnuTLS example: > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 > > >- Why this is only needed by multifd sessions? > > What uncovered the issue was switching the load threads to using > migrate_set_error() instead of their own result variable > (load_threads_ret) which you had requested during the previous > patch set version review: > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ > > Turns out that the multifd receive code always returned > error in the TLS case, just nothing was previously checking for > that error presence. What I was curious is whether this issue also exists for the main migration channel when with tls, especially when e.g. multifd not enabled at all. As I don't see anywhere that qemu uses gnutls_bye() for any tls session. I think it's a good to find that we overlooked this before.. and IMHO it's always good we could fix this. Does it mean we need proper gnutls_bye() somewhere? If we need an explicit gnutls_bye(), then I wonder if that should be done on the main channel as well. If we don't need gnutls_bye(), then should we always ignore pre-mature termination of tls no matter if it's multifd or non-multifd channel (or even a tls session that is not migration-related)? Thanks, > > Another option would be to simply return to using > load_threads_ret like the previous versions did and not > experiment with touching global migration state because > as we can see other places can unintentionally break. > > If we go this route then these TLS EOF patches could be > dropped. > > > Thanks, > > > > Thanks, > Maciej > -- Peter Xu
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On 3.02.2025 19:20, Peter Xu wrote: On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: From: "Maciej S. Szmigiero" Multifd send channels are terminated by calling qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in multifd_send_terminate_threads(), which in the TLS case essentially calls shutdown(SHUT_RDWR) on the underlying raw socket. Unfortunately, this does not terminate the TLS session properly and the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. The only reason why this wasn't causing migration failures is because the current migration code apparently does not check for migration error being set after the end of the multifd receive process. However, this will change soon so the multifd receive code has to be prepared to not return an error on such premature TLS session EOF. Use the newly introduced QIOChannelTLS method for that. It's worth noting that even if the sender were to be changed to terminate the TLS connection properly the receive side still needs to remain compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. It is an existing issue but only uncovered by this patch set. As far as I can see it was always there, so it would need some thought where to point that Fixes tag. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? I guess one would need to call gnutls_bye() like in this GnuTLS example: https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102 - Why this is only needed by multifd sessions? What uncovered the issue was switching the load threads to using migrate_set_error() instead of their own result variable (load_threads_ret) which you had requested during the previous patch set version review: https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/ Turns out that the multifd receive code always returned error in the TLS case, just nothing was previously checking for that error presence. Another option would be to simply return to using load_threads_ret like the previous versions did and not experiment with touching global migration state because as we can see other places can unintentionally break. If we go this route then these TLS EOF patches could be dropped. Thanks, Thanks, Maciej
Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels
On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" > > Multifd send channels are terminated by calling > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in > multifd_send_terminate_threads(), which in the TLS case essentially > calls shutdown(SHUT_RDWR) on the underlying raw socket. > > Unfortunately, this does not terminate the TLS session properly and > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error. > > The only reason why this wasn't causing migration failures is because > the current migration code apparently does not check for migration > error being set after the end of the multifd receive process. > > However, this will change soon so the multifd receive code has to be > prepared to not return an error on such premature TLS session EOF. > Use the newly introduced QIOChannelTLS method for that. > > It's worth noting that even if the sender were to be changed to terminate > the TLS connection properly the receive side still needs to remain > compatible with older QEMU bit stream which does not do this. If this is an existing bug, we could add a Fixes. Two pure questions.. - What is the correct way to terminate the TLS session without this flag? - Why this is only needed by multifd sessions? Thanks, -- Peter Xu
