On 3.02.2025 21:20, Peter Xu wrote:
On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
On 3.02.2025 19:20, Peter Xu wrote:
On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.

If this is an existing bug, we could add a Fixes.

It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.

If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.

Two pure questions..

    - What is the correct way to terminate the TLS session without this flag?

I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102

    - Why this is only needed by multifd sessions?

What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.

What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.

That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.

Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.

If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?

So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?

Thanks,

Thanks,
Maciej


Reply via email to