Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-07 Thread Fabiano Rosas
Peter Xu  writes:

> On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote:
>> Peter Xu  writes:
>> 
>> > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
>> >> > In any case we'd still need some kind of a compatibility behavior for
>> >> > the TLS bit stream emitted by older QEMU versions (which is always
>> >> > improperly terminated).
>> >> >
>> >> 
>> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra
>> >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
>> >> the destination and it gets stuck waiting for the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
>> >> closes the connection before dst reaches the extra recv().
>> >> 
>> >> I test migration both ways with 2 previous QEMU versions and the
>> >> gnutls_bye() series passes all tests. I also put an assert at
>> >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
>> >> MULTIFD_FLAG_EOS should behave the same.
>> >
>> > Which are the versions you tried?  As only 9.1 and 9.2 has 637280aeb2, so I
>> > wonder if the same issue would hit too with 9.0 or older.
>> 
>> Good point. 9.0 indeed breaks.
>> 
>> >
>> > I'd confess I feel unreliable relying on the side effect of 637280aeb2,
>> > because fundamentally it works based on the fact that multifd threads need
>> > to be kicked out by the main load thread SYNC event on dest QEMU to avoid
>> > the readv() from going wrong.
>> >
>> 
>> We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is
>> a bug that 1d457daf86 fixed.
>> 
>> > What I'm not sure here is, is it sheer luck that the main channel SYNC will
>> > always arrive _before_ pre-mature terminations of the multifd channels?  It
>> > sounds like it could also happen when the multifd channels got its
>> > pre-mature termination early, before the main thread got the SYNC.
>> 
>> You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that
>> puts the recv thread in the "won't see the termination" state and that
>> is serialized:
>> 
>>SENDRECV
>>-+
>> 1  multifd_send_sync_main()
>> 2  pending_sync==true,
>> 3  send thread sends SYNC  recv thread gets SYNC
>> 4   recv gets stuck.
>> 5  multifd_send_shutdown() 
>> 6  shutdown()  multifd_recv_shutdown()
>>recv_terminate_threads()
>>recv exits without recv()
>> 
>> In other words, RECV would need to see the shutdown (6) before the SYNC
>> (3), which I don't think it possible.
>
> Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot
> to push the per-channel SYNC.  I got it the other way round.  Yeah if data
> is always ordered with shutdown() effect on recv then it seems in order.
>
>> 
>> >
>> > Maybe we still need a compat property at the end..
>> 
>> This is actually similar to preempt_pre_7_2, what about:
>> 
>> /*
>>  * This variable only makes sense when set on the machine that is
>>  * the destination of a multifd migration with TLS enabled. It
>>  * affects the behavior of the last send->recv iteration with
>>  * regards to termination of the TLS session. Defaults to true.
>>  *
>>  * When set:
>>  *
>>  * - the destination QEMU instance can expect to never get a
>>  *   GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error
>>  *   message: "The TLS connection was non-properly terminated".
>>  *
>>  * When clear:
>>  *
>>  * - the destination QEMU instance can expect to see a
>>  *   GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel
>>  *   whenever the last recv() call of that channel happens after
>>  *   the source QEMU instance has already issued shutdown() on the
>>  *   channel. This is affected by (at least) commits 637280aeb2
>>  *   and 1d457daf86.
>
> If we want to reference them after all, we could use another sentence to
> describe the effects:
>
>*   Commit 637280aeb2 (since 9.1) introduced a side effect to cause
>*   pre-mature termination not happen, while commit 1d457daf86
>*   (since 10.0) can unexpectedly re-expose the pre-mature
>*   termination issue.
>

I'll add this.

>>  *
>>  * NOTE: Regardless of the state of this option, a premature
>>  * termination of the TLS connection might happen due to error at
>>  * any moment prior to the last send->recv iteration.
>>  */
>> bool multifd_clean_tls_termination;
>> 
>> And I think the more straight-forward implementation is to incorporate
>> Maciej's premature_ok patches (in some form), otherwise that option will
>> have to take effect on the QIOChannel which is a layering violation.
>
> If we take Dan's comment into account:
>
> https://lore.kernel.org/r/[email protected]
>
> It means whenever multifd recv thread invoke

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-07 Thread Peter Xu
On Fri, Feb 07, 2025 at 10:17:19AM -0300, Fabiano Rosas wrote:
> Peter Xu  writes:
> 
> > On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
> >> > In any case we'd still need some kind of a compatibility behavior for
> >> > the TLS bit stream emitted by older QEMU versions (which is always
> >> > improperly terminated).
> >> >
> >> 
> >> There is no compat issue. For <= 9.2, QEMU is still doing an extra
> >> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
> >> the destination and it gets stuck waiting for the
> >> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
> >> closes the connection before dst reaches the extra recv().
> >> 
> >> I test migration both ways with 2 previous QEMU versions and the
> >> gnutls_bye() series passes all tests. I also put an assert at
> >> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
> >> MULTIFD_FLAG_EOS should behave the same.
> >
> > Which are the versions you tried?  As only 9.1 and 9.2 has 637280aeb2, so I
> > wonder if the same issue would hit too with 9.0 or older.
> 
> Good point. 9.0 indeed breaks.
> 
> >
> > I'd confess I feel unreliable relying on the side effect of 637280aeb2,
> > because fundamentally it works based on the fact that multifd threads need
> > to be kicked out by the main load thread SYNC event on dest QEMU to avoid
> > the readv() from going wrong.
> >
> 
> We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is
> a bug that 1d457daf86 fixed.
> 
> > What I'm not sure here is, is it sheer luck that the main channel SYNC will
> > always arrive _before_ pre-mature terminations of the multifd channels?  It
> > sounds like it could also happen when the multifd channels got its
> > pre-mature termination early, before the main thread got the SYNC.
> 
> You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that
> puts the recv thread in the "won't see the termination" state and that
> is serialized:
> 
>SENDRECV
>-+
> 1  multifd_send_sync_main()
> 2  pending_sync==true,
> 3  send thread sends SYNC  recv thread gets SYNC
> 4   recv gets stuck.
> 5  multifd_send_shutdown() 
> 6  shutdown()  multifd_recv_shutdown()
>recv_terminate_threads()
>recv exits without recv()
> 
> In other words, RECV would need to see the shutdown (6) before the SYNC
> (3), which I don't think it possible.

Ah yeah, I somehow remembered we sent a SYNC in the main channel but forgot
to push the per-channel SYNC.  I got it the other way round.  Yeah if data
is always ordered with shutdown() effect on recv then it seems in order.

> 
> >
> > Maybe we still need a compat property at the end..
> 
> This is actually similar to preempt_pre_7_2, what about:
> 
> /*
>  * This variable only makes sense when set on the machine that is
>  * the destination of a multifd migration with TLS enabled. It
>  * affects the behavior of the last send->recv iteration with
>  * regards to termination of the TLS session. Defaults to true.
>  *
>  * When set:
>  *
>  * - the destination QEMU instance can expect to never get a
>  *   GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error
>  *   message: "The TLS connection was non-properly terminated".
>  *
>  * When clear:
>  *
>  * - the destination QEMU instance can expect to see a
>  *   GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel
>  *   whenever the last recv() call of that channel happens after
>  *   the source QEMU instance has already issued shutdown() on the
>  *   channel. This is affected by (at least) commits 637280aeb2
>  *   and 1d457daf86.

If we want to reference them after all, we could use another sentence to
describe the effects:

   *   Commit 637280aeb2 (since 9.1) introduced a side effect to cause
   *   pre-mature termination not happen, while commit 1d457daf86
   *   (since 10.0) can unexpectedly re-expose the pre-mature
   *   termination issue.

>  *
>  * NOTE: Regardless of the state of this option, a premature
>  * termination of the TLS connection might happen due to error at
>  * any moment prior to the last send->recv iteration.
>  */
> bool multifd_clean_tls_termination;
> 
> And I think the more straight-forward implementation is to incorporate
> Maciej's premature_ok patches (in some form), otherwise that option will
> have to take effect on the QIOChannel which is a layering violation.

If we take Dan's comment into account:

https://lore.kernel.org/r/[email protected]

It means whenever multifd recv thread invokes the iochannel API it will use
multifd_clean_tls_termination to decide QIO_CHANNEL_READ_RELAXED_EOF flag
to pass in.  I hope this is not layer violation, or I co

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-07 Thread Fabiano Rosas
Peter Xu  writes:

> On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
>> > In any case we'd still need some kind of a compatibility behavior for
>> > the TLS bit stream emitted by older QEMU versions (which is always
>> > improperly terminated).
>> >
>> 
>> There is no compat issue. For <= 9.2, QEMU is still doing an extra
>> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
>> the destination and it gets stuck waiting for the
>> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
>> closes the connection before dst reaches the extra recv().
>> 
>> I test migration both ways with 2 previous QEMU versions and the
>> gnutls_bye() series passes all tests. I also put an assert at
>> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
>> MULTIFD_FLAG_EOS should behave the same.
>
> Which are the versions you tried?  As only 9.1 and 9.2 has 637280aeb2, so I
> wonder if the same issue would hit too with 9.0 or older.

Good point. 9.0 indeed breaks.

>
> I'd confess I feel unreliable relying on the side effect of 637280aeb2,
> because fundamentally it works based on the fact that multifd threads need
> to be kicked out by the main load thread SYNC event on dest QEMU to avoid
> the readv() from going wrong.
>

We're relying on the opposite: mutlifd_recv NOT getting kicked. Which is
a bug that 1d457daf86 fixed.

> What I'm not sure here is, is it sheer luck that the main channel SYNC will
> always arrive _before_ pre-mature terminations of the multifd channels?  It
> sounds like it could also happen when the multifd channels got its
> pre-mature termination early, before the main thread got the SYNC.

You lost me here, what main channel sync? Its the MULTIFD_FLAG_SYNC that
puts the recv thread in the "won't see the termination" state and that
is serialized:

   SENDRECV
   -+
1  multifd_send_sync_main()
2  pending_sync==true,
3  send thread sends SYNC  recv thread gets SYNC
4   recv gets stuck.
5  multifd_send_shutdown() 
6  shutdown()  multifd_recv_shutdown()
   recv_terminate_threads()
   recv exits without recv()

In other words, RECV would need to see the shutdown (6) before the SYNC
(3), which I don't think it possible.

>
> Maybe we still need a compat property at the end..

This is actually similar to preempt_pre_7_2, what about:

/*
 * This variable only makes sense when set on the machine that is
 * the destination of a multifd migration with TLS enabled. It
 * affects the behavior of the last send->recv iteration with
 * regards to termination of the TLS session. Defaults to true.
 *
 * When set:
 *
 * - the destination QEMU instance can expect to never get a
 *   GNUTLS_E_PREMATURE_TERMINATION error. Manifested as the error
 *   message: "The TLS connection was non-properly terminated".
 *
 * When clear:
 *
 * - the destination QEMU instance can expect to see a
 *   GNUTLS_E_PREMATURE_TERMINATION error in any multifd channel
 *   whenever the last recv() call of that channel happens after
 *   the source QEMU instance has already issued shutdown() on the
 *   channel. This is affected by (at least) commits 637280aeb2
 *   and 1d457daf86.
 *
 * NOTE: Regardless of the state of this option, a premature
 * termination of the TLS connection might happen due to error at
 * any moment prior to the last send->recv iteration.
 */
bool multifd_clean_tls_termination;

And I think the more straight-forward implementation is to incorporate
Maciej's premature_ok patches (in some form), otherwise that option will
have to take effect on the QIOChannel which is a layering violation.



Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Peter Xu
On Thu, Feb 06, 2025 at 02:32:12PM -0300, Fabiano Rosas wrote:
> > In any case we'd still need some kind of a compatibility behavior for
> > the TLS bit stream emitted by older QEMU versions (which is always
> > improperly terminated).
> >
> 
> There is no compat issue. For <= 9.2, QEMU is still doing an extra
> multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
> the destination and it gets stuck waiting for the
> RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
> closes the connection before dst reaches the extra recv().
> 
> I test migration both ways with 2 previous QEMU versions and the
> gnutls_bye() series passes all tests. I also put an assert at
> tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
> MULTIFD_FLAG_EOS should behave the same.

Which are the versions you tried?  As only 9.1 and 9.2 has 637280aeb2, so I
wonder if the same issue would hit too with 9.0 or older.

I'd confess I feel unreliable relying on the side effect of 637280aeb2,
because fundamentally it works based on the fact that multifd threads need
to be kicked out by the main load thread SYNC event on dest QEMU to avoid
the readv() from going wrong.

What I'm not sure here is, is it sheer luck that the main channel SYNC will
always arrive _before_ pre-mature terminations of the multifd channels?  It
sounds like it could also happen when the multifd channels got its
pre-mature termination early, before the main thread got the SYNC.

Maybe we still need a compat property at the end..

-- 
Peter Xu




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Maciej S. Szmigiero

On 6.02.2025 18:32, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


On 6.02.2025 16:20, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


(..)

Even if non-TLS behaved the same, why would we choose to port a bug to
the TLS implementation?

We could of course decide at this point to punt the problem forward and
when it shows up, we'd have to go implement gnutls_bye() to allow the
distinction between good-EOF/bad-EOF or maybe add an extra sync at the
end of migration to make sure the last recv() call is only started after
the source has already shutdown() the channel.


If we do some kind of a premature EOF detection then it should probably
work for the non-TLS case too (since that's probably by far the most
common use case).
So adding some MULTIFD_FLAG_EOS would make the most sense and would work
even with QIO_CHANNEL_READ_RELAXED_EOF being set.



Indeed, if MULTIFD_FLAG_EOS is not seen, the recv thread could treat any
EOF as an error. The question is whether we can add that without
disrupting multifd synchronization too much.


In any case we'd still need some kind of a compatibility behavior for
the TLS bit stream emitted by older QEMU versions (which is always
improperly terminated).



There is no compat issue. For <= 9.2, QEMU is still doing an extra
multifd_send_sync_main(), which results in an extra MULTIFD_FLAG_SYNC on
the destination and it gets stuck waiting for the
RAM_SAVE_FLAG_MULTIFD_FLUSH that never comes. Therefore the src always
closes the connection before dst reaches the extra recv().

I test migration both ways with 2 previous QEMU versions and the
gnutls_bye() series passes all tests. I also put an assert at
tlssession.c and never triggers for GNUTLS_E_PREMATURE_TERMINATION. The
MULTIFD_FLAG_EOS should behave the same.


If you are confident that properly terminating the TLS session by adding
gnutls_bye() is the way to go then I'm fine with this - I hope @Peter
and @Daniel are too.

It's now only matter of how soon you can have these gnutls_bye() patches
posted/merged since if I drop the premature_ok stuff the updated series
will depend on them for passing the TLS tests.

Thanks,
Maciej




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Fabiano Rosas
"Maciej S. Szmigiero"  writes:

> On 6.02.2025 16:20, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero"  writes:
>> 
>>> On 6.02.2025 15:13, Fabiano Rosas wrote:
 "Maciej S. Szmigiero"  writes:

> On 5.02.2025 21:42, Fabiano Rosas wrote:
>> Fabiano Rosas  writes:
>>
>>> Daniel P. Berrangé  writes:
>>>
 On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 23:56, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
 On 3.02.2025 21:20, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero 
> wrote:
>> On 3.02.2025 19:20, Peter Xu wrote:
>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero 
>>> wrote:
 From: "Maciej S. Szmigiero" 

 Multifd send channels are terminated by calling
 qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
 multifd_send_terminate_threads(), which in the TLS case 
 essentially
 calls shutdown(SHUT_RDWR) on the underlying raw socket.

 Unfortunately, this does not terminate the TLS session 
 properly and
 the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION 
 error.

 The only reason why this wasn't causing migration failures is 
 because
 the current migration code apparently does not check for 
 migration
 error being set after the end of the multifd receive process.

 However, this will change soon so the multifd receive code has 
 to be
 prepared to not return an error on such premature TLS session 
 EOF.
 Use the newly introduced QIOChannelTLS method for that.

 It's worth noting that even if the sender were to be changed 
 to terminate
 the TLS connection properly the receive side still needs to 
 remain
 compatible with older QEMU bit stream which does not do this.
>>>
>>> If this is an existing bug, we could add a Fixes.
>>
>> It is an existing issue but only uncovered by this patch set.
>>
>> As far as I can see it was always there, so it would need some
>> thought where to point that Fixes tag.
>
> If there's no way to trigger a real functional bug anyway, it's 
> also ok we
> omit the Fixes.
>
>>> Two pure questions..
>>>
>>> - What is the correct way to terminate the TLS session 
>>> without this flag?
>>
>> I guess one would need to call gnutls_bye() like in this GnuTLS 
>> example:
>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>
>>> - Why this is only needed by multifd sessions?
>>
>> What uncovered the issue was switching the load threads to using
>> migrate_set_error() instead of their own result variable
>> (load_threads_ret) which you had requested during the previous
>> patch set version review:
>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>
>> Turns out that the multifd receive code always returned
>> error in the TLS case, just nothing was previously checking for
>> that error presence.
>
> What I was curious is whether this issue also exists for the main 
> migration
> channel when with tls, especially when e.g. multifd not enabled 
> at all.  As
> I don't see anywhere that qemu uses gnutls_bye() for any tls 
> session.
>
> I think it's a good to find that we overlooked this before.. and 
> IMHO it's
> always good we could fix this.
>
> Does it mean we need proper gnutls_bye() somewhere?
>
> If we need an explicit gnutls_bye(), then I wonder if that should 
> be done
> on the main channel as well.

 That's a good question and looking at the code 
 qemu_loadvm_state_main() exits
 on receiving "QEMU_VM_EOF" section (that's different from 
 receiving socket EOF)
 and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
 explic

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Maciej S. Szmigiero

On 6.02.2025 16:20, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


On 6.02.2025 15:13, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


On 5.02.2025 21:42, Fabiano Rosas wrote:

Fabiano Rosas  writes:


Daniel P. Berrangé  writes:


On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:

On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 23:56, Peter Xu wrote:

On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 21:20, Peter Xu wrote:

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.


If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.


Two pure questions..

- What is the correct way to terminate the TLS session without this 
flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


- Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.


What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.


That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.


I had a closer look, I do feel like such pre-mature termination is caused
by explicit shutdown()s of the iochannels, looks like that can cause issue
even after everything is sent.  Then I noticed indeed multifd sender
iochannels will get explicit shutdown()s since commit 077fbb5942, while we
don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because looks like it can cause tls session to interrupt even if
the shutdown() is done after sent everything, and if so it'll explain why
you hit the issue with tls.



Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.


If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?


So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?


If above theory can stand, then eof-okay could be a workaround papering
over the real problem that we shouldn't always shutdown()..

Could you have a look at below patch and see whether it can fix the problem
you hit too, in replace of these two patches (including the p

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Fabiano Rosas
"Maciej S. Szmigiero"  writes:

> On 6.02.2025 15:13, Fabiano Rosas wrote:
>> "Maciej S. Szmigiero"  writes:
>> 
>>> On 5.02.2025 21:42, Fabiano Rosas wrote:
 Fabiano Rosas  writes:

> Daniel P. Berrangé  writes:
>
>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
 On 3.02.2025 23:56, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 21:20, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
 On 3.02.2025 19:20, Peter Xu wrote:
> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero 
> wrote:
>> From: "Maciej S. Szmigiero" 
>>
>> Multifd send channels are terminated by calling
>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> multifd_send_terminate_threads(), which in the TLS case 
>> essentially
>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>
>> Unfortunately, this does not terminate the TLS session properly 
>> and
>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION 
>> error.
>>
>> The only reason why this wasn't causing migration failures is 
>> because
>> the current migration code apparently does not check for 
>> migration
>> error being set after the end of the multifd receive process.
>>
>> However, this will change soon so the multifd receive code has 
>> to be
>> prepared to not return an error on such premature TLS session 
>> EOF.
>> Use the newly introduced QIOChannelTLS method for that.
>>
>> It's worth noting that even if the sender were to be changed to 
>> terminate
>> the TLS connection properly the receive side still needs to 
>> remain
>> compatible with older QEMU bit stream which does not do this.
>
> If this is an existing bug, we could add a Fixes.

 It is an existing issue but only uncovered by this patch set.

 As far as I can see it was always there, so it would need some
 thought where to point that Fixes tag.
>>>
>>> If there's no way to trigger a real functional bug anyway, it's 
>>> also ok we
>>> omit the Fixes.
>>>
> Two pure questions..
>
>- What is the correct way to terminate the TLS session 
> without this flag?

 I guess one would need to call gnutls_bye() like in this GnuTLS 
 example:
 https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102

>- Why this is only needed by multifd sessions?

 What uncovered the issue was switching the load threads to using
 migrate_set_error() instead of their own result variable
 (load_threads_ret) which you had requested during the previous
 patch set version review:
 https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

 Turns out that the multifd receive code always returned
 error in the TLS case, just nothing was previously checking for
 that error presence.
>>>
>>> What I was curious is whether this issue also exists for the main 
>>> migration
>>> channel when with tls, especially when e.g. multifd not enabled at 
>>> all.  As
>>> I don't see anywhere that qemu uses gnutls_bye() for any tls 
>>> session.
>>>
>>> I think it's a good to find that we overlooked this before.. and 
>>> IMHO it's
>>> always good we could fix this.
>>>
>>> Does it mean we need proper gnutls_bye() somewhere?
>>>
>>> If we need an explicit gnutls_bye(), then I wonder if that should 
>>> be done
>>> on the main channel as well.
>>
>> That's a good question and looking at the code 
>> qemu_loadvm_state_main() exits
>> on receiving "QEMU_VM_EOF" section (that's different from receiving 
>> socket EOF)
>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
>> explicit size
>> in qemu_loadvm_state() - so still not until channel EOF.
>
> I had a closer look, I do feel like such pre-mature termination is 
> caused
> by explicit shutdown()s of the iochannels, looks like that can cause 
> issue
> even after every

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Maciej S. Szmigiero

On 6.02.2025 15:13, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


On 5.02.2025 21:42, Fabiano Rosas wrote:

Fabiano Rosas  writes:


Daniel P. Berrangé  writes:


On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:

On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 23:56, Peter Xu wrote:

On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 21:20, Peter Xu wrote:

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.


If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.


Two pure questions..

   - What is the correct way to terminate the TLS session without this flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


   - Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.


What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.


That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.


I had a closer look, I do feel like such pre-mature termination is caused
by explicit shutdown()s of the iochannels, looks like that can cause issue
even after everything is sent.  Then I noticed indeed multifd sender
iochannels will get explicit shutdown()s since commit 077fbb5942, while we
don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because looks like it can cause tls session to interrupt even if
the shutdown() is done after sent everything, and if so it'll explain why
you hit the issue with tls.



Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.


If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?


So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?


If above theory can stand, then eof-okay could be a workaround papering
over the real problem that we shouldn't always shutdown()..

Could you have a look at below patch and see whether it can fix the problem
you hit too, in replace of these two patches (including the previous
iochannel change)?



Unfortunately, the patch below does not fix the 

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Fabiano Rosas
"Maciej S. Szmigiero"  writes:

> On 5.02.2025 21:42, Fabiano Rosas wrote:
>> Fabiano Rosas  writes:
>> 
>>> Daniel P. Berrangé  writes:
>>>
 On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 23:56, Peter Xu wrote:
>>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
 On 3.02.2025 21:20, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 19:20, Peter Xu wrote:
>>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
 From: "Maciej S. Szmigiero" 

 Multifd send channels are terminated by calling
 qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
 multifd_send_terminate_threads(), which in the TLS case essentially
 calls shutdown(SHUT_RDWR) on the underlying raw socket.

 Unfortunately, this does not terminate the TLS session properly and
 the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION 
 error.

 The only reason why this wasn't causing migration failures is 
 because
 the current migration code apparently does not check for migration
 error being set after the end of the multifd receive process.

 However, this will change soon so the multifd receive code has to 
 be
 prepared to not return an error on such premature TLS session EOF.
 Use the newly introduced QIOChannelTLS method for that.

 It's worth noting that even if the sender were to be changed to 
 terminate
 the TLS connection properly the receive side still needs to remain
 compatible with older QEMU bit stream which does not do this.
>>>
>>> If this is an existing bug, we could add a Fixes.
>>
>> It is an existing issue but only uncovered by this patch set.
>>
>> As far as I can see it was always there, so it would need some
>> thought where to point that Fixes tag.
>
> If there's no way to trigger a real functional bug anyway, it's also 
> ok we
> omit the Fixes.
>
>>> Two pure questions..
>>>
>>>   - What is the correct way to terminate the TLS session 
>>> without this flag?
>>
>> I guess one would need to call gnutls_bye() like in this GnuTLS 
>> example:
>> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>
>>>   - Why this is only needed by multifd sessions?
>>
>> What uncovered the issue was switching the load threads to using
>> migrate_set_error() instead of their own result variable
>> (load_threads_ret) which you had requested during the previous
>> patch set version review:
>> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>
>> Turns out that the multifd receive code always returned
>> error in the TLS case, just nothing was previously checking for
>> that error presence.
>
> What I was curious is whether this issue also exists for the main 
> migration
> channel when with tls, especially when e.g. multifd not enabled at 
> all.  As
> I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>
> I think it's a good to find that we overlooked this before.. and IMHO 
> it's
> always good we could fix this.
>
> Does it mean we need proper gnutls_bye() somewhere?
>
> If we need an explicit gnutls_bye(), then I wonder if that should be 
> done
> on the main channel as well.

 That's a good question and looking at the code 
 qemu_loadvm_state_main() exits
 on receiving "QEMU_VM_EOF" section (that's different from receiving 
 socket EOF)
 and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
 explicit size
 in qemu_loadvm_state() - so still not until channel EOF.
>>>
>>> I had a closer look, I do feel like such pre-mature termination is 
>>> caused
>>> by explicit shutdown()s of the iochannels, looks like that can cause 
>>> issue
>>> even after everything is sent.  Then I noticed indeed multifd sender
>>> iochannels will get explicit shutdown()s since commit 077fbb5942, while 
>>> we
>>> don't do that for the main channel.  Maybe that is a major difference.
>>>
>>> Now I wonder whether we should shutdown() the channel at all if 
>>> migration
>>> succeeded, because looks like it can cause tls session to i

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-06 Thread Fabiano Rosas
Peter Xu  writes:

> On Wed, Feb 05, 2025 at 05:42:37PM -0300, Fabiano Rosas wrote:
>> Fabiano Rosas  writes:
>> 
>> > Daniel P. Berrangé  writes:
>> >
>> >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>> >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>> >>> > On 3.02.2025 23:56, Peter Xu wrote:
>> >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> >>> > > > On 3.02.2025 21:20, Peter Xu wrote:
>> >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero 
>> >>> > > > > wrote:
>> >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
>> >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. 
>> >>> > > > > > > Szmigiero wrote:
>> >>> > > > > > > > From: "Maciej S. Szmigiero" 
>> >>> > > > > > > > 
>> >>> > > > > > > > Multifd send channels are terminated by calling
>> >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case 
>> >>> > > > > > > > essentially
>> >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> >>> > > > > > > > 
>> >>> > > > > > > > Unfortunately, this does not terminate the TLS session 
>> >>> > > > > > > > properly and
>> >>> > > > > > > > the receive side sees this as a 
>> >>> > > > > > > > GNUTLS_E_PREMATURE_TERMINATION error.
>> >>> > > > > > > > 
>> >>> > > > > > > > The only reason why this wasn't causing migration failures 
>> >>> > > > > > > > is because
>> >>> > > > > > > > the current migration code apparently does not check for 
>> >>> > > > > > > > migration
>> >>> > > > > > > > error being set after the end of the multifd receive 
>> >>> > > > > > > > process.
>> >>> > > > > > > > 
>> >>> > > > > > > > However, this will change soon so the multifd receive code 
>> >>> > > > > > > > has to be
>> >>> > > > > > > > prepared to not return an error on such premature TLS 
>> >>> > > > > > > > session EOF.
>> >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
>> >>> > > > > > > > 
>> >>> > > > > > > > It's worth noting that even if the sender were to be 
>> >>> > > > > > > > changed to terminate
>> >>> > > > > > > > the TLS connection properly the receive side still needs 
>> >>> > > > > > > > to remain
>> >>> > > > > > > > compatible with older QEMU bit stream which does not do 
>> >>> > > > > > > > this.
>> >>> > > > > > > 
>> >>> > > > > > > If this is an existing bug, we could add a Fixes.
>> >>> > > > > > 
>> >>> > > > > > It is an existing issue but only uncovered by this patch set.
>> >>> > > > > > 
>> >>> > > > > > As far as I can see it was always there, so it would need some
>> >>> > > > > > thought where to point that Fixes tag.
>> >>> > > > > 
>> >>> > > > > If there's no way to trigger a real functional bug anyway, it's 
>> >>> > > > > also ok we
>> >>> > > > > omit the Fixes.
>> >>> > > > > 
>> >>> > > > > > > Two pure questions..
>> >>> > > > > > > 
>> >>> > > > > > >  - What is the correct way to terminate the TLS session 
>> >>> > > > > > > without this flag?
>> >>> > > > > > 
>> >>> > > > > > I guess one would need to call gnutls_bye() like in this 
>> >>> > > > > > GnuTLS example:
>> >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>> >>> > > > > > 
>> >>> > > > > > >  - Why this is only needed by multifd sessions?
>> >>> > > > > > 
>> >>> > > > > > What uncovered the issue was switching the load threads to 
>> >>> > > > > > using
>> >>> > > > > > migrate_set_error() instead of their own result variable
>> >>> > > > > > (load_threads_ret) which you had requested during the previous
>> >>> > > > > > patch set version review:
>> >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>> >>> > > > > > 
>> >>> > > > > > Turns out that the multifd receive code always returned
>> >>> > > > > > error in the TLS case, just nothing was previously checking for
>> >>> > > > > > that error presence.
>> >>> > > > > 
>> >>> > > > > What I was curious is whether this issue also exists for the 
>> >>> > > > > main migration
>> >>> > > > > channel when with tls, especially when e.g. multifd not enabled 
>> >>> > > > > at all.  As
>> >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls 
>> >>> > > > > session.
>> >>> > > > > 
>> >>> > > > > I think it's a good to find that we overlooked this before.. and 
>> >>> > > > > IMHO it's
>> >>> > > > > always good we could fix this.
>> >>> > > > > 
>> >>> > > > > Does it mean we need proper gnutls_bye() somewhere?
>> >>> > > > > 
>> >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that 
>> >>> > > > > should be done
>> >>> > > > > on the main channel as well.
>> >>> > > > 
>> >>> > > > That's a good question and looking at the code 
>> >>> > > > qemu_loadvm_state_main() exits
>> >>> > > > on receiving "QEMU_VM_EOF" section (that's different from 
>> >>> > > > receiving socke

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-05 Thread Peter Xu
On Wed, Feb 05, 2025 at 05:42:37PM -0300, Fabiano Rosas wrote:
> Fabiano Rosas  writes:
> 
> > Daniel P. Berrangé  writes:
> >
> >> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
> >>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
> >>> > On 3.02.2025 23:56, Peter Xu wrote:
> >>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> >>> > > > On 3.02.2025 21:20, Peter Xu wrote:
> >>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero 
> >>> > > > > wrote:
> >>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
> >>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero 
> >>> > > > > > > wrote:
> >>> > > > > > > > From: "Maciej S. Szmigiero" 
> >>> > > > > > > > 
> >>> > > > > > > > Multifd send channels are terminated by calling
> >>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> >>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case 
> >>> > > > > > > > essentially
> >>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> >>> > > > > > > > 
> >>> > > > > > > > Unfortunately, this does not terminate the TLS session 
> >>> > > > > > > > properly and
> >>> > > > > > > > the receive side sees this as a 
> >>> > > > > > > > GNUTLS_E_PREMATURE_TERMINATION error.
> >>> > > > > > > > 
> >>> > > > > > > > The only reason why this wasn't causing migration failures 
> >>> > > > > > > > is because
> >>> > > > > > > > the current migration code apparently does not check for 
> >>> > > > > > > > migration
> >>> > > > > > > > error being set after the end of the multifd receive 
> >>> > > > > > > > process.
> >>> > > > > > > > 
> >>> > > > > > > > However, this will change soon so the multifd receive code 
> >>> > > > > > > > has to be
> >>> > > > > > > > prepared to not return an error on such premature TLS 
> >>> > > > > > > > session EOF.
> >>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
> >>> > > > > > > > 
> >>> > > > > > > > It's worth noting that even if the sender were to be 
> >>> > > > > > > > changed to terminate
> >>> > > > > > > > the TLS connection properly the receive side still needs to 
> >>> > > > > > > > remain
> >>> > > > > > > > compatible with older QEMU bit stream which does not do 
> >>> > > > > > > > this.
> >>> > > > > > > 
> >>> > > > > > > If this is an existing bug, we could add a Fixes.
> >>> > > > > > 
> >>> > > > > > It is an existing issue but only uncovered by this patch set.
> >>> > > > > > 
> >>> > > > > > As far as I can see it was always there, so it would need some
> >>> > > > > > thought where to point that Fixes tag.
> >>> > > > > 
> >>> > > > > If there's no way to trigger a real functional bug anyway, it's 
> >>> > > > > also ok we
> >>> > > > > omit the Fixes.
> >>> > > > > 
> >>> > > > > > > Two pure questions..
> >>> > > > > > > 
> >>> > > > > > >  - What is the correct way to terminate the TLS session 
> >>> > > > > > > without this flag?
> >>> > > > > > 
> >>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS 
> >>> > > > > > example:
> >>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> >>> > > > > > 
> >>> > > > > > >  - Why this is only needed by multifd sessions?
> >>> > > > > > 
> >>> > > > > > What uncovered the issue was switching the load threads to using
> >>> > > > > > migrate_set_error() instead of their own result variable
> >>> > > > > > (load_threads_ret) which you had requested during the previous
> >>> > > > > > patch set version review:
> >>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> >>> > > > > > 
> >>> > > > > > Turns out that the multifd receive code always returned
> >>> > > > > > error in the TLS case, just nothing was previously checking for
> >>> > > > > > that error presence.
> >>> > > > > 
> >>> > > > > What I was curious is whether this issue also exists for the main 
> >>> > > > > migration
> >>> > > > > channel when with tls, especially when e.g. multifd not enabled 
> >>> > > > > at all.  As
> >>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls 
> >>> > > > > session.
> >>> > > > > 
> >>> > > > > I think it's a good to find that we overlooked this before.. and 
> >>> > > > > IMHO it's
> >>> > > > > always good we could fix this.
> >>> > > > > 
> >>> > > > > Does it mean we need proper gnutls_bye() somewhere?
> >>> > > > > 
> >>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should 
> >>> > > > > be done
> >>> > > > > on the main channel as well.
> >>> > > > 
> >>> > > > That's a good question and looking at the code 
> >>> > > > qemu_loadvm_state_main() exits
> >>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving 
> >>> > > > socket EOF)
> >>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
> >>> > > > explicit size
> >>> > > > in qemu_loadvm_s

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-05 Thread Maciej S. Szmigiero

On 5.02.2025 21:42, Fabiano Rosas wrote:

Fabiano Rosas  writes:


Daniel P. Berrangé  writes:


On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:

On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 23:56, Peter Xu wrote:

On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 21:20, Peter Xu wrote:

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.


If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.


Two pure questions..

  - What is the correct way to terminate the TLS session without this flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


  - Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.


What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.


That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.


I had a closer look, I do feel like such pre-mature termination is caused
by explicit shutdown()s of the iochannels, looks like that can cause issue
even after everything is sent.  Then I noticed indeed multifd sender
iochannels will get explicit shutdown()s since commit 077fbb5942, while we
don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because looks like it can cause tls session to interrupt even if
the shutdown() is done after sent everything, and if so it'll explain why
you hit the issue with tls.



Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.


If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?


So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?


If above theory can stand, then eof-okay could be a workaround papering
over the real problem that we shouldn't always shutdown()..

Could you have a look at below patch and see whether it can fix the problem
you hit too, in replace of these two patches (including the previous
iochannel change)?



Unfortunately, the patch below does not fix the problem:

qemu-system-x86_64: Cannot read from TLS channel: The TLS connectio

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-05 Thread Fabiano Rosas
Fabiano Rosas  writes:

> Daniel P. Berrangé  writes:
>
>> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>>> > On 3.02.2025 23:56, Peter Xu wrote:
>>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>> > > > On 3.02.2025 21:20, Peter Xu wrote:
>>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
>>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero 
>>> > > > > > > wrote:
>>> > > > > > > > From: "Maciej S. Szmigiero" 
>>> > > > > > > > 
>>> > > > > > > > Multifd send channels are terminated by calling
>>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case 
>>> > > > > > > > essentially
>>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>> > > > > > > > 
>>> > > > > > > > Unfortunately, this does not terminate the TLS session 
>>> > > > > > > > properly and
>>> > > > > > > > the receive side sees this as a 
>>> > > > > > > > GNUTLS_E_PREMATURE_TERMINATION error.
>>> > > > > > > > 
>>> > > > > > > > The only reason why this wasn't causing migration failures is 
>>> > > > > > > > because
>>> > > > > > > > the current migration code apparently does not check for 
>>> > > > > > > > migration
>>> > > > > > > > error being set after the end of the multifd receive process.
>>> > > > > > > > 
>>> > > > > > > > However, this will change soon so the multifd receive code 
>>> > > > > > > > has to be
>>> > > > > > > > prepared to not return an error on such premature TLS session 
>>> > > > > > > > EOF.
>>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
>>> > > > > > > > 
>>> > > > > > > > It's worth noting that even if the sender were to be changed 
>>> > > > > > > > to terminate
>>> > > > > > > > the TLS connection properly the receive side still needs to 
>>> > > > > > > > remain
>>> > > > > > > > compatible with older QEMU bit stream which does not do this.
>>> > > > > > > 
>>> > > > > > > If this is an existing bug, we could add a Fixes.
>>> > > > > > 
>>> > > > > > It is an existing issue but only uncovered by this patch set.
>>> > > > > > 
>>> > > > > > As far as I can see it was always there, so it would need some
>>> > > > > > thought where to point that Fixes tag.
>>> > > > > 
>>> > > > > If there's no way to trigger a real functional bug anyway, it's 
>>> > > > > also ok we
>>> > > > > omit the Fixes.
>>> > > > > 
>>> > > > > > > Two pure questions..
>>> > > > > > > 
>>> > > > > > >  - What is the correct way to terminate the TLS session 
>>> > > > > > > without this flag?
>>> > > > > > 
>>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS 
>>> > > > > > example:
>>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>>> > > > > > 
>>> > > > > > >  - Why this is only needed by multifd sessions?
>>> > > > > > 
>>> > > > > > What uncovered the issue was switching the load threads to using
>>> > > > > > migrate_set_error() instead of their own result variable
>>> > > > > > (load_threads_ret) which you had requested during the previous
>>> > > > > > patch set version review:
>>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>>> > > > > > 
>>> > > > > > Turns out that the multifd receive code always returned
>>> > > > > > error in the TLS case, just nothing was previously checking for
>>> > > > > > that error presence.
>>> > > > > 
>>> > > > > What I was curious is whether this issue also exists for the main 
>>> > > > > migration
>>> > > > > channel when with tls, especially when e.g. multifd not enabled at 
>>> > > > > all.  As
>>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls 
>>> > > > > session.
>>> > > > > 
>>> > > > > I think it's a good to find that we overlooked this before.. and 
>>> > > > > IMHO it's
>>> > > > > always good we could fix this.
>>> > > > > 
>>> > > > > Does it mean we need proper gnutls_bye() somewhere?
>>> > > > > 
>>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should 
>>> > > > > be done
>>> > > > > on the main channel as well.
>>> > > > 
>>> > > > That's a good question and looking at the code 
>>> > > > qemu_loadvm_state_main() exits
>>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving 
>>> > > > socket EOF)
>>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
>>> > > > explicit size
>>> > > > in qemu_loadvm_state() - so still not until channel EOF.
>>> > > 
>>> > > I had a closer look, I do feel like such pre-mature termination is 
>>> > > caused
>>> > > by explicit shutdown()s of the iochannels, looks like that can cause 
>>> > > issue
>>> > > even after everything is sent.  Then I noticed indeed multifd sender
>>> > > io

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-05 Thread Fabiano Rosas
Daniel P. Berrangé  writes:

> On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
>> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
>> > On 3.02.2025 23:56, Peter Xu wrote:
>> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> > > > On 3.02.2025 21:20, Peter Xu wrote:
>> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
>> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero 
>> > > > > > > wrote:
>> > > > > > > > From: "Maciej S. Szmigiero" 
>> > > > > > > > 
>> > > > > > > > Multifd send channels are terminated by calling
>> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> > > > > > > > multifd_send_terminate_threads(), which in the TLS case 
>> > > > > > > > essentially
>> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> > > > > > > > 
>> > > > > > > > Unfortunately, this does not terminate the TLS session 
>> > > > > > > > properly and
>> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION 
>> > > > > > > > error.
>> > > > > > > > 
>> > > > > > > > The only reason why this wasn't causing migration failures is 
>> > > > > > > > because
>> > > > > > > > the current migration code apparently does not check for 
>> > > > > > > > migration
>> > > > > > > > error being set after the end of the multifd receive process.
>> > > > > > > > 
>> > > > > > > > However, this will change soon so the multifd receive code has 
>> > > > > > > > to be
>> > > > > > > > prepared to not return an error on such premature TLS session 
>> > > > > > > > EOF.
>> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
>> > > > > > > > 
>> > > > > > > > It's worth noting that even if the sender were to be changed 
>> > > > > > > > to terminate
>> > > > > > > > the TLS connection properly the receive side still needs to 
>> > > > > > > > remain
>> > > > > > > > compatible with older QEMU bit stream which does not do this.
>> > > > > > > 
>> > > > > > > If this is an existing bug, we could add a Fixes.
>> > > > > > 
>> > > > > > It is an existing issue but only uncovered by this patch set.
>> > > > > > 
>> > > > > > As far as I can see it was always there, so it would need some
>> > > > > > thought where to point that Fixes tag.
>> > > > > 
>> > > > > If there's no way to trigger a real functional bug anyway, it's also 
>> > > > > ok we
>> > > > > omit the Fixes.
>> > > > > 
>> > > > > > > Two pure questions..
>> > > > > > > 
>> > > > > > >  - What is the correct way to terminate the TLS session 
>> > > > > > > without this flag?
>> > > > > > 
>> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS 
>> > > > > > example:
>> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>> > > > > > 
>> > > > > > >  - Why this is only needed by multifd sessions?
>> > > > > > 
>> > > > > > What uncovered the issue was switching the load threads to using
>> > > > > > migrate_set_error() instead of their own result variable
>> > > > > > (load_threads_ret) which you had requested during the previous
>> > > > > > patch set version review:
>> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>> > > > > > 
>> > > > > > Turns out that the multifd receive code always returned
>> > > > > > error in the TLS case, just nothing was previously checking for
>> > > > > > that error presence.
>> > > > > 
>> > > > > What I was curious is whether this issue also exists for the main 
>> > > > > migration
>> > > > > channel when with tls, especially when e.g. multifd not enabled at 
>> > > > > all.  As
>> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>> > > > > 
>> > > > > I think it's a good to find that we overlooked this before.. and 
>> > > > > IMHO it's
>> > > > > always good we could fix this.
>> > > > > 
>> > > > > Does it mean we need proper gnutls_bye() somewhere?
>> > > > > 
>> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be 
>> > > > > done
>> > > > > on the main channel as well.
>> > > > 
>> > > > That's a good question and looking at the code 
>> > > > qemu_loadvm_state_main() exits
>> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving 
>> > > > socket EOF)
>> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
>> > > > explicit size
>> > > > in qemu_loadvm_state() - so still not until channel EOF.
>> > > 
>> > > I had a closer look, I do feel like such pre-mature termination is caused
>> > > by explicit shutdown()s of the iochannels, looks like that can cause 
>> > > issue
>> > > even after everything is sent.  Then I noticed indeed multifd sender
>> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while 
>> > > we
>> > > don't do that for the main channel.  Maybe that is a major dif

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Maciej S. Szmigiero

On 4.02.2025 19:25, Fabiano Rosas wrote:

Peter Xu  writes:


On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote:

On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.

Two pure questions..

   - What is the correct way to terminate the TLS session without this flag?

   - Why this is only needed by multifd sessions?


Graceful TLS termination (via gnutls_bye()) should only be important to
security if the QEMU protocol in question does not know how much data it
is expecting to recieve. ie it cannot otherwise distinguish between an
expected EOF, and a premature EOF triggered by an attacker.

If the migration protocol has sufficient info to know when a chanel is
expected to see EOF, then we should stop trying to read from the TLS
channel before seeing the underlying EOF.

Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
migration will still fail corretly in the case of a malicious attack
causing premature termination.

If there's a risk that migration may succeed, but with incomplete data,
then we would need the full gnutls_bye dance.


IIUC that's not required for migration then, because migration should know
exactly how much data to receive, and migration should need to verify that
and fail if the received data didn't match the expectation along the way.
We also have QEMU_VM_EOF as the end mark of stream.


The migration overall can detect whether EOF should have been reached,
but multifd threads cannot. If one multifd channel experiences an issue
and sees a premature termination, but ignores it, then that's a hang in
QEMU because nothing provided the syncs needed (p->sem_sync, most
likely).


I think allowing premature TLS termination simply makes the TLS case
function the same as non-TLS case here if the source were to close the
multifd channel(s) early.


Aren't we just postponing a bug?



Thanks,
Maciej




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Fabiano Rosas
Peter Xu  writes:

> On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote:
>> On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
>> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> > > From: "Maciej S. Szmigiero" 
>> > > 
>> > > Multifd send channels are terminated by calling
>> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> > > multifd_send_terminate_threads(), which in the TLS case essentially
>> > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> > > 
>> > > Unfortunately, this does not terminate the TLS session properly and
>> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>> > > 
>> > > The only reason why this wasn't causing migration failures is because
>> > > the current migration code apparently does not check for migration
>> > > error being set after the end of the multifd receive process.
>> > > 
>> > > However, this will change soon so the multifd receive code has to be
>> > > prepared to not return an error on such premature TLS session EOF.
>> > > Use the newly introduced QIOChannelTLS method for that.
>> > > 
>> > > It's worth noting that even if the sender were to be changed to terminate
>> > > the TLS connection properly the receive side still needs to remain
>> > > compatible with older QEMU bit stream which does not do this.
>> > 
>> > If this is an existing bug, we could add a Fixes.
>> > 
>> > Two pure questions..
>> > 
>> >   - What is the correct way to terminate the TLS session without this flag?
>> > 
>> >   - Why this is only needed by multifd sessions?
>> 
>> Graceful TLS termination (via gnutls_bye()) should only be important to
>> security if the QEMU protocol in question does not know how much data it
>> is expecting to recieve. ie it cannot otherwise distinguish between an
>> expected EOF, and a premature EOF triggered by an attacker.
>> 
>> If the migration protocol has sufficient info to know when a chanel is
>> expected to see EOF, then we should stop trying to read from the TLS
>> channel before seeing the underlying EOF.
>> 
>> Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
>> migration will still fail corretly in the case of a malicious attack
>> causing premature termination.
>> 
>> If there's a risk that migration may succeed, but with incomplete data,
>> then we would need the full gnutls_bye dance.
>
> IIUC that's not required for migration then, because migration should know
> exactly how much data to receive, and migration should need to verify that
> and fail if the received data didn't match the expectation along the way.
> We also have QEMU_VM_EOF as the end mark of stream.

The migration overall can detect whether EOF should have been reached,
but multifd threads cannot. If one multifd channel experiences an issue
and sees a premature termination, but ignores it, then that's a hang in
QEMU because nothing provided the syncs needed (p->sem_sync, most
likely).

Aren't we just postponing a bug?

>
> Said that, are we sure any pre-mature termination will only happen after
> all data read in the receive buffer that was sent?
>
> To ask in another way: what happens if the source QEMU sends everything and
> shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
> data to read, and (2) a pre-mature terminatino of TLS session in a read()
> syscall.  Would (2) be reported even before (1), or the order guaranteed
> that read of the residue data in (1) always happen before (2) (considering
> dest QEMU can be slow sometime on consuming the network buffers)?
>
> Thanks,



Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Peter Xu
On Tue, Feb 04, 2025 at 04:12:15PM +, Daniel P. Berrangé wrote:
> On Tue, Feb 04, 2025 at 11:02:28AM -0500, Peter Xu wrote:
> > On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote:
> > > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" 
> > > > > 
> > > > > Multifd send channels are terminated by calling
> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > 
> > > > > Unfortunately, this does not terminate the TLS session properly and
> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > > 
> > > > > The only reason why this wasn't causing migration failures is because
> > > > > the current migration code apparently does not check for migration
> > > > > error being set after the end of the multifd receive process.
> > > > > 
> > > > > However, this will change soon so the multifd receive code has to be
> > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > 
> > > > > It's worth noting that even if the sender were to be changed to 
> > > > > terminate
> > > > > the TLS connection properly the receive side still needs to remain
> > > > > compatible with older QEMU bit stream which does not do this.
> > > > 
> > > > If this is an existing bug, we could add a Fixes.
> > > > 
> > > > Two pure questions..
> > > > 
> > > >   - What is the correct way to terminate the TLS session without this 
> > > > flag?
> > > > 
> > > >   - Why this is only needed by multifd sessions?
> > > 
> > > Graceful TLS termination (via gnutls_bye()) should only be important to
> > > security if the QEMU protocol in question does not know how much data it
> > > is expecting to recieve. ie it cannot otherwise distinguish between an
> > > expected EOF, and a premature EOF triggered by an attacker.
> > > 
> > > If the migration protocol has sufficient info to know when a chanel is
> > > expected to see EOF, then we should stop trying to read from the TLS
> > > channel before seeing the underlying EOF.
> > > 
> > > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
> > > migration will still fail corretly in the case of a malicious attack
> > > causing premature termination.
> > > 
> > > If there's a risk that migration may succeed, but with incomplete data,
> > > then we would need the full gnutls_bye dance.
> > 
> > IIUC that's not required for migration then, because migration should know
> > exactly how much data to receive, and migration should need to verify that
> > and fail if the received data didn't match the expectation along the way.
> > We also have QEMU_VM_EOF as the end mark of stream.
> > 
> > Said that, are we sure any pre-mature termination will only happen after
> > all data read in the receive buffer that was sent?
> > 
> > To ask in another way: what happens if the source QEMU sends everything and
> > shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
> > data to read, and (2) a pre-mature terminatino of TLS session in a read()
> > syscall.  Would (2) be reported even before (1), or the order guaranteed
> > that read of the residue data in (1) always happen before (2) (considering
> > dest QEMU can be slow sometime on consuming the network buffers)?
> 
> That's not logically possible.
> 
> In both (1) and (2) you are issuing a read() call to the TLS channel.
> 
> The first read call(s) consume all incoming data. Only once the underlying
> TCP socket read() returns 0, would GNUTLS  see that it hasn't got any
> TLS "bye" packet, and thus return GNUTLS_E_PREMATURE_TERMINATION from
> the layered TLS read(). IOW, if you see GNUTLS_E_PREMATURE_TERMINATION
> you know you have already read all received data off the socket.

That looks all OK then.  In that case we could set all migration TLS
sessions to ignore premature terminations.

Thanks,

-- 
Peter Xu




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Peter Xu
On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote:
> On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" 
> > > 
> > > Multifd send channels are terminated by calling
> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > 
> > > Unfortunately, this does not terminate the TLS session properly and
> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > 
> > > The only reason why this wasn't causing migration failures is because
> > > the current migration code apparently does not check for migration
> > > error being set after the end of the multifd receive process.
> > > 
> > > However, this will change soon so the multifd receive code has to be
> > > prepared to not return an error on such premature TLS session EOF.
> > > Use the newly introduced QIOChannelTLS method for that.
> > > 
> > > It's worth noting that even if the sender were to be changed to terminate
> > > the TLS connection properly the receive side still needs to remain
> > > compatible with older QEMU bit stream which does not do this.
> > 
> > If this is an existing bug, we could add a Fixes.
> > 
> > Two pure questions..
> > 
> >   - What is the correct way to terminate the TLS session without this flag?
> > 
> >   - Why this is only needed by multifd sessions?
> 
> Graceful TLS termination (via gnutls_bye()) should only be important to
> security if the QEMU protocol in question does not know how much data it
> is expecting to recieve. ie it cannot otherwise distinguish between an
> expected EOF, and a premature EOF triggered by an attacker.
> 
> If the migration protocol has sufficient info to know when a chanel is
> expected to see EOF, then we should stop trying to read from the TLS
> channel before seeing the underlying EOF.
> 
> Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
> migration will still fail corretly in the case of a malicious attack
> causing premature termination.
> 
> If there's a risk that migration may succeed, but with incomplete data,
> then we would need the full gnutls_bye dance.

IIUC that's not required for migration then, because migration should know
exactly how much data to receive, and migration should need to verify that
and fail if the received data didn't match the expectation along the way.
We also have QEMU_VM_EOF as the end mark of stream.

Said that, are we sure any pre-mature termination will only happen after
all data read in the receive buffer that was sent?

To ask in another way: what happens if the source QEMU sends everything and
shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
data to read, and (2) a pre-mature terminatino of TLS session in a read()
syscall.  Would (2) be reported even before (1), or the order guaranteed
that read of the residue data in (1) always happen before (2) (considering
dest QEMU can be slow sometime on consuming the network buffers)?

Thanks,

-- 
Peter Xu




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Daniel P . Berrangé
On Tue, Feb 04, 2025 at 11:02:28AM -0500, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:08:02PM +, Daniel P. Berrangé wrote:
> > On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > From: "Maciej S. Szmigiero" 
> > > > 
> > > > Multifd send channels are terminated by calling
> > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > 
> > > > Unfortunately, this does not terminate the TLS session properly and
> > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > 
> > > > The only reason why this wasn't causing migration failures is because
> > > > the current migration code apparently does not check for migration
> > > > error being set after the end of the multifd receive process.
> > > > 
> > > > However, this will change soon so the multifd receive code has to be
> > > > prepared to not return an error on such premature TLS session EOF.
> > > > Use the newly introduced QIOChannelTLS method for that.
> > > > 
> > > > It's worth noting that even if the sender were to be changed to 
> > > > terminate
> > > > the TLS connection properly the receive side still needs to remain
> > > > compatible with older QEMU bit stream which does not do this.
> > > 
> > > If this is an existing bug, we could add a Fixes.
> > > 
> > > Two pure questions..
> > > 
> > >   - What is the correct way to terminate the TLS session without this 
> > > flag?
> > > 
> > >   - Why this is only needed by multifd sessions?
> > 
> > Graceful TLS termination (via gnutls_bye()) should only be important to
> > security if the QEMU protocol in question does not know how much data it
> > is expecting to recieve. ie it cannot otherwise distinguish between an
> > expected EOF, and a premature EOF triggered by an attacker.
> > 
> > If the migration protocol has sufficient info to know when a chanel is
> > expected to see EOF, then we should stop trying to read from the TLS
> > channel before seeing the underlying EOF.
> > 
> > Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
> > migration will still fail corretly in the case of a malicious attack
> > causing premature termination.
> > 
> > If there's a risk that migration may succeed, but with incomplete data,
> > then we would need the full gnutls_bye dance.
> 
> IIUC that's not required for migration then, because migration should know
> exactly how much data to receive, and migration should need to verify that
> and fail if the received data didn't match the expectation along the way.
> We also have QEMU_VM_EOF as the end mark of stream.
> 
> Said that, are we sure any pre-mature termination will only happen after
> all data read in the receive buffer that was sent?
> 
> To ask in another way: what happens if the source QEMU sends everything and
> shutdown()/close() the channel, meanwhile the dest QEMU sees both (1) rest
> data to read, and (2) a pre-mature terminatino of TLS session in a read()
> syscall.  Would (2) be reported even before (1), or the order guaranteed
> that read of the residue data in (1) always happen before (2) (considering
> dest QEMU can be slow sometime on consuming the network buffers)?

That's not logically possible.

In both (1) and (2) you are issuing a read() call to the TLS channel.

The first read call(s) consume all incoming data. Only once the underlying
TCP socket read() returns 0, would GNUTLS  see that it hasn't got any
TLS "bye" packet, and thus return GNUTLS_E_PREMATURE_TERMINATION from
the layered TLS read(). IOW, if you see GNUTLS_E_PREMATURE_TERMINATION
you know you have already read all received data off the socket.


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Daniel P . Berrangé
On Tue, Feb 04, 2025 at 10:31:31AM -0500, Peter Xu wrote:
> On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
> > On 3.02.2025 23:56, Peter Xu wrote:
> > > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> > > > On 3.02.2025 21:20, Peter Xu wrote:
> > > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > > > > > On 3.02.2025 19:20, Peter Xu wrote:
> > > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero 
> > > > > > > wrote:
> > > > > > > > From: "Maciej S. Szmigiero" 
> > > > > > > > 
> > > > > > > > Multifd send channels are terminated by calling
> > > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > > > > multifd_send_terminate_threads(), which in the TLS case 
> > > > > > > > essentially
> > > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > > > > 
> > > > > > > > Unfortunately, this does not terminate the TLS session properly 
> > > > > > > > and
> > > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION 
> > > > > > > > error.
> > > > > > > > 
> > > > > > > > The only reason why this wasn't causing migration failures is 
> > > > > > > > because
> > > > > > > > the current migration code apparently does not check for 
> > > > > > > > migration
> > > > > > > > error being set after the end of the multifd receive process.
> > > > > > > > 
> > > > > > > > However, this will change soon so the multifd receive code has 
> > > > > > > > to be
> > > > > > > > prepared to not return an error on such premature TLS session 
> > > > > > > > EOF.
> > > > > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > > > > 
> > > > > > > > It's worth noting that even if the sender were to be changed to 
> > > > > > > > terminate
> > > > > > > > the TLS connection properly the receive side still needs to 
> > > > > > > > remain
> > > > > > > > compatible with older QEMU bit stream which does not do this.
> > > > > > > 
> > > > > > > If this is an existing bug, we could add a Fixes.
> > > > > > 
> > > > > > It is an existing issue but only uncovered by this patch set.
> > > > > > 
> > > > > > As far as I can see it was always there, so it would need some
> > > > > > thought where to point that Fixes tag.
> > > > > 
> > > > > If there's no way to trigger a real functional bug anyway, it's also 
> > > > > ok we
> > > > > omit the Fixes.
> > > > > 
> > > > > > > Two pure questions..
> > > > > > > 
> > > > > > >  - What is the correct way to terminate the TLS session 
> > > > > > > without this flag?
> > > > > > 
> > > > > > I guess one would need to call gnutls_bye() like in this GnuTLS 
> > > > > > example:
> > > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > > > > > 
> > > > > > >  - Why this is only needed by multifd sessions?
> > > > > > 
> > > > > > What uncovered the issue was switching the load threads to using
> > > > > > migrate_set_error() instead of their own result variable
> > > > > > (load_threads_ret) which you had requested during the previous
> > > > > > patch set version review:
> > > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > > > > > 
> > > > > > Turns out that the multifd receive code always returned
> > > > > > error in the TLS case, just nothing was previously checking for
> > > > > > that error presence.
> > > > > 
> > > > > What I was curious is whether this issue also exists for the main 
> > > > > migration
> > > > > channel when with tls, especially when e.g. multifd not enabled at 
> > > > > all.  As
> > > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> > > > > 
> > > > > I think it's a good to find that we overlooked this before.. and IMHO 
> > > > > it's
> > > > > always good we could fix this.
> > > > > 
> > > > > Does it mean we need proper gnutls_bye() somewhere?
> > > > > 
> > > > > If we need an explicit gnutls_bye(), then I wonder if that should be 
> > > > > done
> > > > > on the main channel as well.
> > > > 
> > > > That's a good question and looking at the code qemu_loadvm_state_main() 
> > > > exits
> > > > on receiving "QEMU_VM_EOF" section (that's different from receiving 
> > > > socket EOF)
> > > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with 
> > > > explicit size
> > > > in qemu_loadvm_state() - so still not until channel EOF.
> > > 
> > > I had a closer look, I do feel like such pre-mature termination is caused
> > > by explicit shutdown()s of the iochannels, looks like that can cause issue
> > > even after everything is sent.  Then I noticed indeed multifd sender
> > > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> > > don't do that for the main channel.  Maybe that is a major difference.
> > > 
> > > Now I wonder whether we should shutdown() the channel at all if migration
> > > succeeded, because looks like it can cause tl

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Daniel P . Berrangé
On Mon, Feb 03, 2025 at 01:20:01PM -0500, Peter Xu wrote:
> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > From: "Maciej S. Szmigiero" 
> > 
> > Multifd send channels are terminated by calling
> > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > multifd_send_terminate_threads(), which in the TLS case essentially
> > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > 
> > Unfortunately, this does not terminate the TLS session properly and
> > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > 
> > The only reason why this wasn't causing migration failures is because
> > the current migration code apparently does not check for migration
> > error being set after the end of the multifd receive process.
> > 
> > However, this will change soon so the multifd receive code has to be
> > prepared to not return an error on such premature TLS session EOF.
> > Use the newly introduced QIOChannelTLS method for that.
> > 
> > It's worth noting that even if the sender were to be changed to terminate
> > the TLS connection properly the receive side still needs to remain
> > compatible with older QEMU bit stream which does not do this.
> 
> If this is an existing bug, we could add a Fixes.
> 
> Two pure questions..
> 
>   - What is the correct way to terminate the TLS session without this flag?
> 
>   - Why this is only needed by multifd sessions?

Graceful TLS termination (via gnutls_bye()) should only be important to
security if the QEMU protocol in question does not know how much data it
is expecting to recieve. ie it cannot otherwise distinguish between an
expected EOF, and a premature EOF triggered by an attacker.

If the migration protocol has sufficient info to know when a chanel is
expected to see EOF, then we should stop trying to read from the TLS
channel before seeing the underlying EOF.

Ignoring GNUTLS_E_PREMATURE_TERMINATION would be valid if we know that
migration will still fail corretly in the case of a malicious attack
causing premature termination.

If there's a risk that migration may succeed, but with incomplete data,
then we would need the full gnutls_bye dance.

With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Peter Xu
On Tue, Feb 04, 2025 at 03:39:00PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 23:56, Peter Xu wrote:
> > On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> > > On 3.02.2025 21:20, Peter Xu wrote:
> > > > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > > > > On 3.02.2025 19:20, Peter Xu wrote:
> > > > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > > > From: "Maciej S. Szmigiero" 
> > > > > > > 
> > > > > > > Multifd send channels are terminated by calling
> > > > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > > > multifd_send_terminate_threads(), which in the TLS case 
> > > > > > > essentially
> > > > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > > > 
> > > > > > > Unfortunately, this does not terminate the TLS session properly 
> > > > > > > and
> > > > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION 
> > > > > > > error.
> > > > > > > 
> > > > > > > The only reason why this wasn't causing migration failures is 
> > > > > > > because
> > > > > > > the current migration code apparently does not check for migration
> > > > > > > error being set after the end of the multifd receive process.
> > > > > > > 
> > > > > > > However, this will change soon so the multifd receive code has to 
> > > > > > > be
> > > > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > > > 
> > > > > > > It's worth noting that even if the sender were to be changed to 
> > > > > > > terminate
> > > > > > > the TLS connection properly the receive side still needs to remain
> > > > > > > compatible with older QEMU bit stream which does not do this.
> > > > > > 
> > > > > > If this is an existing bug, we could add a Fixes.
> > > > > 
> > > > > It is an existing issue but only uncovered by this patch set.
> > > > > 
> > > > > As far as I can see it was always there, so it would need some
> > > > > thought where to point that Fixes tag.
> > > > 
> > > > If there's no way to trigger a real functional bug anyway, it's also ok 
> > > > we
> > > > omit the Fixes.
> > > > 
> > > > > > Two pure questions..
> > > > > > 
> > > > > >  - What is the correct way to terminate the TLS session without 
> > > > > > this flag?
> > > > > 
> > > > > I guess one would need to call gnutls_bye() like in this GnuTLS 
> > > > > example:
> > > > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > > > > 
> > > > > >  - Why this is only needed by multifd sessions?
> > > > > 
> > > > > What uncovered the issue was switching the load threads to using
> > > > > migrate_set_error() instead of their own result variable
> > > > > (load_threads_ret) which you had requested during the previous
> > > > > patch set version review:
> > > > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > > > > 
> > > > > Turns out that the multifd receive code always returned
> > > > > error in the TLS case, just nothing was previously checking for
> > > > > that error presence.
> > > > 
> > > > What I was curious is whether this issue also exists for the main 
> > > > migration
> > > > channel when with tls, especially when e.g. multifd not enabled at all. 
> > > >  As
> > > > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> > > > 
> > > > I think it's a good to find that we overlooked this before.. and IMHO 
> > > > it's
> > > > always good we could fix this.
> > > > 
> > > > Does it mean we need proper gnutls_bye() somewhere?
> > > > 
> > > > If we need an explicit gnutls_bye(), then I wonder if that should be 
> > > > done
> > > > on the main channel as well.
> > > 
> > > That's a good question and looking at the code qemu_loadvm_state_main() 
> > > exits
> > > on receiving "QEMU_VM_EOF" section (that's different from receiving 
> > > socket EOF)
> > > and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit 
> > > size
> > > in qemu_loadvm_state() - so still not until channel EOF.
> > 
> > I had a closer look, I do feel like such pre-mature termination is caused
> > by explicit shutdown()s of the iochannels, looks like that can cause issue
> > even after everything is sent.  Then I noticed indeed multifd sender
> > iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> > don't do that for the main channel.  Maybe that is a major difference.
> > 
> > Now I wonder whether we should shutdown() the channel at all if migration
> > succeeded, because looks like it can cause tls session to interrupt even if
> > the shutdown() is done after sent everything, and if so it'll explain why
> > you hit the issue with tls.
> > 
> > > 
> > > Then I can't see anything else reading the channel until it is closed in
> > > migration_incoming_state_destroy().
> > > 
> > > So most likely the main migration c

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Maciej S. Szmigiero

On 4.02.2025 16:00, Fabiano Rosas wrote:

"Maciej S. Szmigiero"  writes:


On 3.02.2025 23:56, Peter Xu wrote:

On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 21:20, Peter Xu wrote:

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.


If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.


Two pure questions..

  - What is the correct way to terminate the TLS session without this flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


  - Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.


What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.


That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.


I had a closer look, I do feel like such pre-mature termination is caused
by explicit shutdown()s of the iochannels, looks like that can cause issue
even after everything is sent.  Then I noticed indeed multifd sender
iochannels will get explicit shutdown()s since commit 077fbb5942, while we
don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because looks like it can cause tls session to interrupt even if
the shutdown() is done after sent everything, and if so it'll explain why
you hit the issue with tls.



Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.


If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?


So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?


If above theory can stand, then eof-okay could be a workaround papering
over the real problem that we shouldn't always shutdown()..

Could you have a look at below patch and see whether it can fix the problem
you hit too, in replace of these two patches (including the previous
iochannel change)?



Unfortunately, the patch below does not fix the problem:

qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was 
non-properly terminated.
qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was 
non-properly terminated.


I think that, even

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Daniel P . Berrangé
On Mon, Feb 03, 2025 at 03:20:50PM -0500, Peter Xu wrote:
> On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > On 3.02.2025 19:20, Peter Xu wrote:
> > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > From: "Maciej S. Szmigiero" 
> > > > 
> > > > Multifd send channels are terminated by calling
> > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > 
> > > > Unfortunately, this does not terminate the TLS session properly and
> > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > 
> > > > The only reason why this wasn't causing migration failures is because
> > > > the current migration code apparently does not check for migration
> > > > error being set after the end of the multifd receive process.
> > > > 
> > > > However, this will change soon so the multifd receive code has to be
> > > > prepared to not return an error on such premature TLS session EOF.
> > > > Use the newly introduced QIOChannelTLS method for that.
> > > > 
> > > > It's worth noting that even if the sender were to be changed to 
> > > > terminate
> > > > the TLS connection properly the receive side still needs to remain
> > > > compatible with older QEMU bit stream which does not do this.
> > > 
> > > If this is an existing bug, we could add a Fixes.
> > 
> > It is an existing issue but only uncovered by this patch set.
> > 
> > As far as I can see it was always there, so it would need some
> > thought where to point that Fixes tag.
> 
> If there's no way to trigger a real functional bug anyway, it's also ok we
> omit the Fixes.
> 
> > > Two pure questions..
> > > 
> > >- What is the correct way to terminate the TLS session without this 
> > > flag?
> > 
> > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > 
> > >- Why this is only needed by multifd sessions?
> > 
> > What uncovered the issue was switching the load threads to using
> > migrate_set_error() instead of their own result variable
> > (load_threads_ret) which you had requested during the previous
> > patch set version review:
> > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > 
> > Turns out that the multifd receive code always returned
> > error in the TLS case, just nothing was previously checking for
> > that error presence.
> 
> What I was curious is whether this issue also exists for the main migration
> channel when with tls, especially when e.g. multifd not enabled at all.  As
> I don't see anywhere that qemu uses gnutls_bye() for any tls session.

We've been lazy and avoided using gnutls_bye because it adds a
bunch more complexity to the shutdown sequence, and premature
shutdown from an malicious attack would be expected to cause
the QEMU protocol in the TLS channel to fail anyway.

> Does it mean we need proper gnutls_bye() somewhere?

Depends if the protocol we run over TLS can identify premature
termination itself, or whether it relies on TLS to identify
it.


With regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Fabiano Rosas
"Maciej S. Szmigiero"  writes:

> On 3.02.2025 23:56, Peter Xu wrote:
>> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>>> On 3.02.2025 21:20, Peter Xu wrote:
 On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 19:20, Peter Xu wrote:
>> On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>>> From: "Maciej S. Szmigiero" 
>>>
>>> Multifd send channels are terminated by calling
>>> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>>> multifd_send_terminate_threads(), which in the TLS case essentially
>>> calls shutdown(SHUT_RDWR) on the underlying raw socket.
>>>
>>> Unfortunately, this does not terminate the TLS session properly and
>>> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>>>
>>> The only reason why this wasn't causing migration failures is because
>>> the current migration code apparently does not check for migration
>>> error being set after the end of the multifd receive process.
>>>
>>> However, this will change soon so the multifd receive code has to be
>>> prepared to not return an error on such premature TLS session EOF.
>>> Use the newly introduced QIOChannelTLS method for that.
>>>
>>> It's worth noting that even if the sender were to be changed to 
>>> terminate
>>> the TLS connection properly the receive side still needs to remain
>>> compatible with older QEMU bit stream which does not do this.
>>
>> If this is an existing bug, we could add a Fixes.
>
> It is an existing issue but only uncovered by this patch set.
>
> As far as I can see it was always there, so it would need some
> thought where to point that Fixes tag.

 If there's no way to trigger a real functional bug anyway, it's also ok we
 omit the Fixes.

>> Two pure questions..
>>
>>  - What is the correct way to terminate the TLS session without this 
>> flag?
>
> I guess one would need to call gnutls_bye() like in this GnuTLS example:
> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>
>>  - Why this is only needed by multifd sessions?
>
> What uncovered the issue was switching the load threads to using
> migrate_set_error() instead of their own result variable
> (load_threads_ret) which you had requested during the previous
> patch set version review:
> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>
> Turns out that the multifd receive code always returned
> error in the TLS case, just nothing was previously checking for
> that error presence.

 What I was curious is whether this issue also exists for the main migration
 channel when with tls, especially when e.g. multifd not enabled at all.  As
 I don't see anywhere that qemu uses gnutls_bye() for any tls session.

 I think it's a good to find that we overlooked this before.. and IMHO it's
 always good we could fix this.

 Does it mean we need proper gnutls_bye() somewhere?

 If we need an explicit gnutls_bye(), then I wonder if that should be done
 on the main channel as well.
>>>
>>> That's a good question and looking at the code qemu_loadvm_state_main() 
>>> exits
>>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket 
>>> EOF)
>>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit 
>>> size
>>> in qemu_loadvm_state() - so still not until channel EOF.
>> 
>> I had a closer look, I do feel like such pre-mature termination is caused
>> by explicit shutdown()s of the iochannels, looks like that can cause issue
>> even after everything is sent.  Then I noticed indeed multifd sender
>> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
>> don't do that for the main channel.  Maybe that is a major difference.
>> 
>> Now I wonder whether we should shutdown() the channel at all if migration
>> succeeded, because looks like it can cause tls session to interrupt even if
>> the shutdown() is done after sent everything, and if so it'll explain why
>> you hit the issue with tls.
>> 
>>>
>>> Then I can't see anything else reading the channel until it is closed in
>>> migration_incoming_state_destroy().
>>>
>>> So most likely the main migration channel will never read far enough to
>>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>>>
 If we don't need gnutls_bye(), then should we always ignore pre-mature
 termination of tls no matter if it's multifd or non-multifd channel (or
 even a tls session that is not migration-related)?
>>>
>>> So basically have this patch extended to calling
>>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>> 
>> If above theory can stand, then eof-okay could be a workaround papering
>> over the re

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Maciej S. Szmigiero

On 3.02.2025 23:56, Peter Xu wrote:

On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 21:20, Peter Xu wrote:

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.


If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.


Two pure questions..

 - What is the correct way to terminate the TLS session without this flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


 - Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.


What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.


That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.


I had a closer look, I do feel like such pre-mature termination is caused
by explicit shutdown()s of the iochannels, looks like that can cause issue
even after everything is sent.  Then I noticed indeed multifd sender
iochannels will get explicit shutdown()s since commit 077fbb5942, while we
don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because looks like it can cause tls session to interrupt even if
the shutdown() is done after sent everything, and if so it'll explain why
you hit the issue with tls.



Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.


If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?


So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?


If above theory can stand, then eof-okay could be a workaround papering
over the real problem that we shouldn't always shutdown()..

Could you have a look at below patch and see whether it can fix the problem
you hit too, in replace of these two patches (including the previous
iochannel change)?



Unfortunately, the patch below does not fix the problem:

qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was 
non-properly terminated.
qemu-system-x86_64: Cannot read from TLS channel: The TLS connection was 
non-properly terminated.


I think that, even in the absence of shutdown(), if the sender does not
call gnutls_bye() the T

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-04 Thread Fabiano Rosas
Peter Xu  writes:

> On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
>> On 3.02.2025 21:20, Peter Xu wrote:
>> > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
>> > > On 3.02.2025 19:20, Peter Xu wrote:
>> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
>> > > > > From: "Maciej S. Szmigiero" 
>> > > > > 
>> > > > > Multifd send channels are terminated by calling
>> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
>> > > > > multifd_send_terminate_threads(), which in the TLS case essentially
>> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
>> > > > > 
>> > > > > Unfortunately, this does not terminate the TLS session properly and
>> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
>> > > > > 
>> > > > > The only reason why this wasn't causing migration failures is because
>> > > > > the current migration code apparently does not check for migration
>> > > > > error being set after the end of the multifd receive process.
>> > > > > 
>> > > > > However, this will change soon so the multifd receive code has to be
>> > > > > prepared to not return an error on such premature TLS session EOF.
>> > > > > Use the newly introduced QIOChannelTLS method for that.
>> > > > > 
>> > > > > It's worth noting that even if the sender were to be changed to 
>> > > > > terminate
>> > > > > the TLS connection properly the receive side still needs to remain
>> > > > > compatible with older QEMU bit stream which does not do this.
>> > > > 
>> > > > If this is an existing bug, we could add a Fixes.
>> > > 
>> > > It is an existing issue but only uncovered by this patch set.
>> > > 
>> > > As far as I can see it was always there, so it would need some
>> > > thought where to point that Fixes tag.
>> > 
>> > If there's no way to trigger a real functional bug anyway, it's also ok we
>> > omit the Fixes.
>> > 
>> > > > Two pure questions..
>> > > > 
>> > > > - What is the correct way to terminate the TLS session without 
>> > > > this flag?
>> > > 
>> > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
>> > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
>> > > 
>> > > > - Why this is only needed by multifd sessions?
>> > > 
>> > > What uncovered the issue was switching the load threads to using
>> > > migrate_set_error() instead of their own result variable
>> > > (load_threads_ret) which you had requested during the previous
>> > > patch set version review:
>> > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
>> > > 
>> > > Turns out that the multifd receive code always returned
>> > > error in the TLS case, just nothing was previously checking for
>> > > that error presence.
>> > 
>> > What I was curious is whether this issue also exists for the main migration
>> > channel when with tls, especially when e.g. multifd not enabled at all.  As
>> > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
>> > 
>> > I think it's a good to find that we overlooked this before.. and IMHO it's
>> > always good we could fix this.
>> > 
>> > Does it mean we need proper gnutls_bye() somewhere?
>> > 
>> > If we need an explicit gnutls_bye(), then I wonder if that should be done
>> > on the main channel as well.
>> 
>> That's a good question and looking at the code qemu_loadvm_state_main() exits
>> on receiving "QEMU_VM_EOF" section (that's different from receiving socket 
>> EOF)
>> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit 
>> size
>> in qemu_loadvm_state() - so still not until channel EOF.
>
> I had a closer look, I do feel like such pre-mature termination is caused
> by explicit shutdown()s of the iochannels, looks like that can cause issue
> even after everything is sent.  Then I noticed indeed multifd sender
> iochannels will get explicit shutdown()s since commit 077fbb5942, while we
> don't do that for the main channel.  Maybe that is a major difference.
>
> Now I wonder whether we should shutdown() the channel at all if migration
> succeeded, because looks like it can cause tls session to interrupt even if
> the shutdown() is done after sent everything, and if so it'll explain why
> you hit the issue with tls.
>
>> 
>> Then I can't see anything else reading the channel until it is closed in
>> migration_incoming_state_destroy().
>> 
>> So most likely the main migration channel will never read far enough to
>> reach that GNUTLS_E_PREMATURE_TERMINATION error.
>> 
>> > If we don't need gnutls_bye(), then should we always ignore pre-mature
>> > termination of tls no matter if it's multifd or non-multifd channel (or
>> > even a tls session that is not migration-related)?
>> 
>> So basically have this patch extended to calling
>> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?
>
> If above theory can stand, then eof-okay could be a wor

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-03 Thread Peter Xu
On Mon, Feb 03, 2025 at 10:41:32PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 21:20, Peter Xu wrote:
> > On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> > > On 3.02.2025 19:20, Peter Xu wrote:
> > > > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > > > From: "Maciej S. Szmigiero" 
> > > > > 
> > > > > Multifd send channels are terminated by calling
> > > > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > > > 
> > > > > Unfortunately, this does not terminate the TLS session properly and
> > > > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > > > 
> > > > > The only reason why this wasn't causing migration failures is because
> > > > > the current migration code apparently does not check for migration
> > > > > error being set after the end of the multifd receive process.
> > > > > 
> > > > > However, this will change soon so the multifd receive code has to be
> > > > > prepared to not return an error on such premature TLS session EOF.
> > > > > Use the newly introduced QIOChannelTLS method for that.
> > > > > 
> > > > > It's worth noting that even if the sender were to be changed to 
> > > > > terminate
> > > > > the TLS connection properly the receive side still needs to remain
> > > > > compatible with older QEMU bit stream which does not do this.
> > > > 
> > > > If this is an existing bug, we could add a Fixes.
> > > 
> > > It is an existing issue but only uncovered by this patch set.
> > > 
> > > As far as I can see it was always there, so it would need some
> > > thought where to point that Fixes tag.
> > 
> > If there's no way to trigger a real functional bug anyway, it's also ok we
> > omit the Fixes.
> > 
> > > > Two pure questions..
> > > > 
> > > > - What is the correct way to terminate the TLS session without this 
> > > > flag?
> > > 
> > > I guess one would need to call gnutls_bye() like in this GnuTLS example:
> > > https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> > > 
> > > > - Why this is only needed by multifd sessions?
> > > 
> > > What uncovered the issue was switching the load threads to using
> > > migrate_set_error() instead of their own result variable
> > > (load_threads_ret) which you had requested during the previous
> > > patch set version review:
> > > https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> > > 
> > > Turns out that the multifd receive code always returned
> > > error in the TLS case, just nothing was previously checking for
> > > that error presence.
> > 
> > What I was curious is whether this issue also exists for the main migration
> > channel when with tls, especially when e.g. multifd not enabled at all.  As
> > I don't see anywhere that qemu uses gnutls_bye() for any tls session.
> > 
> > I think it's a good to find that we overlooked this before.. and IMHO it's
> > always good we could fix this.
> > 
> > Does it mean we need proper gnutls_bye() somewhere?
> > 
> > If we need an explicit gnutls_bye(), then I wonder if that should be done
> > on the main channel as well.
> 
> That's a good question and looking at the code qemu_loadvm_state_main() exits
> on receiving "QEMU_VM_EOF" section (that's different from receiving socket 
> EOF)
> and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
> in qemu_loadvm_state() - so still not until channel EOF.

I had a closer look, I do feel like such pre-mature termination is caused
by explicit shutdown()s of the iochannels, looks like that can cause issue
even after everything is sent.  Then I noticed indeed multifd sender
iochannels will get explicit shutdown()s since commit 077fbb5942, while we
don't do that for the main channel.  Maybe that is a major difference.

Now I wonder whether we should shutdown() the channel at all if migration
succeeded, because looks like it can cause tls session to interrupt even if
the shutdown() is done after sent everything, and if so it'll explain why
you hit the issue with tls.

> 
> Then I can't see anything else reading the channel until it is closed in
> migration_incoming_state_destroy().
> 
> So most likely the main migration channel will never read far enough to
> reach that GNUTLS_E_PREMATURE_TERMINATION error.
> 
> > If we don't need gnutls_bye(), then should we always ignore pre-mature
> > termination of tls no matter if it's multifd or non-multifd channel (or
> > even a tls session that is not migration-related)?
> 
> So basically have this patch extended to calling
> qio_channel_tls_set_premature_eof_okay() also on the main migration channel?

If above theory can stand, then eof-okay could be a workaround papering
over the real problem that we shouldn't always shutdown()..

Could you have a look at below patch and see whether it ca

Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-03 Thread Maciej S. Szmigiero

On 3.02.2025 21:20, Peter Xu wrote:

On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.


If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.


Two pure questions..

- What is the correct way to terminate the TLS session without this flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


- Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.


What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.


That's a good question and looking at the code qemu_loadvm_state_main() exits
on receiving "QEMU_VM_EOF" section (that's different from receiving socket EOF)
and then optionally "QEMU_VM_VMDESCRIPTION" section is read with explicit size
in qemu_loadvm_state() - so still not until channel EOF.

Then I can't see anything else reading the channel until it is closed in
migration_incoming_state_destroy().

So most likely the main migration channel will never read far enough to
reach that GNUTLS_E_PREMATURE_TERMINATION error.


If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?


So basically have this patch extended to calling
qio_channel_tls_set_premature_eof_okay() also on the main migration channel?


Thanks,


Thanks,
Maciej




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-03 Thread Peter Xu
On Mon, Feb 03, 2025 at 07:53:00PM +0100, Maciej S. Szmigiero wrote:
> On 3.02.2025 19:20, Peter Xu wrote:
> > On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> > > From: "Maciej S. Szmigiero" 
> > > 
> > > Multifd send channels are terminated by calling
> > > qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> > > multifd_send_terminate_threads(), which in the TLS case essentially
> > > calls shutdown(SHUT_RDWR) on the underlying raw socket.
> > > 
> > > Unfortunately, this does not terminate the TLS session properly and
> > > the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> > > 
> > > The only reason why this wasn't causing migration failures is because
> > > the current migration code apparently does not check for migration
> > > error being set after the end of the multifd receive process.
> > > 
> > > However, this will change soon so the multifd receive code has to be
> > > prepared to not return an error on such premature TLS session EOF.
> > > Use the newly introduced QIOChannelTLS method for that.
> > > 
> > > It's worth noting that even if the sender were to be changed to terminate
> > > the TLS connection properly the receive side still needs to remain
> > > compatible with older QEMU bit stream which does not do this.
> > 
> > If this is an existing bug, we could add a Fixes.
> 
> It is an existing issue but only uncovered by this patch set.
> 
> As far as I can see it was always there, so it would need some
> thought where to point that Fixes tag.

If there's no way to trigger a real functional bug anyway, it's also ok we
omit the Fixes.

> > Two pure questions..
> > 
> >- What is the correct way to terminate the TLS session without this flag?
> 
> I guess one would need to call gnutls_bye() like in this GnuTLS example:
> https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102
> 
> >- Why this is only needed by multifd sessions?
> 
> What uncovered the issue was switching the load threads to using
> migrate_set_error() instead of their own result variable
> (load_threads_ret) which you had requested during the previous
> patch set version review:
> https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/
> 
> Turns out that the multifd receive code always returned
> error in the TLS case, just nothing was previously checking for
> that error presence.

What I was curious is whether this issue also exists for the main migration
channel when with tls, especially when e.g. multifd not enabled at all.  As
I don't see anywhere that qemu uses gnutls_bye() for any tls session.

I think it's a good to find that we overlooked this before.. and IMHO it's
always good we could fix this.

Does it mean we need proper gnutls_bye() somewhere?

If we need an explicit gnutls_bye(), then I wonder if that should be done
on the main channel as well.

If we don't need gnutls_bye(), then should we always ignore pre-mature
termination of tls no matter if it's multifd or non-multifd channel (or
even a tls session that is not migration-related)?

Thanks,

> 
> Another option would be to simply return to using
> load_threads_ret like the previous versions did and not
> experiment with touching global migration state because
> as we can see other places can unintentionally break.
> 
> If we go this route then these TLS EOF patches could be
> dropped.
> 
> > Thanks,
> > 
> 
> Thanks,
> Maciej
> 

-- 
Peter Xu




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-03 Thread Maciej S. Szmigiero

On 3.02.2025 19:20, Peter Xu wrote:

On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:

From: "Maciej S. Szmigiero" 

Multifd send channels are terminated by calling
qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
multifd_send_terminate_threads(), which in the TLS case essentially
calls shutdown(SHUT_RDWR) on the underlying raw socket.

Unfortunately, this does not terminate the TLS session properly and
the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.

The only reason why this wasn't causing migration failures is because
the current migration code apparently does not check for migration
error being set after the end of the multifd receive process.

However, this will change soon so the multifd receive code has to be
prepared to not return an error on such premature TLS session EOF.
Use the newly introduced QIOChannelTLS method for that.

It's worth noting that even if the sender were to be changed to terminate
the TLS connection properly the receive side still needs to remain
compatible with older QEMU bit stream which does not do this.


If this is an existing bug, we could add a Fixes.


It is an existing issue but only uncovered by this patch set.

As far as I can see it was always there, so it would need some
thought where to point that Fixes tag.
 

Two pure questions..

   - What is the correct way to terminate the TLS session without this flag?


I guess one would need to call gnutls_bye() like in this GnuTLS example:
https://gitlab.com/gnutls/gnutls/-/blob/2b8c3e4c71ad380bbbffb32e6003b34ecad596e3/doc/examples/ex-client-anon.c#L102


   - Why this is only needed by multifd sessions?


What uncovered the issue was switching the load threads to using
migrate_set_error() instead of their own result variable
(load_threads_ret) which you had requested during the previous
patch set version review:
https://lore.kernel.org/qemu-devel/Z1DbH5fwBaxtgrvH@x1n/

Turns out that the multifd receive code always returned
error in the TLS case, just nothing was previously checking for
that error presence.

Another option would be to simply return to using
load_threads_ret like the previous versions did and not
experiment with touching global migration state because
as we can see other places can unintentionally break.

If we go this route then these TLS EOF patches could be
dropped.


Thanks,



Thanks,
Maciej




Re: [PATCH v4 08/33] migration/multifd: Allow premature EOF on TLS incoming channels

2025-02-03 Thread Peter Xu
On Thu, Jan 30, 2025 at 11:08:29AM +0100, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero" 
> 
> Multifd send channels are terminated by calling
> qio_channel_shutdown(QIO_CHANNEL_SHUTDOWN_BOTH) in
> multifd_send_terminate_threads(), which in the TLS case essentially
> calls shutdown(SHUT_RDWR) on the underlying raw socket.
> 
> Unfortunately, this does not terminate the TLS session properly and
> the receive side sees this as a GNUTLS_E_PREMATURE_TERMINATION error.
> 
> The only reason why this wasn't causing migration failures is because
> the current migration code apparently does not check for migration
> error being set after the end of the multifd receive process.
> 
> However, this will change soon so the multifd receive code has to be
> prepared to not return an error on such premature TLS session EOF.
> Use the newly introduced QIOChannelTLS method for that.
> 
> It's worth noting that even if the sender were to be changed to terminate
> the TLS connection properly the receive side still needs to remain
> compatible with older QEMU bit stream which does not do this.

If this is an existing bug, we could add a Fixes.

Two pure questions..

  - What is the correct way to terminate the TLS session without this flag?

  - Why this is only needed by multifd sessions?

Thanks,

-- 
Peter Xu