Peter Xu <[email protected]> writes:

> On Wed, Sep 17, 2025 at 05:52:54PM -0300, Fabiano Rosas wrote:
>> Peter Xu <[email protected]> writes:
>> 
>> > We set CANCELLED very late, it means migration_has_failed() may not work
>> > correctly if it's invoked before updating CANCELLING to CANCELLED.
>> >
>> 
>> The prophecy is fulfilled.
>> 
>> https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
>> 
>> I'm not sure I'm convinced, for instance, CANCELLING is part of
>> migration_is_running(), while FAILED is not. This doesn't seem
>> right. Another point is that CANCELLING is not a final state, so we're
>> prone to later need a migration_has_finished_failing_now() helper. =)
>
> Considering we only have two users so far, and the other user doesn't care
> about CANCELLING (while the multifd shutdown cares?), then I assume it's ok
> to treat CANCELLING to be "has failed"? :)  I didn't try to interpret "has
> failed" in English, but only for the sake of a universal helper that works
> for both places.
>
> Or maybe it can be is_failing() too?  I don't have a strong feeling.
>

I'm not nitpicking on language. I'm pointing out that CANCELLING is a
transitory state, i.e. from migrate_cancel() until migrate_cleanup(),
while FAILED is a terminal state, nothing happens after it.

But fine, I guess it's really only *my* assumptions being broken and not
the ones in the code.
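
To make the distinction concrete, here is a minimal standalone sketch. The
enum and all sketch_* names are hypothetical and only model the states
discussed in this thread; they are not QEMU's real MigrationStatus or
helpers. It shows that with the patched check CANCELLING counts as "has
failed" while still not being a terminal state, which is the gap a
"finished failing" helper would have to cover:

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down state set for illustration only. */
typedef enum {
    SKETCH_ACTIVE,
    SKETCH_CANCELLING,  /* transitory: cancel requested, cleanup not done */
    SKETCH_CANCELLED,   /* terminal */
    SKETCH_FAILED,      /* terminal */
    SKETCH_COMPLETED,   /* terminal */
} SketchState;

/* Terminal states: nothing further happens once one is reached. */
static bool sketch_is_terminal(SketchState s)
{
    return s == SKETCH_CANCELLED ||
           s == SKETCH_FAILED ||
           s == SKETCH_COMPLETED;
}

/* Same shape as the patched check: CANCELLING is treated as failed. */
static bool sketch_has_failed(SketchState s)
{
    return s == SKETCH_CANCELLING ||
           s == SKETCH_CANCELLED ||
           s == SKETCH_FAILED;
}

/* The extra helper a caller would need if it cares about finality. */
static bool sketch_has_finished_failing(SketchState s)
{
    return sketch_has_failed(s) && sketch_is_terminal(s);
}
```

A caller that only wants to silence errors during teardown can use the
first check; one that needs everything to have settled would need the
last one.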

>> 
>> My mental model is that CANCELLING is a transitional, ongoing state
>> where we shouldn't really be making assumptions. Once FAILED is reached,
>> then we're sure in which general state everything is.
>> 
>> How did you catch this? It was one of the cancel tests that failed? I
>> just noticed that multifd_send_shutdown() is called from
>> migration_cleanup() before it changes the state to CANCELLED. So current
>> code also has whatever issue you detected here.
>
> No test failed, it was only by code observation, mentioned below [1],
> exactly as you said.
>
> I just think when cancelling the tls sessions, we shouldn't dump the error
> messages anymore even if the bye failed.

Ok

> Or maybe we simply do not need to
> invoke migration_tls_channel_end() when CANCELLING / FAILED?  That's
> relevant to your ask on the cover letter, we can discuss there.
>
> This is very trivial.

Nah, let me review the patch properly, please.

> Let me know what you think.  I can also drop this
> patch when reposting v3 but fix the postcopy warning first, which reliably
> reproduces now with qtest.
>
>> 
>> > Allowing that state will make migration_has_failed() work as expected even
>> > if it's invoked slightly earlier.
>> >
>> > One current user is the multifd code for the TLS graceful termination,
>> > where it's before updating to CANCELLED.
>
> [1]
>
>> >
>> > Signed-off-by: Peter Xu <[email protected]>
>> > ---
>> >  migration/migration.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index 7015c2b5e0..397917b1b3 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>> >  
>> >  bool migration_has_failed(MigrationState *s)
>> >  {
>> > -    return (s->state == MIGRATION_STATUS_CANCELLED ||
>> > +    return (s->state == MIGRATION_STATUS_CANCELLING ||
>> > +            s->state == MIGRATION_STATUS_CANCELLED ||
>> >              s->state == MIGRATION_STATUS_FAILED);
>> >  }
>> 
