Prasad Pandit <[email protected]> writes:

> On Tue, 13 Jan 2026 at 01:15, Fabiano Rosas <[email protected]> wrote:
>> There are failures that happen _because_ we cancelled. As I've mentioned
>> elsewhere, the cancellation is not communicated to all threads running
>> migration code; some code paths will simply fail as a result of
>> migration_cancel(). We need cancelling to work even in a thread that may
>> be stuck (such as one blocked in recv() on the return path), but this
>> means we end up calling qemu_file_shutdown() indiscriminately.
>> In these cases, parts of the code would set FAILED, but that failure is
>> a result of cancelling. We've determined that migrate-cancel should
>> always lead to CANCELLED and a new migration should always be possible.
>
> * I see.
>
>> This is ok, call it an error and done.
>>
>> > OTOH, if we cancel while processing an error/failure, end user
>> > may not see that error because we report - migration was cancelled.
>> >
>>
>> This is interesting. I _think_ it wouldn't be possible to cancel while
>> handling an error, because the BQL is held, so migrate-cancel couldn't be
>> issued while migration_cleanup is ongoing. However, I don't think we've ever
>> tested this scenario in particular. Maybe you could try to catch
>> something by modifying the /migration/cancel tests, if you're willing.
>
> * I have made a note of looking at it at a later time.
>
>> Aside from the QAPI states, there are some internal states we already
>> track with separate flags, e.g.:
>>
>> rp_thread_created, start_postcopy, migration_thread_running,
>> switchover_acked, postcopy_package_loaded, fault_thread_quit,
>> preempt_thread_status, load_threads_abort.
>>
>> A bit array could maybe cover all of these and more.
>>
>> ---
>>
>> Could you send a PoC patch with your idea for fixing this FAILED bug? We'd
>> need a trigger for migrate, set_caps, etc., and the failed event.
>>
>> If that new patch doesn't get consensus then we merge this one and work
>> on a new design as time permits.
>
> * Considering the wider scope above, I think it is best to first fix
> the issue at hand and then move on to this new change. For now I'll try
> to rebase my current patch on your v3 "cleanup early connection code"
> series. Once that is through, I'll take up the states change patch.
> Hope that's okay.
>

Ok, go ahead.

> Thank you.
> ---
>   - Prasad
