Prasad Pandit <[email protected]> writes:

> Hello Fabiano,
>
> On Wed, 5 Mar 2025 at 19:26, Fabiano Rosas <[email protected]> wrote:
>> Note that none of this is out of the ordinary, you'll find such
>> discussions in any thread on this community. It may feel arbitrary to
>> you because that's tacit knowledge we gathered over the years.
>
> * I understand. I don't find it arbitrary.
>
>> We need an extra patch that reads:
>>
>>  migration: Refactor channel discovery mechanism
>>
>>  The various logical migration channels don't have a standardized way of
>>  advertising themselves and their connections may be seen out of order
>>  by the migration destination. When a new connection arrives, the
>>  incoming migration currently makes use of heuristics to determine which
>>  channel it belongs to.
>>
>>  The next few patches will need to change how the multifd and postcopy
>>  capabilities interact and that affects the channel discovery heuristic.
>>
>>  Refactor the channel discovery heuristic to make it less opaque and
>>  simplify the subsequent patches.
>>
>>  <some description of the new code which might be pertinent>
>>  ---
>>
>> You'd move all of the channel discovery code into this patch. Some of it
>> will be unreachable because multifd is not yet allowed with postcopy,
>> but that's fine. You can mention it in the commit message.
>
> Please see:
>     -> 
> https://privatebin.net/?dad6f052dd986f9f#FULnfrCV29NkQpvsQyvWuU4HdYjDwFbUPbDtvLro7mwi
>
> * Does this division look okay?
>

Yes.

>> About moving the code out of migration.c, it was a suggestion that
>> you're free to push back. Ideally, doing the work would be faster than
>> arguing against it on the mailing list. But that's fine.
>
> * Same here, I'm not against moving that code part to connection.c OR
> doing the work. My suggestion has been to do that movement in another
> series and not try to do everything in this one series.
>
>> About the hang in the test. It doesn't reproduce often, but once it
>> does, it hangs forever (although I haven't waited that long).
>
> * Okay, I'm not seeing it or able to reproduce it across 3 different
> machines. One is my laptop and the other 2 are servers wherein I'm
> testing migrations of guests with 64G/128G of RAM and guest dirtying
> memory to the tune of 68M/128M/256M bytes. I'll keep an eye on it if I
> find something.

Usually a loaded (or slow) machine is needed to reproduce multifd
synchronization issues. Sometimes running the test in a loop in parallel
with some other workload helps to uncover them. The CI also tends to
have slower machines that hit these problems.
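
As a rough illustration of the above, here is a minimal Python sketch of
running a test in a loop while a parallel workload keeps the CPUs busy.
Everything in it is an assumption for illustration: the helper name
run_until_hang, the busy-loop workload, and the timeout used to decide
that a run has hung. The test command is whatever you would normally run
by hand; it is passed in as an argument.

```python
import subprocess
import sys


def run_until_hang(cmd, iterations, timeout_s, n_burners):
    """Run cmd repeatedly while busy-loop processes load the machine.

    Hypothetical helper, not part of any QEMU tooling.
    Returns None if every iteration passed, otherwise a tuple
    (iteration, reason) where reason is "timeout" or "failed".
    """
    # Each burner is a separate interpreter spinning in a tight loop,
    # so it competes with the test for CPU time.
    burners = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(n_burners)
    ]
    try:
        for i in range(1, iterations + 1):
            try:
                # subprocess.run() kills the child if the timeout expires.
                result = subprocess.run(cmd, timeout=timeout_s)
            except subprocess.TimeoutExpired:
                return (i, "timeout")
            if result.returncode != 0:
                return (i, "failed")
        return None
    finally:
        # Always stop the background load, even on an early return.
        for b in burners:
            b.terminate()
        for b in burners:
            b.wait()
```

Calling it with cmd set to the usual test invocation, a few hundred
iterations, a timeout of a few minutes, and about one burner per core is
a reasonable starting point; the numbers are arbitrary and need tuning
for the machine.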
