Prasad Pandit <[email protected]> writes:

> Hello Fabiano,
>
> On Wed, 5 Mar 2025 at 19:26, Fabiano Rosas <[email protected]> wrote:
>> Note that none of this is out of the ordinary, you'll find such
>> discussions in any thread on this community. It may feel arbitrary to
>> you because that's tacit knowledge we gathered along the years.
>
> * I understand. I don't find it arbitrary.
>
>> We need an extra patch that reads:
>>
>>  migration: Refactor channel discovery mechanism
>>
>>  The various logical migration channels don't have a standardized way of
>>  advertising themselves and their connections may be seen out of order
>>  by the migration destination. When a new connection arrives, the
>>  incoming migration currently makes use of heuristics to determine which
>>  channel it belongs to.
>>
>>  The next few patches will need to change how the multifd and postcopy
>>  capabilities interact and that affects the channel discovery heuristic.
>>
>>  Refactor the channel discovery heuristic to make it less opaque and
>>  simplify the subsequent patches.
>>
>>  <some description of the new code which might be pertinent>
>>  ---
>>
>> You'd move all of the channel discovery code into this patch. Some of it
>> will be unreachable because multifd is not yet allowed with postcopy,
>> but that's fine. You can mention it on the commit message.
>
> Please see:
> ->
> https://privatebin.net/?dad6f052dd986f9f#FULnfrCV29NkQpvsQyvWuU4HdYjDwFbUPbDtvLro7mwi
>
> * Does this division look okay?
>
Yes.

>> About moving the code out of migration.c, it was a suggestion that
>> you're free to push back. Ideally, doing the work would be faster than
>> arguing against it on the mailing list. But that's fine.
>
> * Same here, I'm not against moving that code part to connection.c OR
> doing the work. My suggestion has been to do that movement in another
> series and not try to do everything in this one series.
>
>> About the hang in the test. It doesn't reproduce often, but once it
>> does, it hangs forever (although I haven't waited that long).
>
> * Okay, I'm not seeing it or able to reproduce it across 3 different
> machines. One is my laptop and the other 2 are servers wherein I'm
> testing migrations of guests with 64G/128G of RAM and guest dirtying
> memory to the tune of 68M/128M/256M bytes. I'll keep an eye on it if I
> find something.

Usually a loaded (or slow) machine is needed to reproduce multifd
synchronization issues. Sometimes running the test in a loop in parallel
with some other workload helps to uncover them. The CI also tends to
have slower machines that hit these problems.
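
As a rough illustration of the "loop plus background load" approach, here
is a minimal Python sketch. It is not part of the QEMU tree; the names
run_in_loop and BUSY_LOOP are made up for this example, and the test
command, iteration count, and timeout are whatever you choose to pass in.
A run that exceeds the timeout is treated as a hang.

```python
import subprocess
import sys

# Hypothetical stand-in for "some other workload": a child that spins on one core.
BUSY_LOOP = [sys.executable, "-c", "while True: pass"]


def run_in_loop(cmd, iterations, timeout_s, load_procs=0):
    """Run cmd repeatedly while load_procs busy-loop children compete for CPU.

    Returns the 1-based iteration that failed or timed out, or None if
    every iteration exited with status 0.
    """
    workers = [subprocess.Popen(BUSY_LOOP) for _ in range(load_procs)]
    try:
        for i in range(1, iterations + 1):
            try:
                # subprocess.run() kills the direct child if the timeout expires;
                # processes spawned by that child may still need manual cleanup.
                result = subprocess.run(cmd, timeout=timeout_s)
            except subprocess.TimeoutExpired:
                print(f"iteration {i}: no exit within {timeout_s}s, treating as a hang")
                return i
            if result.returncode != 0:
                print(f"iteration {i}: exit status {result.returncode}")
                return i
        return None
    finally:
        # Always stop the background load, even on failure or Ctrl-C.
        for w in workers:
            w.terminate()
            w.wait()
```

For example, run_in_loop(["./my-test", "--some-arg"], iterations=500,
timeout_s=120, load_procs=8) would stop at the first failing or stuck
iteration; the command shown there is a placeholder, not a real test
invocation. Using more busy-loop children than the machine has cores is
one way to approximate a slow CI runner.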
