On Mon, 4 Nov 2024 at 23:19, Peter Xu <[email protected]> wrote:
> Precopy keeps sending data even during postcopy, that's the background
> stream (with/without preempt feature enabled). May need some amendment
> when repost here.
* Okay.
> > + if (channel == CH_POSTCOPY) {
> > + return false;
> > + }
>
> Please see my other reply, I think here it should never be POSTCOPY
> channel, because postcopy (aka, preempt) channel is only created after the
> main channel.. So I wonder whether this "if" will hit at all.
* It does. migration_ioc_process_incoming() is called for each
incoming connection. And every time it calls
migration_should_start_incoming() to check if it should proceed with
the migration or should wait for further connections, because multifd
does not start until all its connections are established.
* Similarly when the Postcopy channel is initiated towards the end of
Precopy migration, migration_should_start_incoming() gets called, and
returns 'false' because we don't want to start a new incoming
migration at that time. Earlier it was receiving boolean value
'default_channel' as parameter, which was set to false while accepting
'postcopy' connection via => default_channel = !mis->from_src_file;
> > +
> > /* Multifd doesn't start unless all channels are established */
> > if (migrate_multifd()) {
> > - return migration_has_all_channels();
> > - }
> > -
> > - /* Preempt channel only starts when the main channel is created */
> > - if (migrate_postcopy_preempt()) {
> > - return main_channel;
> > + return multifd_recv_all_channels_created();
>
> I think this is incorrect.. We should also need to check main channel is
> established before start incoming. The old code uses
> migration_has_all_channels() which checks that.
* Okay, will try to fix it better. But calling
migration_has_all_channels() after checking migrate_multifd() does not
seem/read right.
> I don't yet see how you manage the multifd threads, etc, on both sides. Or
> any logic to make sure multifd will properly flush the pages before
> postcopy starts. IOW, any guarantee that all the pages will only be
> installed using UFFDIO_COPY as long as vcpu starts on dest. Any comments?
* There are no changes related to that. As this series only tries to
enable the multifd and postcopy options together.
> The most complicated part of this work would be testing, to make sure it
> works in all previous cases, and that's majorly why we disabled it before:
> it was because it randomly crashes, but not always; fundamentally it's
> because when multifd was designed there wasn't enough consideration on
> working together with postcopy. We didn't clearly know what's missing at
> that time.
* Right. I have tested this series with the following migration
commands to confirm that migration works as expected and there were no
crash(es) observed.
===
[Precopy]
# virsh migrate --verbose --live --auto-converge f39vm
qemu+ssh://<destination-host-name>/system
[Precopy + Multifd]
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 02 \
f39vm qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 04 \
f39vm qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 08 \
f39vm qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 16 \
f39vm qemu+ssh://<destination-host-name>/system
[Postcopy]
# virsh migrate --verbose --live --auto-converge \
--postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
[Postcopy + Multifd]
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 02 \
--postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 04 \
--postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 08 \
--postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
# virsh migrate --verbose --live --auto-converge --parallel
--parallel-connections 16 \
--postcopy --postcopy-after-precopy f39vm
qemu+ssh://<destination-host-name>/system
===
> So we would definitely need to add test cases, just like whoever adds new
> features to migration, to make sure at least it works for existing multifd
> / postcopy test cases, but when both features enabled.
...
> It will add quite a few tests to run here, but I don't see a good way
> otherwise when we want to enable the two features, because it is indeed a
> challenge to enable the two major features together here.
>
* Thank you for the hints about the tests, will surely look into them
and try to learn about how to add new test cases.
Thank you.
---
- Prasad