Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-28 Thread Avihai Horon



On 28/02/2024 12:17, Peter Xu wrote:

External email: Use caution opening links or attachments


On Wed, Feb 28, 2024 at 11:39:52AM +0200, Avihai Horon wrote:

On 28/02/2024 5:04, Peter Xu wrote:

External email: Use caution opening links or attachments


On Wed, Feb 28, 2024 at 02:00:26AM +0200, Avihai Horon wrote:

On 27/02/2024 9:41, Peter Xu wrote:

External email: Use caution opening links or attachments


On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:

Currently, migration code serializes device data sending during pre-copy
iterative phase. As noted in the code comment, this is done to prevent
faster changing device from sending its data over and over.

Frankly speaking I don't understand the rational behind 90697be889 ("live
migration: Serialize vmstate saving in stage 2").  I don't even think I
noticed this logic before even if I worked on migration for a few years...

I was thinking all devices should always get its chance to run for some
period during iterations.  Do you know the reasoning behind?  And I must
confess I also know little on block migration, which seems to be relevant
to this change.  Anyway, I also copy Jan just in case he'll be able to chim
in.

I am not 100% sure either, but I can guess:
This commit is pretty old (dates to 2009), so maybe back then the only
iterative devices were block devices and RAM.
Block devices didn't change as fast as RAM (and were probably much bigger
than RAM), so it made sense to send them first and only then send RAM, which
changed constantly.

Makes sense.  For some reason I read it the other way round previously.


If there is a fast changing device, even if we don't proceed with other
device iterators and we stick with the current one, assuming it can finally
finish dumping all data, but then we'll proceed with other devices and the
fast changing device can again accumulate dirty information?

I guess this logic only makes sense for the case where we only have block
devices and a RAM device, because the block devices wouldn't change that
much?


However, with switchover-ack capability enabled, this behavior can be
problematic and may prevent migration from converging. The problem lies
in the fact that an earlier device may never finish sending its data and
thus block other devices from sending theirs.

Yes, this is a problem.


This bug was observed in several VFIO migration scenarios where some
workload on the VM prevented RAM from ever reaching a hard zero, not
allowing VFIO initial pre-copy data to be sent, and thus destination
could not ack switchover. Note that the same scenario, but without
switchover-ack, would converge.

Fix it by not serializing device data sending during pre-copy iterative
phase if switchover was not acked yet.

I am still not fully convinced that it's even legal that one device can
consume all iterator's bandwidth, ignoring the rest..  Though again it's
not about this patch, but about commit 90697be889.

Yes, I agree. As I wrote above, maybe this was valid back then when the only
iterative devices were block and RAM.


I'm thinking whether we should allow each device to have its own portion of
chance to push data for each call to qemu_savevm_state_iterate(),
irrelevant of vfio's switchover-ack capability.

I wasn't sure for 100% why we have this logic in the first place, so I wrote
my patch as little invasive as possible, keeping migration behavior as is
(except for switchover-ack).
But I tend to agree with you for three reasons:

1. I think block migration is deprecated (see commits 66db46ca83b8,
40101f320d6b and 8846b5bfca76).
Instead, users are instructed to use blockdev-mirror and co. [1]. If I'm not
mistaken, this operates using a different infrastructure than migration.
So this logic is not relevant anymore.

2. As you pointed out earlier, the fast changing device can accumulate dirty
data over and over. VFIO devices come after RAM, so this logic doesn't
achieve its goal in this case (we may sync fast changing RAM over and over).

3. My fix in this patch won't solve a similar problem that could happen,
where a VFIO device with a lot of pre-copy data (not necessarily initial
data) may never be able to send it, thus not realizing the full potential of
pre-copy for VFIO.
(I personally have not encountered this problem yet, but maybe it can happen
with a vGPU).

Thanks for a summary.


If you agree, I can send a v2 that simply removes this logic and gives every
device an equal chance to send its data (like Joao showed) with some
explanation why we do it.

Let's see whether others have an opinion, but to me I think we can give it
a shot.  In that case we can "break" in the previous "ret < 0" check
already.

Sure.
So I will wait some more and may send v2 next week, if no special feedback
is received.


One more thing to mention is then I think we need to calculate the case of
"all iterators returned 1" (aka, "all completes") scenario.  With the old
check it is guaranteed if the loop iterates over all iter

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-28 Thread Peter Xu
On Wed, Feb 28, 2024 at 11:39:52AM +0200, Avihai Horon wrote:
> 
> On 28/02/2024 5:04, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Wed, Feb 28, 2024 at 02:00:26AM +0200, Avihai Horon wrote:
> > > On 27/02/2024 9:41, Peter Xu wrote:
> > > > External email: Use caution opening links or attachments
> > > > 
> > > > 
> > > > On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:
> > > > > Currently, migration code serializes device data sending during 
> > > > > pre-copy
> > > > > iterative phase. As noted in the code comment, this is done to prevent
> > > > > faster changing device from sending its data over and over.
> > > > Frankly speaking I don't understand the rational behind 90697be889 
> > > > ("live
> > > > migration: Serialize vmstate saving in stage 2").  I don't even think I
> > > > noticed this logic before even if I worked on migration for a few 
> > > > years...
> > > > 
> > > > I was thinking all devices should always get its chance to run for some
> > > > period during iterations.  Do you know the reasoning behind?  And I must
> > > > confess I also know little on block migration, which seems to be 
> > > > relevant
> > > > to this change.  Anyway, I also copy Jan just in case he'll be able to 
> > > > chim
> > > > in.
> > > I am not 100% sure either, but I can guess:
> > > This commit is pretty old (dates to 2009), so maybe back then the only
> > > iterative devices were block devices and RAM.
> > > Block devices didn't change as fast as RAM (and were probably much bigger
> > > than RAM), so it made sense to send them first and only then send RAM, 
> > > which
> > > changed constantly.
> > Makes sense.  For some reason I read it the other way round previously.
> > 
> > > > If there is a fast changing device, even if we don't proceed with other
> > > > device iterators and we stick with the current one, assuming it can 
> > > > finally
> > > > finish dumping all data, but then we'll proceed with other devices and 
> > > > the
> > > > fast changing device can again accumulate dirty information?
> > > I guess this logic only makes sense for the case where we only have block
> > > devices and a RAM device, because the block devices wouldn't change that
> > > much?
> > > 
> > > > > However, with switchover-ack capability enabled, this behavior can be
> > > > > problematic and may prevent migration from converging. The problem 
> > > > > lies
> > > > > in the fact that an earlier device may never finish sending its data 
> > > > > and
> > > > > thus block other devices from sending theirs.
> > > > Yes, this is a problem.
> > > > 
> > > > > This bug was observed in several VFIO migration scenarios where some
> > > > > workload on the VM prevented RAM from ever reaching a hard zero, not
> > > > > allowing VFIO initial pre-copy data to be sent, and thus destination
> > > > > could not ack switchover. Note that the same scenario, but without
> > > > > switchover-ack, would converge.
> > > > > 
> > > > > Fix it by not serializing device data sending during pre-copy 
> > > > > iterative
> > > > > phase if switchover was not acked yet.
> > > > I am still not fully convinced that it's even legal that one device can
> > > > consume all iterator's bandwidth, ignoring the rest..  Though again it's
> > > > not about this patch, but about commit 90697be889.
> > > Yes, I agree. As I wrote above, maybe this was valid back then when the 
> > > only
> > > iterative devices were block and RAM.
> > > 
> > > > I'm thinking whether we should allow each device to have its own 
> > > > portion of
> > > > chance to push data for each call to qemu_savevm_state_iterate(),
> > > > irrelevant of vfio's switchover-ack capability.
> > > I wasn't sure for 100% why we have this logic in the first place, so I 
> > > wrote
> > > my patch as little invasive as possible, keeping migration behavior as is
> > > (except for switchover-ack).
> > > But I tend to agree with you for three reasons:
> > > 
> > > 1. I think block migration is deprecated (see commits 66db46ca83b8,
> > > 40101f320d6b and 8846b5bfca76).
> > > Instead, users are instructed to use blockdev-mirror and co. [1]. If I'm 
> > > not
> > > mistaken, this operates using a different infrastructure than migration.
> > > So this logic is not relevant anymore.
> > > 
> > > 2. As you pointed out earlier, the fast changing device can accumulate 
> > > dirty
> > > data over and over. VFIO devices come after RAM, so this logic doesn't
> > > achieve its goal in this case (we may sync fast changing RAM over and 
> > > over).
> > > 
> > > 3. My fix in this patch won't solve a similar problem that could happen,
> > > where a VFIO device with a lot of pre-copy data (not necessarily initial
> > > data) may never be able to send it, thus not realizing the full potential 
> > > of
> > > pre-copy for VFIO.
> > > (I personally have not encountered this problem yet, but maybe it can 
> > > happen
> > > with a vGPU).
> > Thanks for a su

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-28 Thread Avihai Horon



On 27/02/2024 5:16, Wang, Lei wrote:

External email: Use caution opening links or attachments


On 2/22/2024 23:56, Avihai Horon wrote:

Currently, migration code serializes device data sending during pre-copy
iterative phase. As noted in the code comment, this is done to prevent
faster changing device from sending its data over and over.

However, with switchover-ack capability enabled, this behavior can be
problematic and may prevent migration from converging. The problem lies
in the fact that an earlier device may never finish sending its data and
thus block other devices from sending theirs.

This bug was observed in several VFIO migration scenarios where some
workload on the VM prevented RAM from ever reaching a hard zero, not
allowing VFIO initial pre-copy data to be sent, and thus destination
could not ack switchover. Note that the same scenario, but without
switchover-ack, would converge.

Fix it by not serializing device data sending during pre-copy iterative
phase if switchover was not acked yet.

Hi Avihai,

Can this bug be solved by ordering the priority of different device's handlers?


Hi Lei,

Could be, but this would probably be more complicated to do.

Anyway, it looks like this serialization behavior is no longer relevant 
or valid, so we will go in the direction of removing it completely 
(please see other discussion with Peter).

But if you have some reason why this shouldn't be the way, please do tell.

Thanks.




Fixes: 1b4adb10f898 ("migration: Implement switchover ack logic")
Signed-off-by: Avihai Horon 
---
  migration/savevm.h|  2 +-
  migration/migration.c |  4 ++--
  migration/savevm.c| 22 +++---
  3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/migration/savevm.h b/migration/savevm.h
index 74669733dd6..d4a368b522b 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -36,7 +36,7 @@ void qemu_savevm_state_setup(QEMUFile *f);
  bool qemu_savevm_state_guest_unplug_pending(void);
  int qemu_savevm_state_resume_prepare(MigrationState *s);
  void qemu_savevm_state_header(QEMUFile *f);
-int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
+int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool can_switchover);
  void qemu_savevm_state_cleanup(void);
  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
  int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
diff --git a/migration/migration.c b/migration/migration.c
index ab21de2cadb..d8bfe1fb1b9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3133,7 +3133,7 @@ static MigIterateState 
migration_iteration_run(MigrationState *s)
  }

  /* Just another iteration step */
-qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
+qemu_savevm_state_iterate(s->to_dst_file, in_postcopy, can_switchover);
  return MIG_ITERATE_RESUME;
  }

@@ -3216,7 +3216,7 @@ static MigIterateState 
bg_migration_iteration_run(MigrationState *s)
  {
  int res;

-res = qemu_savevm_state_iterate(s->to_dst_file, false);
+res = qemu_savevm_state_iterate(s->to_dst_file, false, true);
  if (res > 0) {
  bg_migration_completion(s);
  return MIG_ITERATE_BREAK;
diff --git a/migration/savevm.c b/migration/savevm.c
index d612c8a9020..3a012796375 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1386,7 +1386,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
   *   0 : We haven't finished, caller have to go again
   *   1 : We have finished, we can go to complete phase
   */
-int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
+int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool can_switchover)
  {
  SaveStateEntry *se;
  int ret = 1;
@@ -1430,12 +1430,20 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool 
postcopy)
   "%d(%s): %d",
   se->section_id, se->idstr, ret);
  qemu_file_set_error(f, ret);
+return ret;
  }
-if (ret <= 0) {
-/* Do not proceed to the next vmstate before this one reported
-   completion of the current stage. This serializes the migration
-   and reduces the probability that a faster changing state is
-   synchronized over and over again. */
+
+if (ret == 0 && can_switchover) {
+/*
+ * Do not proceed to the next vmstate before this one reported
+ * completion of the current stage. This serializes the migration
+ * and reduces the probability that a faster changing state is
+ * synchronized over and over again.
+ * Do it only if migration can switchover. If migration can't
+ * switchover yet, do proceed to let other devices send their data
+ * too, as this may be required for switchover to be acked and
+ * migration to converge.
+ */
  break;
  }
  }
@@ -1724,7 +1732,7 @@

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-28 Thread Avihai Horon



On 28/02/2024 5:04, Peter Xu wrote:

External email: Use caution opening links or attachments


On Wed, Feb 28, 2024 at 02:00:26AM +0200, Avihai Horon wrote:

On 27/02/2024 9:41, Peter Xu wrote:

External email: Use caution opening links or attachments


On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:

Currently, migration code serializes device data sending during pre-copy
iterative phase. As noted in the code comment, this is done to prevent
faster changing device from sending its data over and over.

Frankly speaking I don't understand the rational behind 90697be889 ("live
migration: Serialize vmstate saving in stage 2").  I don't even think I
noticed this logic before even if I worked on migration for a few years...

I was thinking all devices should always get its chance to run for some
period during iterations.  Do you know the reasoning behind?  And I must
confess I also know little on block migration, which seems to be relevant
to this change.  Anyway, I also copy Jan just in case he'll be able to chim
in.

I am not 100% sure either, but I can guess:
This commit is pretty old (dates to 2009), so maybe back then the only
iterative devices were block devices and RAM.
Block devices didn't change as fast as RAM (and were probably much bigger
than RAM), so it made sense to send them first and only then send RAM, which
changed constantly.

Makes sense.  For some reason I read it the other way round previously.


If there is a fast changing device, even if we don't proceed with other
device iterators and we stick with the current one, assuming it can finally
finish dumping all data, but then we'll proceed with other devices and the
fast changing device can again accumulate dirty information?

I guess this logic only makes sense for the case where we only have block
devices and a RAM device, because the block devices wouldn't change that
much?


However, with switchover-ack capability enabled, this behavior can be
problematic and may prevent migration from converging. The problem lies
in the fact that an earlier device may never finish sending its data and
thus block other devices from sending theirs.

Yes, this is a problem.


This bug was observed in several VFIO migration scenarios where some
workload on the VM prevented RAM from ever reaching a hard zero, not
allowing VFIO initial pre-copy data to be sent, and thus destination
could not ack switchover. Note that the same scenario, but without
switchover-ack, would converge.

Fix it by not serializing device data sending during pre-copy iterative
phase if switchover was not acked yet.

I am still not fully convinced that it's even legal that one device can
consume all iterator's bandwidth, ignoring the rest..  Though again it's
not about this patch, but about commit 90697be889.

Yes, I agree. As I wrote above, maybe this was valid back then when the only
iterative devices were block and RAM.


I'm thinking whether we should allow each device to have its own portion of
chance to push data for each call to qemu_savevm_state_iterate(),
irrelevant of vfio's switchover-ack capability.

I wasn't sure for 100% why we have this logic in the first place, so I wrote
my patch as little invasive as possible, keeping migration behavior as is
(except for switchover-ack).
But I tend to agree with you for three reasons:

1. I think block migration is deprecated (see commits 66db46ca83b8,
40101f320d6b and 8846b5bfca76).
Instead, users are instructed to use blockdev-mirror and co. [1]. If I'm not
mistaken, this operates using a different infrastructure than migration.
So this logic is not relevant anymore.

2. As you pointed out earlier, the fast changing device can accumulate dirty
data over and over. VFIO devices come after RAM, so this logic doesn't
achieve its goal in this case (we may sync fast changing RAM over and over).

3. My fix in this patch won't solve a similar problem that could happen,
where a VFIO device with a lot of pre-copy data (not necessarily initial
data) may never be able to send it, thus not realizing the full potential of
pre-copy for VFIO.
(I personally have not encountered this problem yet, but maybe it can happen
with a vGPU).

Thanks for a summary.



If you agree, I can send a v2 that simply removes this logic and gives every
device an equal chance to send its data (like Joao showed) with some
explanation why we do it.

Let's see whether others have an opinion, but to me I think we can give it
a shot.  In that case we can "break" in the previous "ret < 0" check
already.


Sure.
So I will wait some more and may send v2 next week, if no special 
feedback is received.




One more thing to mention is then I think we need to calculate the case of
"all iterators returned 1" (aka, "all completes") scenario.  With the old
check it is guaranteed if the loop iterates over all iterators then all
iterators have completed.  Now we allow ret==0 to keep iterating, then the
check needs to be done explicitly.

In general, it could be somet

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-27 Thread Peter Xu
On Wed, Feb 28, 2024 at 02:00:26AM +0200, Avihai Horon wrote:
> 
> On 27/02/2024 9:41, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:
> > > Currently, migration code serializes device data sending during pre-copy
> > > iterative phase. As noted in the code comment, this is done to prevent
> > > faster changing device from sending its data over and over.
> > Frankly speaking I don't understand the rational behind 90697be889 ("live
> > migration: Serialize vmstate saving in stage 2").  I don't even think I
> > noticed this logic before even if I worked on migration for a few years...
> > 
> > I was thinking all devices should always get its chance to run for some
> > period during iterations.  Do you know the reasoning behind?  And I must
> > confess I also know little on block migration, which seems to be relevant
> > to this change.  Anyway, I also copy Jan just in case he'll be able to chim
> > in.
> 
> I am not 100% sure either, but I can guess:
> This commit is pretty old (dates to 2009), so maybe back then the only
> iterative devices were block devices and RAM.
> Block devices didn't change as fast as RAM (and were probably much bigger
> than RAM), so it made sense to send them first and only then send RAM, which
> changed constantly.

Makes sense.  For some reason I read it the other way round previously.

> 
> > 
> > If there is a fast changing device, even if we don't proceed with other
> > device iterators and we stick with the current one, assuming it can finally
> > finish dumping all data, but then we'll proceed with other devices and the
> > fast changing device can again accumulate dirty information?
> 
> I guess this logic only makes sense for the case where we only have block
> devices and a RAM device, because the block devices wouldn't change that
> much?
> 
> > 
> > > However, with switchover-ack capability enabled, this behavior can be
> > > problematic and may prevent migration from converging. The problem lies
> > > in the fact that an earlier device may never finish sending its data and
> > > thus block other devices from sending theirs.
> > Yes, this is a problem.
> > 
> > > This bug was observed in several VFIO migration scenarios where some
> > > workload on the VM prevented RAM from ever reaching a hard zero, not
> > > allowing VFIO initial pre-copy data to be sent, and thus destination
> > > could not ack switchover. Note that the same scenario, but without
> > > switchover-ack, would converge.
> > > 
> > > Fix it by not serializing device data sending during pre-copy iterative
> > > phase if switchover was not acked yet.
> > I am still not fully convinced that it's even legal that one device can
> > consume all iterator's bandwidth, ignoring the rest..  Though again it's
> > not about this patch, but about commit 90697be889.
> 
> Yes, I agree. As I wrote above, maybe this was valid back then when the only
> iterative devices were block and RAM.
> 
> > 
> > I'm thinking whether we should allow each device to have its own portion of
> > chance to push data for each call to qemu_savevm_state_iterate(),
> > irrelevant of vfio's switchover-ack capability.
> 
> I wasn't sure for 100% why we have this logic in the first place, so I wrote
> my patch as little invasive as possible, keeping migration behavior as is
> (except for switchover-ack).
> But I tend to agree with you for three reasons:
> 
> 1. I think block migration is deprecated (see commits 66db46ca83b8,
> 40101f320d6b and 8846b5bfca76).
> Instead, users are instructed to use blockdev-mirror and co. [1]. If I'm not
> mistaken, this operates using a different infrastructure than migration.
> So this logic is not relevant anymore.
> 
> 2. As you pointed out earlier, the fast changing device can accumulate dirty
> data over and over. VFIO devices come after RAM, so this logic doesn't
> achieve its goal in this case (we may sync fast changing RAM over and over).
> 
> 3. My fix in this patch won't solve a similar problem that could happen,
> where a VFIO device with a lot of pre-copy data (not necessarily initial
> data) may never be able to send it, thus not realizing the full potential of
> pre-copy for VFIO.
> (I personally have not encountered this problem yet, but maybe it can happen
> with a vGPU).

Thanks for a summary.

> 
> 
> If you agree, I can send a v2 that simply removes this logic and gives every
> device an equal chance to send its data (like Joao showed) with some
> explanation why we do it.

Let's see whether others have an opinion, but to me I think we can give it
a shot.  In that case we can "break" in the previous "ret < 0" check
already.

One more thing to mention is then I think we need to calculate the case of
"all iterators returned 1" (aka, "all completes") scenario.  With the old
check it is guaranteed if the loop iterates over all iterators then all
iterators have completed.  Now we allow ret==0 to 

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-27 Thread Avihai Horon



On 27/02/2024 9:41, Peter Xu wrote:

External email: Use caution opening links or attachments


On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:

Currently, migration code serializes device data sending during pre-copy
iterative phase. As noted in the code comment, this is done to prevent
faster changing device from sending its data over and over.

Frankly speaking I don't understand the rational behind 90697be889 ("live
migration: Serialize vmstate saving in stage 2").  I don't even think I
noticed this logic before even if I worked on migration for a few years...

I was thinking all devices should always get its chance to run for some
period during iterations.  Do you know the reasoning behind?  And I must
confess I also know little on block migration, which seems to be relevant
to this change.  Anyway, I also copy Jan just in case he'll be able to chim
in.


I am not 100% sure either, but I can guess:
This commit is pretty old (dates to 2009), so maybe back then the only 
iterative devices were block devices and RAM.
Block devices didn't change as fast as RAM (and were probably much 
bigger than RAM), so it made sense to send them first and only then send 
RAM, which changed constantly.




If there is a fast changing device, even if we don't proceed with other
device iterators and we stick with the current one, assuming it can finally
finish dumping all data, but then we'll proceed with other devices and the
fast changing device can again accumulate dirty information?


I guess this logic only makes sense for the case where we only have 
block devices and a RAM device, because the block devices wouldn't 
change that much?





However, with switchover-ack capability enabled, this behavior can be
problematic and may prevent migration from converging. The problem lies
in the fact that an earlier device may never finish sending its data and
thus block other devices from sending theirs.

Yes, this is a problem.


This bug was observed in several VFIO migration scenarios where some
workload on the VM prevented RAM from ever reaching a hard zero, not
allowing VFIO initial pre-copy data to be sent, and thus destination
could not ack switchover. Note that the same scenario, but without
switchover-ack, would converge.

Fix it by not serializing device data sending during pre-copy iterative
phase if switchover was not acked yet.

I am still not fully convinced that it's even legal that one device can
consume all iterator's bandwidth, ignoring the rest..  Though again it's
not about this patch, but about commit 90697be889.


Yes, I agree. As I wrote above, maybe this was valid back then when the 
only iterative devices were block and RAM.




I'm thinking whether we should allow each device to have its own portion of
chance to push data for each call to qemu_savevm_state_iterate(),
irrelevant of vfio's switchover-ack capability.


I wasn't sure for 100% why we have this logic in the first place, so I 
wrote my patch as little invasive as possible, keeping migration 
behavior as is (except for switchover-ack).

But I tend to agree with you for three reasons:

1. I think block migration is deprecated (see commits 66db46ca83b8, 
40101f320d6b and 8846b5bfca76).
Instead, users are instructed to use blockdev-mirror and co. [1]. If I'm 
not mistaken, this operates using a different infrastructure than migration.

So this logic is not relevant anymore.

2. As you pointed out earlier, the fast changing device can accumulate 
dirty data over and over. VFIO devices come after RAM, so this logic 
doesn't achieve its goal in this case (we may sync fast changing RAM 
over and over).


3. My fix in this patch won't solve a similar problem that could happen, 
where a VFIO device with a lot of pre-copy data (not necessarily initial 
data) may never be able to send it, thus not realizing the full 
potential of pre-copy for VFIO.
(I personally have not encountered this problem yet, but maybe it can 
happen with a vGPU).



If you agree, I can send a v2 that simply removes this logic and gives 
every device an equal chance to send its data (like Joao showed) with 
some explanation why we do it.
We could also give RAM precedence over other devices only during the 
first iteration of sending RAM (i.e., only until first dirty sync), but 
I don't know how much benefit this would give.


[1] https://qemu-project.gitlab.io/qemu/interop/live-block-operations.html




Fixes: 1b4adb10f898 ("migration: Implement switchover ack logic")
Signed-off-by: Avihai Horon 
---
  migration/savevm.h|  2 +-
  migration/migration.c |  4 ++--
  migration/savevm.c| 22 +++---
  3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/migration/savevm.h b/migration/savevm.h
index 74669733dd6..d4a368b522b 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -36,7 +36,7 @@ void qemu_savevm_state_setup(QEMUFile *f);
  bool qemu_savevm_state_guest_unplug_pending(void);
  int qemu_savevm_state_resume_prepare(Mi

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-27 Thread Joao Martins
On 27/02/2024 07:41, Peter Xu wrote:
> On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:
>> This bug was observed in several VFIO migration scenarios where some
>> workload on the VM prevented RAM from ever reaching a hard zero, not
>> allowing VFIO initial pre-copy data to be sent, and thus destination
>> could not ack switchover. Note that the same scenario, but without
>> switchover-ack, would converge.
>>
>> Fix it by not serializing device data sending during pre-copy iterative
>> phase if switchover was not acked yet.
> 
> I am still not fully convinced that it's even legal that one device can
> consume all iterator's bandwidth, ignoring the rest..  Though again it's
> not about this patch, but about commit 90697be889.
> 
> I'm thinking whether we should allow each device to have its own portion of
> chance to push data for each call to qemu_savevm_state_iterate(),
> irrelevant of vfio's switchover-ack capability.
> 

I guess that this means the only change needed is (...)

>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index d612c8a9020..3a012796375 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -1386,7 +1386,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>>   *   0 : We haven't finished, caller have to go again
>>   *   1 : We have finished, we can go to complete phase
>>   */
>> -int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
>> +int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool 
>> can_switchover)
>>  {
>>  SaveStateEntry *se;
>>  int ret = 1;
>> @@ -1430,12 +1430,20 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool 
>> postcopy)
>>   "%d(%s): %d",
>>   se->section_id, se->idstr, ret);
>>  qemu_file_set_error(f, ret);
>> +return ret;
>>  }
>> -if (ret <= 0) {
>> -/* Do not proceed to the next vmstate before this one reported
>> -   completion of the current stage. This serializes the 
>> migration
>> -   and reduces the probability that a faster changing state is
>> -   synchronized over and over again. */
>> +
>> +if (ret == 0 && can_switchover) {
>> +/*
>> + * Do not proceed to the next vmstate before this one reported
>> + * completion of the current stage. This serializes the 
>> migration
>> + * and reduces the probability that a faster changing state is
>> + * synchronized over and over again.
>> + * Do it only if migration can switchover. If migration can't
>> + * switchover yet, do proceed to let other devices send their 
>> data
>> + * too, as this may be required for switchover to be acked and
>> + * migration to converge.
>> + */
>>  break;
>>  }
>>  }

(...) is here to have:

if (ret < 0) {
break;
}

? Or you were thinking in some heuristic?

Joao



Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-26 Thread Peter Xu
On Thu, Feb 22, 2024 at 05:56:27PM +0200, Avihai Horon wrote:
> Currently, migration code serializes device data sending during pre-copy
> iterative phase. As noted in the code comment, this is done to prevent
> faster changing device from sending its data over and over.

Frankly speaking I don't understand the rational behind 90697be889 ("live
migration: Serialize vmstate saving in stage 2").  I don't even think I
noticed this logic before even if I worked on migration for a few years...

I was thinking all devices should always get its chance to run for some
period during iterations.  Do you know the reasoning behind?  And I must
confess I also know little on block migration, which seems to be relevant
to this change.  Anyway, I also copy Jan just in case he'll be able to chim
in.

If there is a fast changing device, even if we don't proceed with other
device iterators and we stick with the current one, assuming it can finally
finish dumping all data, but then we'll proceed with other devices and the
fast changing device can again accumulate dirty information?

> 
> However, with switchover-ack capability enabled, this behavior can be
> problematic and may prevent migration from converging. The problem lies
> in the fact that an earlier device may never finish sending its data and
> thus block other devices from sending theirs.

Yes, this is a problem.

> 
> This bug was observed in several VFIO migration scenarios where some
> workload on the VM prevented RAM from ever reaching a hard zero, not
> allowing VFIO initial pre-copy data to be sent, and thus destination
> could not ack switchover. Note that the same scenario, but without
> switchover-ack, would converge.
> 
> Fix it by not serializing device data sending during pre-copy iterative
> phase if switchover was not acked yet.

I am still not fully convinced that it's even legal that one device can
consume all iterator's bandwidth, ignoring the rest..  Though again it's
not about this patch, but about commit 90697be889.

I'm thinking whether we should allow each device to have its own portion of
chance to push data for each call to qemu_savevm_state_iterate(),
irrelevant of vfio's switchover-ack capability.

> 
> Fixes: 1b4adb10f898 ("migration: Implement switchover ack logic")
> Signed-off-by: Avihai Horon 
> ---
>  migration/savevm.h|  2 +-
>  migration/migration.c |  4 ++--
>  migration/savevm.c| 22 +++---
>  3 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 74669733dd6..d4a368b522b 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -36,7 +36,7 @@ void qemu_savevm_state_setup(QEMUFile *f);
>  bool qemu_savevm_state_guest_unplug_pending(void);
>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>  void qemu_savevm_state_header(QEMUFile *f);
> -int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> +int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool 
> can_switchover);
>  void qemu_savevm_state_cleanup(void);
>  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>  int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
> diff --git a/migration/migration.c b/migration/migration.c
> index ab21de2cadb..d8bfe1fb1b9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3133,7 +3133,7 @@ static MigIterateState 
> migration_iteration_run(MigrationState *s)
>  }
>  
>  /* Just another iteration step */
> -qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
> +qemu_savevm_state_iterate(s->to_dst_file, in_postcopy, can_switchover);
>  return MIG_ITERATE_RESUME;
>  }
>  
> @@ -3216,7 +3216,7 @@ static MigIterateState 
> bg_migration_iteration_run(MigrationState *s)
>  {
>  int res;
>  
> -res = qemu_savevm_state_iterate(s->to_dst_file, false);
> +res = qemu_savevm_state_iterate(s->to_dst_file, false, true);
>  if (res > 0) {
>  bg_migration_completion(s);
>  return MIG_ITERATE_BREAK;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d612c8a9020..3a012796375 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1386,7 +1386,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>   *   0 : We haven't finished, caller have to go again
>   *   1 : We have finished, we can go to complete phase
>   */
> -int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> +int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool 
> can_switchover)
>  {
>  SaveStateEntry *se;
>  int ret = 1;
> @@ -1430,12 +1430,20 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool 
> postcopy)
>   "%d(%s): %d",
>   se->section_id, se->idstr, ret);
>  qemu_file_set_error(f, ret);
> +return ret;
>  }
> -if (ret <= 0) {
> -/* Do not proceed to the next vmstate before this one reported
> -   completion of the curren

Re: [PATCH] migration: Don't serialize migration while can't switchover

2024-02-26 Thread Wang, Lei
On 2/22/2024 23:56, Avihai Horon wrote:
> Currently, migration code serializes device data sending during pre-copy
> iterative phase. As noted in the code comment, this is done to prevent
> faster changing device from sending its data over and over.
> 
> However, with switchover-ack capability enabled, this behavior can be
> problematic and may prevent migration from converging. The problem lies
> in the fact that an earlier device may never finish sending its data and
> thus block other devices from sending theirs.
> 
> This bug was observed in several VFIO migration scenarios where some
> workload on the VM prevented RAM from ever reaching a hard zero, not
> allowing VFIO initial pre-copy data to be sent, and thus destination
> could not ack switchover. Note that the same scenario, but without
> switchover-ack, would converge.
> 
> Fix it by not serializing device data sending during pre-copy iterative
> phase if switchover was not acked yet.

Hi Avihai,

Can this bug be solved by ordering the priority of different device's handlers?

> 
> Fixes: 1b4adb10f898 ("migration: Implement switchover ack logic")
> Signed-off-by: Avihai Horon 
> ---
>  migration/savevm.h|  2 +-
>  migration/migration.c |  4 ++--
>  migration/savevm.c| 22 +++---
>  3 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/savevm.h b/migration/savevm.h
> index 74669733dd6..d4a368b522b 100644
> --- a/migration/savevm.h
> +++ b/migration/savevm.h
> @@ -36,7 +36,7 @@ void qemu_savevm_state_setup(QEMUFile *f);
>  bool qemu_savevm_state_guest_unplug_pending(void);
>  int qemu_savevm_state_resume_prepare(MigrationState *s);
>  void qemu_savevm_state_header(QEMUFile *f);
> -int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
> +int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool 
> can_switchover);
>  void qemu_savevm_state_cleanup(void);
>  void qemu_savevm_state_complete_postcopy(QEMUFile *f);
>  int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
> diff --git a/migration/migration.c b/migration/migration.c
> index ab21de2cadb..d8bfe1fb1b9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3133,7 +3133,7 @@ static MigIterateState 
> migration_iteration_run(MigrationState *s)
>  }
>  
>  /* Just another iteration step */
> -qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
> +qemu_savevm_state_iterate(s->to_dst_file, in_postcopy, can_switchover);
>  return MIG_ITERATE_RESUME;
>  }
>  
> @@ -3216,7 +3216,7 @@ static MigIterateState 
> bg_migration_iteration_run(MigrationState *s)
>  {
>  int res;
>  
> -res = qemu_savevm_state_iterate(s->to_dst_file, false);
> +res = qemu_savevm_state_iterate(s->to_dst_file, false, true);
>  if (res > 0) {
>  bg_migration_completion(s);
>  return MIG_ITERATE_BREAK;
> diff --git a/migration/savevm.c b/migration/savevm.c
> index d612c8a9020..3a012796375 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1386,7 +1386,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
>   *   0 : We haven't finished, caller have to go again
>   *   1 : We have finished, we can go to complete phase
>   */
> -int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
> +int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool 
> can_switchover)
>  {
>  SaveStateEntry *se;
>  int ret = 1;
> @@ -1430,12 +1430,20 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool 
> postcopy)
>   "%d(%s): %d",
>   se->section_id, se->idstr, ret);
>  qemu_file_set_error(f, ret);
> +return ret;
>  }
> -if (ret <= 0) {
> -/* Do not proceed to the next vmstate before this one reported
> -   completion of the current stage. This serializes the migration
> -   and reduces the probability that a faster changing state is
> -   synchronized over and over again. */
> +
> +if (ret == 0 && can_switchover) {
> +/*
> + * Do not proceed to the next vmstate before this one reported
> + * completion of the current stage. This serializes the migration
> + * and reduces the probability that a faster changing state is
> + * synchronized over and over again.
> + * Do it only if migration can switchover. If migration can't
> + * switchover yet, do proceed to let other devices send their 
> data
> + * too, as this may be required for switchover to be acked and
> + * migration to converge.
> + */
>  break;
>  }
>  }
> @@ -1724,7 +1732,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
>  qemu_savevm_state_setup(f);
>  
>  while (qemu_file_get_error(f) == 0) {
> -if (qemu_savevm_state_iterate(f, false) > 0) {
> +if (qemu_savevm_state_iterate(f, false, true) 

[PATCH] migration: Don't serialize migration while can't switchover

2024-02-22 Thread Avihai Horon
Currently, migration code serializes device data sending during pre-copy
iterative phase. As noted in the code comment, this is done to prevent
faster changing device from sending its data over and over.

However, with switchover-ack capability enabled, this behavior can be
problematic and may prevent migration from converging. The problem lies
in the fact that an earlier device may never finish sending its data and
thus block other devices from sending theirs.

This bug was observed in several VFIO migration scenarios where some
workload on the VM prevented RAM from ever reaching a hard zero, not
allowing VFIO initial pre-copy data to be sent, and thus destination
could not ack switchover. Note that the same scenario, but without
switchover-ack, would converge.

Fix it by not serializing device data sending during pre-copy iterative
phase if switchover was not acked yet.

Fixes: 1b4adb10f898 ("migration: Implement switchover ack logic")
Signed-off-by: Avihai Horon 
---
 migration/savevm.h|  2 +-
 migration/migration.c |  4 ++--
 migration/savevm.c| 22 +++---
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/migration/savevm.h b/migration/savevm.h
index 74669733dd6..d4a368b522b 100644
--- a/migration/savevm.h
+++ b/migration/savevm.h
@@ -36,7 +36,7 @@ void qemu_savevm_state_setup(QEMUFile *f);
 bool qemu_savevm_state_guest_unplug_pending(void);
 int qemu_savevm_state_resume_prepare(MigrationState *s);
 void qemu_savevm_state_header(QEMUFile *f);
-int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy);
+int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool can_switchover);
 void qemu_savevm_state_cleanup(void);
 void qemu_savevm_state_complete_postcopy(QEMUFile *f);
 int qemu_savevm_state_complete_precopy(QEMUFile *f, bool iterable_only,
diff --git a/migration/migration.c b/migration/migration.c
index ab21de2cadb..d8bfe1fb1b9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3133,7 +3133,7 @@ static MigIterateState 
migration_iteration_run(MigrationState *s)
 }
 
 /* Just another iteration step */
-qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
+qemu_savevm_state_iterate(s->to_dst_file, in_postcopy, can_switchover);
 return MIG_ITERATE_RESUME;
 }
 
@@ -3216,7 +3216,7 @@ static MigIterateState 
bg_migration_iteration_run(MigrationState *s)
 {
 int res;
 
-res = qemu_savevm_state_iterate(s->to_dst_file, false);
+res = qemu_savevm_state_iterate(s->to_dst_file, false, true);
 if (res > 0) {
 bg_migration_completion(s);
 return MIG_ITERATE_BREAK;
diff --git a/migration/savevm.c b/migration/savevm.c
index d612c8a9020..3a012796375 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1386,7 +1386,7 @@ int qemu_savevm_state_resume_prepare(MigrationState *s)
  *   0 : We haven't finished, caller have to go again
  *   1 : We have finished, we can go to complete phase
  */
-int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy)
+int qemu_savevm_state_iterate(QEMUFile *f, bool postcopy, bool can_switchover)
 {
 SaveStateEntry *se;
 int ret = 1;
@@ -1430,12 +1430,20 @@ int qemu_savevm_state_iterate(QEMUFile *f, bool 
postcopy)
  "%d(%s): %d",
  se->section_id, se->idstr, ret);
 qemu_file_set_error(f, ret);
+return ret;
 }
-if (ret <= 0) {
-/* Do not proceed to the next vmstate before this one reported
-   completion of the current stage. This serializes the migration
-   and reduces the probability that a faster changing state is
-   synchronized over and over again. */
+
+if (ret == 0 && can_switchover) {
+/*
+ * Do not proceed to the next vmstate before this one reported
+ * completion of the current stage. This serializes the migration
+ * and reduces the probability that a faster changing state is
+ * synchronized over and over again.
+ * Do it only if migration can switchover. If migration can't
+ * switchover yet, do proceed to let other devices send their data
+ * too, as this may be required for switchover to be acked and
+ * migration to converge.
+ */
 break;
 }
 }
@@ -1724,7 +1732,7 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
 qemu_savevm_state_setup(f);
 
 while (qemu_file_get_error(f) == 0) {
-if (qemu_savevm_state_iterate(f, false) > 0) {
+if (qemu_savevm_state_iterate(f, false, true) > 0) {
 break;
 }
 }
-- 
2.26.3