On 10/05/2023 19:41, Juan Quintela wrote:


Avihai Horon <avih...@nvidia.com> wrote:

You have a point here.
But I will approach this case in a different way:

Destination QEMU needs to be older, because it doesn't have the feature.
So we need to NOT be able to do the switchover for older machine
types.
And have something like this in qemu/hw/core/machine.c

GlobalProperty hw_compat_7_2[] = {
      { "our_device", "explicit-switchover", "off" },
};

Or whatever we want to call the device and the property, and not use the
feature for older machine types, so that migration keeps working for them.
Let me see if I get this straight (I'm not that familiar with
hw_compat_x_y):

You mean that device Y, which adds support for explicit-switchover in
QEMU version Z, should add a property like the one you wrote above, and
use it to disable explicit-switchover usage for Y devices when a Y
device from a QEMU older than Z is migrated?
It is about more than "from" and "to".

Let me elaborate.  We have two QEMUs:

QEMU version X has device dev.  Let's call it qemu-X.
QEMU version Y (X+1) adds feature foo to device dev.  Let's call it qemu-Y.

We have two machine types (for this exercise we don't care about
architectures)

PC-X.0
PC-Y.0

So, the possible combinations are:

First the easy cases, same qemu on both sides.  Different machine types.

$ qemu-X -M PC-X.0   -> to -> qemu-X -M PC-X.0

   good.  Neither side uses feature foo.

$ qemu-X -M PC-Y.0   -> to -> qemu-X -M PC-Y.0

   impossible.  qemu-X doesn't have machine type PC-Y.0, so nothing to see here.

$ qemu-Y -M PC-X.0   -> to -> qemu-Y -M PC-X.0

   good.  We have feature foo on both sides.  Notice that I recommend
   not using feature foo here; we will see why in the difficult cases.

$ qemu-Y -M PC-Y.0   -> to -> qemu-Y -M PC-Y.0

   good.  Both sides use feature foo.

Difficult cases, when we mix qemu versions.

$ qemu-X -M PC-X.0  -> to -> qemu-Y -M PC-X.0

   The source doesn't have feature foo; the destination does.
   But if we disable it for machine PC-X.0, it will work.

$ qemu-Y -M PC-X.0  -> to -> qemu-X -M PC-X.0

   Same as the previous example, but here we have feature foo on the
   source and not on the destination.  Disabling it for machine PC-X.0
   fixes the problem.

We can't migrate a PC-Y.0 machine when one of the QEMUs is qemu-X, so
that case is impossible.

Does this make more sense?

And now, how hw_compat_X_Y works.

It is an array of entries with the format:

- name of device (we give some rope here; for instance, "migration" is a
   device in this context)

- name of property: self-explanatory.  The important bit is that
   we can get the value of the property in the device code.

- value of the property: self-explanatory.

With this mechanism, what we do when we add a feature to a device that
matters for migration is:
- for the machine type of the version that we are "developing", the
   feature is enabled by default (for whatever "enabled" means for that
   feature).

- for old machine types we disable the feature, so migration works
   freely with old QEMU, as long as the old machine type is used.

- there is a way to enable the feature on the command line even for old
   machine types on new QEMU, but only developers use that, for testing.
   Normal users/admins never do that.
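To make the mechanism above concrete, here is a small self-contained sketch of how such a compat table can drive a property default.  The table format matches the GlobalProperty entries above, but the lookup helper and its names are invented for illustration; real QEMU applies these tables through the machine type's compat_props, not through a function like this.

```c
/* Illustrative sketch, NOT actual QEMU internals: a hw_compat_X_Y-style
 * table and a made-up helper that picks a property default for a given
 * machine type. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct GlobalProperty {
    const char *driver;    /* device (or pseudo-device like "migration") */
    const char *property;  /* property name */
    const char *value;     /* value forced for old machine types */
} GlobalProperty;

/* Everything that must be changed back to behave like the 7.2 release. */
static const GlobalProperty hw_compat_7_2[] = {
    { "our_device", "explicit-switchover", "off" },
};

/* Hypothetical helper: return the compat override if the machine type is
 * old enough to use this table, otherwise the new default. */
static const char *prop_default(const GlobalProperty *tab, size_t len,
                                bool old_machine_type,
                                const char *driver, const char *property,
                                const char *new_default)
{
    if (old_machine_type) {
        for (size_t i = 0; i < len; i++) {
            if (!strcmp(tab[i].driver, driver) &&
                !strcmp(tab[i].property, property)) {
                return tab[i].value;    /* old behaviour wins */
            }
        }
    }
    return new_default;                 /* new machine types get the feature */
}
```

So a PC-X.0 machine would see "off" while the current machine type sees the new default, which is exactly the per-machine-type switch described above.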

What does hw_compat_7_2 mean?

Ok, we need to know the versions.  The new version is 8.0.

hw_compat_7_2 has all the properties representing "features", defaults,
whatever, that have changed since 7.2.  In other words, what features we
need to disable to get back to the features that existed when 7.2 was
released.

To go to a real example.

In the development tree, we have:

GlobalProperty hw_compat_8_0[] = {
     { "migration", "multifd-flush-after-each-section", "on"},
};
const size_t hw_compat_8_0_len = G_N_ELEMENTS(hw_compat_8_0);

Feature is implemented in the following commits:

77c259a4cb1c9799754b48f570301ebf1de5ded8
b05292c237030343516d073b1a1e5f49ffc017a8
294e5a4034e81b3d8db03b4e0f691386f20d6ed3

When we are doing migration with multifd and we pass the end of memory
(i.e. we end one iteration through all the RAM), we need to make sure
that we don't send the same page through two channels, i.e. the contents
of the page at iteration 1 through channel 1 and the contents of the
page at iteration 2 through channel 2.  The problem is that they could
arrive out of order, and the page from iteration 1 could arrive later
than the page from iteration 2 and overwrite new data with old data.
Which is undesirable.

We could use complex algorithms to fix that, but one easy way of doing
it is:

- When we finish a run through all memory (i.e. one iteration), we flush
   all channels and make sure that everything arrives at the destination
   before starting to send data of the next iteration.  I call that
   synchronizing all channels.
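The hazard and the fix can be shown with a toy model (purely illustrative; real multifd uses threads and sockets, none of the names below are QEMU's).  Each message carries a page's contents for one iteration; without a flush, a stale iteration-1 message can be applied after a newer iteration-2 message for the same page:

```c
/* Toy model of the multifd ordering hazard.  Two channels each carry a
 * (page, data) message; the destination may drain channels in any order. */

typedef struct { int page; int data; } Msg;

/* No flush: iteration 1 sends page 0 on channel 1, iteration 2 sends
 * page 0 on channel 2, and channel 2 happens to be drained first. */
static void run_without_flush(int *ram)
{
    Msg ch1 = { 0, 100 };   /* page 0, contents from iteration 1 */
    Msg ch2 = { 0, 200 };   /* page 0, contents from iteration 2 */

    ram[ch2.page] = ch2.data;   /* channel 2 arrives first... */
    ram[ch1.page] = ch1.data;   /* ...then stale iteration-1 data
                                 * overwrites the newer data */
}

/* With a flush between iterations, every iteration-1 message has landed
 * before any iteration-2 message is even sent, so cross-channel
 * reordering can no longer mix iterations. */
static void run_with_flush(int *ram)
{
    Msg it1 = { 0, 100 };
    ram[it1.page] = it1.data;   /* iteration 1 fully delivered */

    /* ---- sync point: all channels flushed and acknowledged ---- */

    Msg it2 = { 0, 200 };
    ram[it2.page] = it2.data;   /* iteration 2 starts only now */
}
```

In the first case RAM ends up holding the iteration-1 contents; in the second it correctly holds the iteration-2 contents.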

And that is what we *should* have done.  But when I implemented the
feature, I did this synchronization every time that we finish a cycle
(around 100 milliseconds), i.e. 10 times per second.  This is called a
section for historical reasons.  And when you are migrating
multi-terabyte RAM machines with very fast networking, we end up waiting
too long on the synchronizations.

Once we detected the problem and found the cause, we changed that.  The
problem is that if we run an old QEMU against a new QEMU (or
vice versa) we would not be able to migrate, because one side
sends/expects synchronizations at different points.

So we have to maintain the old algorithm and the new algorithm.  That is
what we did here.  For machine types older than <current in development>,
i.e. 8.0, we use the old algorithm (multifd-flush-after-each-section is
"on").

But the default for new machine types is the new algorithm, much faster.
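A minimal sketch of how the property can select between the two algorithms at the two decision points (the struct, flag, and function names here are assumptions for illustration; in real QEMU the property is consulted through the migration state):

```c
/* Illustrative only: one flag, two hooks, two sync policies. */
#include <stdbool.h>

typedef struct {
    bool flush_after_each_section;  /* "on" for pre-8.0 machine types */
    int syncs;                      /* how many full-channel syncs ran */
} MigState;

/* Called at the end of every ~100 ms section. */
static void end_of_section(MigState *s)
{
    if (s->flush_after_each_section) {
        s->syncs++;     /* old algorithm: sync every section */
    }
}

/* Called when one full pass over all RAM completes. */
static void end_of_iteration(MigState *s)
{
    if (!s->flush_after_each_section) {
        s->syncs++;     /* new algorithm: sync once per iteration */
    }
}
```

With ten sections per iteration, the old policy pays ten synchronizations where the new one pays a single one, which is the speedup described above.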

I know that the explanation has been quite long, but inventing an
example would have been even more complex.

Does this make sense?

Yes, thanks a lot for the full and detailed explanation!
This indeed solves the problem in the scenario I mentioned above.

However, this relies on the fact that a device's support for this feature depends only on the QEMU version.
This is not the case for VFIO devices.
To support explicit-switchover, a VFIO device also needs host kernel support for VFIO precopy, i.e., it needs to have the VFIO_MIGRATION_PRE_COPY flag set.
So, theoretically we could have the following:
- Source and destination QEMU are the same version.
- We migrate two different VFIO devices (i.e., they don't share the same kernel driver), device X and device Y.
- The host kernel on the source supports VFIO precopy for device X but not for device Y.
- The host kernel on the destination supports VFIO precopy for both device X and device Y.
Without explicit-switchover, migration should work.
But if we enable explicit-switchover and do migration, we would end up in the same situation where switchover_pending=2 on the destination and never reaches zero, so migration is stuck.
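A toy model of that mismatch, just to make the counting concrete (every name below is invented for the example; it only mirrors the counting logic described above, not the actual patch):

```c
/* Illustrative model: the destination counts one pending ACK per device
 * whose destination side supports explicit-switchover, but only devices
 * whose source side ALSO supports it ever send the ACK. */
#include <stdbool.h>

typedef struct {
    bool src_supports;  /* source kernel has VFIO precopy for this device */
    bool dst_supports;  /* destination kernel has it */
} Dev;

/* Returns the value switchover_pending is left at on the destination
 * after all possible ACKs have arrived. */
static int remaining_pending(const Dev *devs, int n)
{
    int pending = 0;

    for (int i = 0; i < n; i++) {
        if (devs[i].dst_supports) {
            pending++;              /* destination expects an ACK */
        }
    }
    for (int i = 0; i < n; i++) {
        if (devs[i].src_supports && devs[i].dst_supports) {
            pending--;              /* only these devices ever ACK */
        }
    }
    return pending;  /* > 0 means the destination waits forever */
}
```

With device X supported on both sides and device Y supported only on the destination, the counter starts at 2 but only one ACK ever arrives, so it never reaches zero.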

This could be solved by moving the switchover_pending counter to the source and sending multiple MIG_RP explicit-switchover ACK messages. However, I also raised a concern about this in my last mail to Peter [1]: this is not guaranteed to work, depending on the device's implementation of the explicit-switchover feature.

Not sure though if I'm digging too deep into some improbable future corner cases.

Let's go back to the basic question, which is whether we need to send an "advise" message for each device that supports explicit-switchover. I think it gives us more flexibility, and although it is not needed at the moment, it might be useful in the future.

If you want, I can send a v2 that addresses the comments and simplifies the code in other areas, and we'll continue discussing the necessity of the "advise" message then.

Thanks!

[1] https://lore.kernel.org/qemu-devel/688acb4e-a4e6-428d-9124-7596e3666...@nvidia.com/
