On Mon, May 15, 2023 at 10:32:01AM +0200, Juan Quintela wrote: > When we detect that we have broken backwards compantibility in a > released version, we can't do anything for that version. But once we > fix that bug on the next released version, we can "mitigate" that > problem when migrating to new versions to give a way out of that > machine until it does a hard reboot. > > Signed-off-by: Juan Quintela <quint...@redhat.com> > --- > docs/devel/migration.rst | 194 +++++++++++++++++++++++++++++++++++++++ > 1 file changed, 194 insertions(+) > > diff --git a/docs/devel/migration.rst b/docs/devel/migration.rst > index 95e797ee60..97b6f48474 100644 > --- a/docs/devel/migration.rst > +++ b/docs/devel/migration.rst > @@ -451,6 +451,200 @@ binary in both sides of the migration. If we use > different QEMU > versions process, then we need to have into account all other > differences and the examples become even more complicated. > > +How to mitigate when we have a backward compatibility error > +----------------------------------------------------------- > + > +We broke migration for old machine types continously during > +development. But as soon as we find that there is a problem, we fix > +it. The problem is what happens when we detect after we have done a > +release that something has gone wrong. > + > +Let see how it worked with one example. > + > +After the release of qemu-8.0 we found a problem when doing migration > +of the machine type pc-7.2. > + > +- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 > + > + This migration works > + > +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2 > + > + This migration works > + > +- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 > + > + This migration fails > + > +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2 > + > + This migration fails > + > +So clearly something fails when migration between qemu-7.2 and > +qemu-8.0 with machine type pc-7.2. The error messages, and git bisect > +pointed to this commit. > + > +In qemu-8.0 we got this commit: :: > + > + commit 9a6ef182c03eaa138bae553f0fbb5a123bef9a53 > + Author: Jonathan Cameron <jonathan.came...@huawei.com> > + Date: Thu Mar 2 13:37:03 2023 +0000 > + > + hw/pci/aer: Add missing routing for AER errors > + > +The relevant bits of the commit for our example are this ones: > + > + --- a/hw/pci/pcie_aer.c > + +++ b/hw/pci/pcie_aer.c > + @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev, > + > + pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, > + PCI_ERR_UNC_SUPPORTED); > + + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > + + PCI_ERR_UNC_MASK_DEFAULT); > + + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > + + PCI_ERR_UNC_SUPPORTED); > + > + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, > + PCI_ERR_UNC_SEVERITY_DEFAULT); > + > +The patch changes how we configure pci space for AER. But qemu fails > +when the pci space configuration is different betwwen source and > +destination. > + > +The following commit show how this got fixed: > + > +<put info of the commit once that it arrives upstream> > + > +The relevant parts of the fix are as follow: > + > +First, we create a new property for the device to be able to configure > +the old behaviour or the new behaviour. :: > + > + diff --git a/hw/pci/pci.c b/hw/pci/pci.c > + index 8a87ccc8b0..5153ad63d6 100644 > + --- a/hw/pci/pci.c > + +++ b/hw/pci/pci.c > + @@ -79,6 +79,8 @@ static Property pci_props[] = { > + DEFINE_PROP_STRING("failover_pair_id", PCIDevice, > + failover_pair_id), > + DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0), > + + DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present, > + + QEMU_PCIE_ERR_UNC_MASK_BITNR, true), > + DEFINE_PROP_END_OF_LIST() > + }; > + > +Notice that we enable te feature for new machine types. > + > +Now we see how the fix is done. This is going to depend on what kind > +of breakage happens, but in this case it is quite simple. :: > + > + diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c > + index 103667c368..374d593ead 100644 > + --- a/hw/pci/pcie_aer.c > + +++ b/hw/pci/pcie_aer.c > + @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, > + uint16_t offset, > + > + pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS, > + PCI_ERR_UNC_SUPPORTED); > + - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > + - PCI_ERR_UNC_MASK_DEFAULT); > + - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > + - PCI_ERR_UNC_SUPPORTED); > + + > + + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) { > + + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK, > + + PCI_ERR_UNC_MASK_DEFAULT); > + + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK, > + + PCI_ERR_UNC_SUPPORTED); > + + } > + > + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER, > + PCI_ERR_UNC_SEVERITY_DEFAULT); > + > +I.e. If the property bit is enabled, we configure it as we did for > +qemu-8.0. If the property bit is not set, we configure it as it was in 7.2. > + > +And now, everything that is missing is disable the feature for old > +machine types: :: > + > + diff --git a/hw/core/machine.c b/hw/core/machine.c > + index 47a34841a5..07f763eb2e 100644 > + --- a/hw/core/machine.c > + +++ b/hw/core/machine.c > + @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = { > + { "e1000e", "migrate-timadj", "off" }, > + { "virtio-mem", "x-early-migration", "false" }, > + { "migration", "x-preempt-pre-7-2", "true" }, > + + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" }, > + }; > + const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2); > + > +And now, when qemu-8.0.1 is released with this fix, all combinations > +are going to work as supposed. > + > +- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works) > +- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works) > +- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works) > +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works) > + > +So the normality has been restaured and everything is ok, no? > + > +Not really, now our matrix is much bigger. We started with the easy > +cases, migration from the same version to the same version always > +works: > + > +- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 > +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2 > +- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > + > +Now the interesting ones. When the QEMU processes versions are > +different. For the 1st set, their fail and we can do nothing, both > +versions are relased and we can't change anything. > + > +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2 > +- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2 > + > +This two are the ones that work. The whole point of making the > +change in qemu-8.0.1 release was to fix this issue: > + > +- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > +- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 > + > +But now we found that qemu-8.0 neither can migrate to qemu-7.2 not > +qemu-8.0.1. > + > +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > +- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0 -M pc-7.2 > + > +So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to > +anything except to qemu-8.0. > + > +Can we do better? > + > +Yeap. If we know that we are gonig to do this migration: > + > +- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 > + > +We can launche the appropiate devices with > + > +--device...,x-pci-e-err-unc-mask=on > + > +And now we can receive a migration from 8.0. And from now on, we can > +do that migration to new machine types if we remember to enable that > +property for pc-7.2. Notice that we need to remember, it is not > +enough to know that the source of the migration is qemu-8.0. Think of this > example: > + > +$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2 > + > +In the second migration, the source is not qemu-8.0, but we still have > +that "problem" and have that property enabled. Notice that we need to > +continue having this mark/property until we have this machine > +rebooted. But it is not a normal reboot (that don't reload qemu) we > +need the mapchine to poweroff/poweron on a fixed qemu. And from now > +on we can use the proper real machine. > + > VMState > -------
Can we release this list of things that need to be configured somewhere? Maybe in a sane format that libvirt can parse? > -- > 2.40.1