QEMU maintains disk boot order and geometry information in the lists
"fw_boot_order" and "fw_lchs". For example, boot order is derived from each
device's bootindex value.

During system reset, and only during system reset, QEMU updates the
"bootorder" and "bios-geometry" entries in fw_cfg based on the contents of
"fw_boot_order" and "fw_lchs". After the guest VM boots, the firmware
(e.g., SeaBIOS) can read the boot order from fw_cfg and boot from the disk
at the top of the list.

The reset handler fw_cfg_machine_reset() is invoked either implicitly
during instance creation or explicitly by the user via HMP/QMP.

However, users may attach or detach disks while the VM is in the prelaunch
state. Because there is no implicit reset when transitioning from prelaunch
to running, the "bootorder" and "bios-geometry" data in fw_cfg can become
stale. As a result, the firmware may be unable to locate the correct disk
to boot from.

Here is an example that demonstrates the bug.

1. Create a QEMU instance with a virtio-scsi HBA and keep it in the
prelaunch state. Use SeaBIOS rather than UEFI.

-device virtio-scsi-pci,id=scsi0,num_queues=4 \
-S \

2. First, attach the boot disk, then attach the secondary disk.

(qemu) drive_add 0 file=boot.qcow2,if=none,id=drive0
(qemu) device_add 
scsi-hd,drive=drive0,bus=scsi0.0,channel=0,scsi-id=0,lun=1,bootindex=1
(qemu) drive_add 0 file=secondary.qcow2,if=none,id=drive1
(qemu) device_add 
scsi-hd,drive=drive1,bus=scsi0.0,channel=0,scsi-id=0,lun=2,bootindex=-1

3. Start the VM from the prelaunch state. Because the "bootorder" and
"bios-geometry" data in fw_cfg is stale, SeaBIOS attempts to boot from the
secondary disk only once and then stops. As a result, the VM fails to boot.

One possible workaround is to require QEMU users to explicitly issue a
system_reset before starting a guest VM from the prelaunch state, if any
disks have been attached or detached.

Another option is to address the issue in SeaBIOS. Nowadays, SeaBIOS
attempts to boot from only a single disk. We could enhance SeaBIOS to try
multiple disks in order until boot succeeds.

Another option is to update "bootorder" and "bios-geometry" everywhere
disks are attached or detached. This may require identifying the relevant
functions across multiple device types, such as SCSI, NVMe, virtio-blk, and
IDE.

This commit fixes the issue in QEMU by ensuring that "bootorder" and
"bios-geometry" are always updated when QEMU transitions from the prelaunch
state to running.

According to runstate_transitions_def[], RUN_STATE_PRELAUNCH is allowed to
transition to four states. Only the transition to RUN_STATE_RUNNING
requires updating "bootorder" and "bios-geometry".

Co-developed-by: Joe Jin <[email protected]>
Signed-off-by: Dongli Zhang <[email protected]>
---
v1 -> v2:
  - Add new runstate tranisition notifier to track transition from
    prelaunch to running

 hw/nvram/fw_cfg.c         | 31 +++++++++++++++++++++++++++++--
 include/hw/nvram/fw_cfg.h |  1 +
 2 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/hw/nvram/fw_cfg.c b/hw/nvram/fw_cfg.c
index 1d7d835421..5154d77028 100644
--- a/hw/nvram/fw_cfg.c
+++ b/hw/nvram/fw_cfg.c
@@ -27,6 +27,7 @@
 #include "system/system.h"
 #include "system/dma.h"
 #include "system/reset.h"
+#include "system/runstate.h"
 #include "system/address-spaces.h"
 #include "hw/core/boards.h"
 #include "hw/nvram/fw_cfg.h"
@@ -965,9 +966,8 @@ bool fw_cfg_add_file_from_generator(FWCfgState *s,
     return true;
 }
 
-static void fw_cfg_machine_reset(void *opaque)
+static void __fw_cfg_machine_reset(FWCfgState *s)
 {
-    FWCfgState *s = opaque;
     void *ptr;
     size_t len;
     char *buf;
@@ -981,10 +981,37 @@ static void fw_cfg_machine_reset(void *opaque)
     g_free(ptr);
 }
 
+static void fw_cfg_machine_reset(void *opaque)
+{
+    FWCfgState *s = opaque;
+
+    __fw_cfg_machine_reset(s);
+}
+
+static void fw_cfg_runstate_transition(Notifier *notifier, void *data)
+{
+    FWCfgState *s = container_of(notifier, FWCfgState,
+                                 runstate_transition);
+    const VMRunStateTransition *transition = data;
+
+    /*
+     * According to runstate_transitions_def[], RUN_STATE_PRELAUNCH is
+     * allowed to transition to four states. Only the transition to
+     * RUN_STATE_RUNNING requires updating "bootorder" and "bios-geometry".
+     */
+    if (transition->old_state == RUN_STATE_PRELAUNCH &&
+        transition->new_state == RUN_STATE_RUNNING) {
+        __fw_cfg_machine_reset(s);
+    }
+}
+
 static void fw_cfg_machine_ready(struct Notifier *n, void *data)
 {
     FWCfgState *s = container_of(n, FWCfgState, machine_ready);
+
     qemu_register_reset(fw_cfg_machine_reset, s);
+    s->runstate_transition.notify = fw_cfg_runstate_transition;
+    qemu_add_runstate_transition_notifier(&s->runstate_transition);
 }
 
 static const Property fw_cfg_properties[] = {
diff --git a/include/hw/nvram/fw_cfg.h b/include/hw/nvram/fw_cfg.h
index 56f17a0bdc..81ac940119 100644
--- a/include/hw/nvram/fw_cfg.h
+++ b/include/hw/nvram/fw_cfg.h
@@ -66,6 +66,7 @@ struct FWCfgState {
     uint16_t cur_entry;
     uint32_t cur_offset;
     Notifier machine_ready;
+    Notifier runstate_transition;
 
     bool dma_enabled;
     dma_addr_t dma_addr;
-- 
2.39.3


Reply via email to