Hey Michael,

The fix for this landed as 9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
upstream, present in v10.2.0-rc1 and released in v10.2.0. I'll add the
upstream link to the bug description.

** Description changed:

  [ Impact ]
  
  When running `block-stream` and `query-named-block-nodes` concurrently,
  a null-pointer dereference causes QEMU to segfault.
  
  The original reporter of this issue experienced the bug while performing
  concurrent libvirt `virDomainBlockPull` calls on the same VM/different
  disks. The race condition occurs at the end of the `block-stream` QMP;
  libvirt's handler for a completed `block-stream`
  (`qemuBlockJobProcessEventCompletedPull` [1]) calls `query-named-block-
  nodes` (see "libvirt trace" below for a full trace).
  
  This occurs in every version of QEMU shipped with Ubuntu, from 22.04
  through 25.10.
  
  [1] `qemuBlockJobProcessEventCompletedPull`: https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
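  
  The interleaving can be sketched at the QMP level; this is an
  illustration of the shape of the race, not the reporter's exact workload
  (the domain name "n0" and disk target "vdb" are placeholders):
  ```sh
  #!/bin/bash
  # Sketch of the two racing monitor clients; illustrative only.
  
  # The query side of the race: the QMP command libvirt issues when a
  # block job completes.
  qmp_query() {
      printf '{"execute": "query-named-block-nodes", "arguments": {"flat": true}}'
  }
  
  # Poll the monitor until the command fails (e.g. because QEMU segfaulted).
  hammer_monitor() {  # $1 = domain name
      while virsh qemu-monitor-command "$1" "$(qmp_query)" > /dev/null; do
          :
      done
  }
  
  # The stream side: `virsh blockpull` drives QMP `block-stream` on the
  # same VM, e.g.:
  #   hammer_monitor n0 & virsh blockpull n0 vdb --wait
  ```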
  
  [ Test Plan ]
  
  ```
  sudo apt install libvirt-daemon-system virtinst
  ```
  
  In `query-named-block-nodes.sh`:
  ```sh
  #!/bin/bash
  
  while true; do
      virsh qemu-monitor-command "$1" query-named-block-nodes > /dev/null
  done
  ```
  
  In `blockrebase-crash.sh`:
  ```sh
  #!/bin/bash
  
  set -ex
  
  domain="$1"
  
  if [ -z "${domain}" ]; then
      echo "Missing domain name"
      exit 1
  fi
  
  ./query-named-block-nodes.sh "${domain}" &
  query_pid=$!
  
  while [ -n "$(virsh list --uuid)" ]; do
      snap="snap0-$(uuidgen)"

      virsh snapshot-create-as "${domain}" \
          --name "${snap}" \
          --disk-only file= \
          --diskspec vda,snapshot=no \
          --diskspec "vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_${snap}.qcow2" \
          --atomic \
          --no-metadata

      virsh blockpull "${domain}" vdb

      while bjr=$(virsh blockjob "$domain" vdb); do
          if [[ "$bjr" == *"No current block job for"* ]]; then
              break
          fi
      done
  done
  
  kill "${query_pid}"
  ```
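  
  For completeness, the completion check in the loop above can be isolated
  like this (the helper name is mine, not part of the reproducer; `virsh
  blockjob` reports "No current block job for <dev>" once the job is gone):
  ```sh
  #!/bin/bash
  # Hypothetical helper: decide from captured `virsh blockjob` output
  # whether the block job has finished.
  blockjob_done() {  # $1 = captured output
      case "$1" in
          *"No current block job for"*) return 0 ;;  # job gone: finished
          *) return 1 ;;                             # still running
      esac
  }
  
  # Usage sketch:
  #   until blockjob_done "$(virsh blockjob "$domain" vdb 2>&1)"; do :; done
  ```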
  
  `provision.sh` (press `Ctrl + ]` to detach from the guest console after boot):
  ```sh
  #!/bin/bash
  
  set -ex
  
  wget https://cloud-images.ubuntu.com/daily/server/noble/current/noble-server-cloudimg-amd64.img
  
  sudo cp noble-server-cloudimg-amd64.img /var/lib/libvirt/images/n0-root.qcow2
  sudo qemu-img create -f qcow2 /var/lib/libvirt/images/n0-blk0.qcow2 10G
  
  touch network-config
  touch meta-data
  touch user-data
  
  virt-install \
    -n n0 \
    --description "Test noble minimal" \
    --os-variant=ubuntu24.04 \
    --ram=1024 --vcpus=2 \
    --import \
    --disk path=/var/lib/libvirt/images/n0-root.qcow2,bus=virtio,cache=writethrough,size=10 \
    --disk path=/var/lib/libvirt/images/n0-blk0.qcow2,bus=virtio,cache=writethrough,size=10 \
    --graphics none \
    --network network=default \
    --cloud-init user-data="user-data,meta-data=meta-data,network-config=network-config"
  ```
  
  Then make the scripts executable and run them to trigger the crash (you
  may need to kill query-named-block-nodes.sh manually):
  ```sh
  chmod 755 provision.sh blockrebase-crash.sh query-named-block-nodes.sh
  ./provision.sh
  ./blockrebase-crash.sh n0
  ```
  ```
  
  Expected behavior: `blockrebase-crash.sh` runs until "No space left on
  device"
  
  Actual behavior: QEMU crashes after a few iterations:
  ```
  Block Pull: [81.05 %]+ bjr=
  + [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
  ++ virsh blockjob n0 vdb
  Block Pull: [97.87 %]+ bjr=
  + [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
  ++ virsh blockjob n0 vdb
  error: Unable to read from monitor: Connection reset by peer
  error: Unable to read from monitor: Connection reset by peer
  + bjr=
  ++ virsh list --uuid
  + '[' -n 4eed8ba4-300b-4488-a520-510e5b544f57 ']'
  ++ uuidgen
  + snap=snap0-88be23e5-696c-445d-870a-abe5f7df56c0
  + virsh snapshot-create-as n0 --name snap0-88be23e5-696c-445d-870a-abe5f7df56c0 --disk-only file= --diskspec vda,snapshot=no --diskspec vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_snap0-88be23e5-696c-445d-870a-abe5f7df56c0.qcow2 --atomic --no-metadata
  error: Requested operation is not valid: domain is not running
  Domain snapshot snap0-88be23e5-696c-445d-870a-abe5f7df56c0 created
  + virsh blockpull n0 vdb
  error: Requested operation is not valid: domain is not running
  error: Requested operation is not valid: domain is not running
  
  wesley@nv0:~$ error: Requested operation is not valid: domain is not running
  ```
  
  [ Where problems could occur ]
  
  The only codepaths affected by this change are `block-stream` and
  `blockdev-backup` [1][2]. If the code is somehow broken, we would expect
  to see failures when executing these QMP commands (or the libvirt APIs
  that use them, `virDomainBlockPull` and `virDomainBackupBegin` [3][4]).
  
  As noted in the upstream commit message, the change does cause an
  additional flush to occur during `blockdev-backup` QMPs.
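  
  A post-update smoke check for those two paths could look like the sketch
  below; it assumes the `n0` test domain from the Test Plan is running and
  that the installed libvirt supports `virsh backup-begin`:
  ```sh
  #!/bin/bash
  # Sketch: exercise both QMP commands touched by the patch.
  check_affected_paths() {  # $1 = domain name
      # virDomainBlockPull -> QMP block-stream
      virsh blockpull "$1" vdb --wait || return 1
      # virDomainBackupBegin -> QMP blockdev-backup (default push-mode backup)
      virsh backup-begin "$1" || return 1
  }
  ```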
  
  The patch that was ultimately merged upstream was a revert of most of
  [5]. _That_ patch was a workaround for a blockdev permissions issue that
  was later resolved in [6] (see the end of [7] and replies for upstream
  discussion). Both [5] and [6] are present in QEMU 6.2.0, so the
  assumptions that led us to the upstream solution hold for Jammy.
  
  [1] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.block-stream
  [2] https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.blockdev-backup
  [3] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockPull
  [4] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin
  [5] https://gitlab.com/qemu-project/qemu/-/commit/3108a15cf09
  [6] https://gitlab.com/qemu-project/qemu/-/commit/3860c0201924d
  [7] https://lists.gnu.org/archive/html/qemu-devel/2025-10/msg06800.html
  
  [ Other info ]
  
  Backtrace from the coredump (source at [1]):
  ```
  #0  bdrv_refresh_filename (bs=0x5efed72f8350) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
  #1  0x00005efea73cf9dc in bdrv_block_device_info (blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8) at block/qapi.c:62
  #2  0x00005efea7391ed3 in bdrv_named_nodes_list (flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
  #3  0x00005efea7471993 in qmp_query_named_block_nodes (has_flat=<optimized out>, flat=<optimized out>, errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
  #4  qmp_marshal_query_named_block_nodes (args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8) at qapi/qapi-commands-block-core.c:553
  #5  0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0) at qapi/qmp-dispatch.c:128
  #6  0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430) at util/async.c:219
  #7  0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430) at util/aio-posix.c:436
  #8  0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:361
  #9  0x00007f2b77809bfb in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #10 0x00007f2b77809e70 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
  #11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
  #12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
  #13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
  #14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
  #15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0) at system/main.c:50
  #16 0x00005efea6e76319 in main (argc=<optimized out>, argv=<optimized out>) at system/main.c:93
  ```
  
  The libvirt logs suggest that the crash occurs right at the end of the
  blockjob, since it reaches the "concluded" state before crashing. I
  assumed the cause was one of:
  - `stream_clean` is freeing/modifying the `cor_filter_bs` without
  holding a lock that it needs [2][3]
  - `bdrv_refresh_filename` needs to handle the possibility that the
  QLIST of children for a filter bs could be NULL [1]
  
  Ultimately the fix was neither of these [4]; `bdrv_refresh_filename`
  should not be able to observe a NULL list of children.
  
  `query-named-block-nodes` iterates the global list of block nodes
  `graph_bdrv_states` [5]. The offending block node (the `cor_filter_bs`,
  added during a `block-stream`) was removed from the list of block nodes
  _for the disk_ when the operation finished, but not removed from the
  global list of block nodes until later (this is the window for the
  race). The patch keeps the block node in the disk's list until it is
  dropped at the end of the blockjob.
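  
  Incidentally, the lingering filter node is visible from outside QEMU:
  entries returned by `query-named-block-nodes` carry `"drv":
  "copy-on-read"` for the filter that `block-stream` inserts. A sketch of
  spotting it (the helper name is mine):
  ```sh
  #!/bin/bash
  # Sketch: print node-names whose driver is QEMU's copy-on-read filter.
  # Reads a query-named-block-nodes "return" array as JSON on stdin.
  cor_nodes() {
      python3 -c 'import json, sys; [print(n["node-name"]) for n in json.load(sys.stdin) if n.get("drv") == "copy-on-read"]'
  }
  
  # Usage sketch: pipe the "return" array from
  #   virsh qemu-monitor-command n0 '{"execute": "query-named-block-nodes"}'
  # into cor_nodes.
  ```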
  
  [1] https://git.launchpad.net/ubuntu/+source/qemu/tree/block.c?h=ubuntu/questing-devel#n8071
  [2] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n131
  [3] https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n340
  [4] https://gitlab.com/qemu-project/qemu/-/commit/9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
  [5] https://gitlab.com/qemu-project/qemu/-/blob/v10.1.0/block.c?ref_type=tags#L72
  
  [ libvirt trace ]
  `qemuBlockJobProcessEventCompletedPull` [1]
  `qemuBlockJobProcessEventCompletedPullBitmaps` [2]
  `qemuBlockGetNamedNodeData` [3]
  `qemuMonitorBlockGetNamedNodeData` [4]
  `qemuMonitorJSONBlockGetNamedNodeData` [5]
  `qemuMonitorJSONQueryNamedBlockNodes` [6]
  
  [1] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
  [2] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n807
  [3] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_block.c?h=applied/ubuntu/questing-devel#n2925
  [4] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor.c?h=applied/ubuntu/questing-devel#n2039
  [5] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2816
  [6] https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2159

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2126951

Title:
  `block-stream` segfault with concurrent `query-named-block-nodes`

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/2126951/+subscriptions

