Hey Michael,
The fix for this landed as 9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
upstream, present in v10.2.0-rc1 and released in v10.2.0. I'll add the
upstream link to the bug description.
** Description changed:
[ Impact ]
When running `block-stream` and `query-named-block-nodes` concurrently,
a null-pointer dereference causes QEMU to segfault.
The original reporter of this issue experienced the bug while performing
concurrent libvirt `virDomainBlockPull` calls on the same VM/different
disks. The race condition occurs at the end of the `block-stream` QMP;
libvirt's handler for a completed `block-stream`
(`qemuBlockJobProcessEventCompletedPull` [1]) calls
`query-named-block-nodes` (see "libvirt trace" below for a full trace).
This occurs in every version of QEMU shipped with Ubuntu, 22.04 through
25.10.
[1] qemuBlockJobProcessEventCompletedPull
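The affected range can be sanity-checked against a local QEMU version string with a plain `sort -V` comparison; a small sketch (the version below is a made-up example, and the cutoff reflects the v10.2.0 fix noted above):

```sh
#!/bin/bash
# Hypothetical helper: versions below v10.2.0 (where the fix landed) are
# treated as affected here for illustration; per the description the bug
# is present at least in the QEMU versions shipped from 22.04 to 25.10.
# "10.1.0" is a made-up example, not a specific Ubuntu package version.
ver="10.1.0"
fixed="10.2.0"
if [ "$(printf '%s\n' "$ver" "$fixed" | sort -V | head -n1)" = "$ver" ] \
   && [ "$ver" != "$fixed" ]; then
    echo "affected"
else
    echo "fixed"
fi
```

With `ver="10.1.0"` this prints `affected`.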
[ Test Plan ]
```
sudo apt install libvirt-daemon-system virtinst
```
In `query-named-block-nodes.sh`:
```sh
#!/bin/bash
while true; do
    virsh qemu-monitor-command "$1" query-named-block-nodes > /dev/null
done
```
In `blockrebase-crash.sh`:
```sh
#!/bin/bash
set -ex
domain="$1"
if [ -z "${domain}" ]; then
    echo "Missing domain name"
    exit 1
fi
./query-named-block-nodes.sh "${domain}" &
query_pid=$!
while [ -n "$(virsh list --uuid)" ]; do
    snap="snap0-$(uuidgen)"

    virsh snapshot-create-as "${domain}" \
        --name "${snap}" \
        --disk-only file= \
        --diskspec vda,snapshot=no \
        --diskspec "vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_${snap}.qcow2" \
        --atomic \
        --no-metadata

    virsh blockpull "${domain}" vdb

    while bjr=$(virsh blockjob "$domain" vdb); do
        if [[ "$bjr" == *"No current block job for"* ]]; then
            break
        fi
    done
done
kill "${query_pid}"
```
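The inner loop above treats the pull as finished once `virsh blockjob` reports no current job; the string match it relies on can be exercised on its own with a canned stand-in for the virsh output (no VM needed):

```sh
#!/bin/bash
# "bjr" is a canned stand-in for what `virsh blockjob` might print after
# completion; in blockrebase-crash.sh it is captured from virsh itself.
bjr="No current block job for 'vdb'"
if [[ "$bjr" == *"No current block job for"* ]]; then
    echo "block job finished"
fi
```

This prints `block job finished`.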
`provision.sh` (`Ctrl + ]` after boot):
```sh
#!/bin/bash
set -ex
wget https://cloud-images.ubuntu.com/daily/server/noble/current/noble-server-cloudimg-amd64.img
sudo cp noble-server-cloudimg-amd64.img /var/lib/libvirt/images/n0-root.qcow2
sudo qemu-img create -f qcow2 /var/lib/libvirt/images/n0-blk0.qcow2 10G
touch network-config
touch meta-data
touch user-data
virt-install \
    -n n0 \
    --description "Test noble minimal" \
    --os-variant=ubuntu24.04 \
    --ram=1024 --vcpus=2 \
    --import \
    --disk path=/var/lib/libvirt/images/n0-root.qcow2,bus=virtio,cache=writethrough,size=10 \
    --disk path=/var/lib/libvirt/images/n0-blk0.qcow2,bus=virtio,cache=writethrough,size=10 \
    --graphics none \
    --network network=default \
    --cloud-init user-data="user-data,meta-data=meta-data,network-config=network-config"
```
And run the scripts to cause the crash (you may need to manually kill
query-named-block-nodes.sh):
```sh
chmod 755 provision.sh blockrebase-crash.sh query-named-block-nodes.sh
./provision.sh
./blockrebase-crash.sh n0
```
Expected behavior: `blockrebase-crash.sh` runs until "No space left on
device"
Actual behavior: QEMU crashes after a few iterations:
```
Block Pull: [81.05 %]+ bjr=
+ [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
++ virsh blockjob n0 vdb
Block Pull: [97.87 %]+ bjr=
+ [[ '' == *\N\o\ \c\u\r\r\e\n\t\ \b\l\o\c\k\ \j\o\b\ \f\o\r* ]]
++ virsh blockjob n0 vdb
error: Unable to read from monitor: Connection reset by peer
error: Unable to read from monitor: Connection reset by peer
+ bjr=
++ virsh list --uuid
+ '[' -n 4eed8ba4-300b-4488-a520-510e5b544f57 ']'
++ uuidgen
+ snap=snap0-88be23e5-696c-445d-870a-abe5f7df56c0
+ virsh snapshot-create-as n0 --name snap0-88be23e5-696c-445d-870a-abe5f7df56c0 --disk-only file= --diskspec vda,snapshot=no --diskspec vdb,stype=file,file=/var/lib/libvirt/images/n0-blk0_snap0-88be23e5-696c-445d-870a-abe5f7df56c0.qcow2 --atomic --no-metadata
error: Requested operation is not valid: domain is not running
Domain snapshot snap0-88be23e5-696c-445d-870a-abe5f7df56c0 created
+ virsh blockpull n0 vdb
error: Requested operation is not valid: domain is not running
error: Requested operation is not valid: domain is not running
wesley@nv0:~$ error: Requested operation is not valid: domain is not running
```
[ Where problems could occur ]
The only codepaths affected by this change are `block-stream` and
`blockdev-backup` [1][2]. If the code is somehow broken, we would expect
to see failures when executing these QMP commands (or the libvirt APIs
that use them, `virDomainBlockPull` and `virDomainBackupBegin` [3][4]).
As noted in the upstream commit message, the change does cause an
additional flush to occur during `blockdev-backup` QMPs.
The patch that was ultimately merged upstream was a revert of most of
[5]. _That_ patch was a workaround for a blockdev permissions issue that
was later resolved in [6] (see the end of [7] and replies for upstream
discussion). Both [5] and [6] are present in QEMU 6.2.0, so the
assumptions that led us to the upstream solution hold for Jammy.
[1]
https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.block-stream
[2]
https://qemu-project.gitlab.io/qemu/interop/qemu-qmp-ref.html#command-QMP-block-core.blockdev-backup
[3] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBlockPull
[4] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainBackupBegin
[5] https://gitlab.com/qemu-project/qemu/-/commit/3108a15cf09
[6] https://gitlab.com/qemu-project/qemu/-/commit/3860c0201924d
[7] https://lists.gnu.org/archive/html/qemu-devel/2025-10/msg06800.html
[ Other info ]
Backtrace from the coredump (source at [1]):
```
#0  bdrv_refresh_filename (bs=0x5efed72f8350) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
#1  0x00005efea73cf9dc in bdrv_block_device_info (blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8)
    at block/qapi.c:62
#2  0x00005efea7391ed3 in bdrv_named_nodes_list (flat=<optimized out>, errp=0x7ffeb829ebd8)
    at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
#3  0x00005efea7471993 in qmp_query_named_block_nodes (has_flat=<optimized out>, flat=<optimized out>,
    errp=0x7ffeb829ebd8) at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
#4  qmp_marshal_query_named_block_nodes (args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8)
    at qapi/qapi-commands-block-core.c:553
#5  0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0) at qapi/qmp-dispatch.c:128
#6  0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430) at util/async.c:219
#7  0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430) at util/aio-posix.c:436
#8  0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>,
    user_data=<optimized out>) at util/async.c:361
#9  0x00007f2b77809bfb in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#10 0x00007f2b77809e70 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
#12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
#13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
#14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
#15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0) at system/main.c:50
#16 0x00005efea6e76319 in main (argc=<optimized out>, argv=<optimized out>) at system/main.c:93
```
The libvirt logs suggest that the crash occurs right at the end of the
block job, since it reaches the "concluded" state before crashing. I
initially assumed the cause was one of:
- `stream_clean` is freeing/modifying the `cor_filter_bs` without holding a
lock that it needs to [2][3]
- `bdrv_refresh_filename` needs to handle the possibility that the QLIST of
children for a filter bs could be NULL [1]
Ultimately the fix was neither of these [4]; `bdrv_refresh_filename`
should not be able to observe a NULL list of children.
`query-named-block-nodes` iterates the global list of block nodes
`graph_bdrv_states` [5]. The offending block node (the `cor_filter_bs`,
added during a `block-stream`) was removed from the list of block nodes
_for the disk_ when the operation finished, but not removed from the
global list of block nodes until later (this is the window for the
race). The patch keeps the block node in the disk's list until it is
dropped at the end of the blockjob.
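A quick way to watch the temporary filter node come and go during the race window is to list node names from `query-named-block-nodes` while the job runs; a sketch of the extraction step, using a canned JSON fragment and made-up node names rather than real libvirt output:

```sh
#!/bin/bash
# Canned stand-in for (part of) query-named-block-nodes output; the real
# JSON would come from something like:
#   virsh qemu-monitor-command n0 query-named-block-nodes
# The node names "fmt0" and "cor0" are made up for illustration.
json='[{"node-name":"fmt0"},{"node-name":"cor0"}]'
# Pull out just the node-name values.
printf '%s' "$json" | grep -o '"node-name":"[^"]*"' | cut -d'"' -f4
```

For the canned fragment above this prints `fmt0` and `cor0`, one per line.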
[1]
https://git.launchpad.net/ubuntu/+source/qemu/tree/block.c?h=ubuntu/questing-devel#n8071
[2]
https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n131
[3]
https://git.launchpad.net/ubuntu/+source/qemu/tree/block/stream.c?h=ubuntu/questing-devel#n340
[4]
https://gitlab.com/qemu-project/qemu/-/commit/9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
[5]
https://gitlab.com/qemu-project/qemu/-/blob/v10.1.0/block.c?ref_type=tags#L72
[6]
https://gitlab.com/qemu-project/qemu/-/commit/9dbfd4e28dd11a83f54c371fade8d49a63d6dc1e
[ libvirt trace ]
`qemuBlockJobProcessEventCompletedPull` [1]
`qemuBlockJobProcessEventCompletedPullBitmaps` [2]
`qemuBlockGetNamedNodeData` [3]
`qemuMonitorBlockGetNamedNodeData` [4]
`qemuMonitorJSONBlockGetNamedNodeData` [5]
`qemuMonitorJSONQueryNamedBlockNodes` [6]
[1]
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n870
[2]
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_blockjob.c?h=applied/ubuntu/questing-devel#n807
[3]
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_block.c?h=applied/ubuntu/questing-devel#n2925
[4]
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor.c?h=applied/ubuntu/questing-devel#n2039
[5]
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2816
[6]
https://git.launchpad.net/ubuntu/+source/libvirt/tree/src/qemu/qemu_monitor_json.c?h=applied/ubuntu/questing-devel#n2159
--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2126951
Title:
`block-stream` segfault with concurrent `query-named-block-nodes`
To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/2126951/+subscriptions
--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs