Kevin Wolf <[email protected]> writes:
> Am 28.05.2026 um 15:55 hat Fabiano Rosas geschrieben:
>> It's not possible to access the image file while there is an incoming
>> migration in progress, the QEMU process doesn't hold any locks to the
>> storage at this point so nodes are inactive. Attempting to drain leads
>> to an assert at bdrv_co_write_req_prepare():
>>
>> assert(!(bs->open_flags & BDRV_O_INACTIVE))
>>
>> The issue is reproducible by running iotest 181 on a host under cpu
>> load. The migration must coincide with the header already containing
>> the QED_F_NEED_CHECK flag.
>>
>> The sequence of events is as follows, with the respective call stacks
>> referenced below:
>>
>> During block device init, bdrv_qed_attach_aio_context() starts the
>> 'need_check' timer. The timer will not fire during incoming migration
>> as it uses QEMU_CLOCK_VIRTUAL (to avoid this very issue, as the code
>> comment indicates). (0)
>>
>> However, there's still bdrv_qed_drain_begin() which uses the fact that
>> the timer is live to decide whether to start the
>> qed_need_check_timer_entry() directly. (1)
>>
>> The qed_need_check_timer_entry() eventually calls into
>> qed_write_header() -> bdrv_co_pwrite() leading to the assert. (2)
>>
>> Since we don't have an API for checking if a timer is enabled, an
>> alternative is to skip this logic whenever the runstate is
>> INMIGRATE. This actually matches the setting of the BDRV_O_INACTIVE
>> flag at blockdev.c.
>
> I'm not sure what you mean by "checking if a timer is enabled", because
> bdrv_qed_drain_begin() has a timer_pending() check that sounds exactly
> what you say doesn't exist.
>
Sorry, I meant to say checking if a *clock* is enabled.
The code comment at qed_start_need_check_timer() says:
/* Use QEMU_CLOCK_VIRTUAL so we don't alter the image file while suspended
for
* migration.
*/
It seems to assume that since QEMU_CLOCK_VIRTUAL is disabled when vcpus
are paused then the timer function will never execute while the image is
inactive. If bdrv_qed_drain_begin() could check clock->enabled, then it
would be under the same assumption.
> But it's also not clear to me why s->need_check_timer is true on an
> inactive image. I don't think this is supposed to be the case?
bdrv_qed_do_open() reads the header during QEMU startup. Whatever is set
in the header at that moment will be picked up, including
QED_F_NEED_CHECK, which is what the code uses to determine the timer
should be started.
> It looks like qed is missing a .bdrv_inactivate implementation that
> cancels the timer after flushing everything (including the header) to
> disk.
>
Hmm, but bdrv_inactivate happens on the migration source. The
destination machine could still start at any moment while the image is
dirty and see the QED_F_NEED_CHECK flag. Even doing something on the
destination machine wouldn't help I think given how early the crash
happens.
>>
>> The stacks:
>>
>> (0) == issues timer_mod ==
>> #6 in qed_start_need_check_timer at ../block/qed.c:340
>> #7 in bdrv_qed_attach_aio_context at ../block/qed.c:373
>> #8 in bdrv_qed_do_open at ../block/qed.c:556
>> #9 in bdrv_qed_open_entry at ../block/qed.c:582
>> #10 in coroutine_trampoline at ../util/coroutine-ucontext.c:175
>> #0 in qemu_coroutine_switch<+120> at ../util/coroutine-ucontext.c:321
>> #1 in qemu_aio_coroutine_enter<+356> at ../util/qemu-coroutine.c:293
>> #2 in aio_co_enter<+179> at ../util/async.c:710
>> #3 in aio_co_wake<+53> at ../util/async.c:695
>> #4 in thread_pool_co_cb<+47> at ../util/thread-pool.c:283
>> #5 in thread_pool_completion_bh<+241> at ../util/thread-pool.c:202
>> #6 in aio_bh_call<+109> at ../util/async.c:173
>> #7 in aio_bh_poll<+299> at ../util/async.c:220
>> #8 in aio_poll<+690> at ../util/aio-posix.c:745
>> #9 in bdrv_qed_open<+392> at ../block/qed.c:607
>> #10 in bdrv_open_driver<+327> at ../block.c:1678
>> #11 in bdrv_open_common<+1619> at ../block.c:2008
>> #12 in bdrv_open_inherit<+2556> at ../block.c:4191
>> #13 in bdrv_open<+118> at ../block.c:4286
>> #14 in blk_new_open<+199> at ../block/block-backend.c:458
>> #15 in blockdev_init<+2011> at ../blockdev.c:612
>> #16 in drive_new<+3008> at ../blockdev.c:1008
>> #17 in drive_init_func<+51> at ../system/vl.c:662
>> #18 in qemu_opts_foreach<+227> at ../util/qemu-option.c:1148
>> #19 in configure_blockdev<+350> at ../system/vl.c:721
>> #20 in qemu_create_early_backends<+343> at ../system/vl.c:2076
>> #21 in qemu_init<+12483> at ../system/vl.c:3778
>> #22 in main<+46> at ../system/main.c:71
>>
>> (1) == sees timer_pending ==
>> #6 in bdrv_qed_drain_begin at ../block/qed.c:391
>> #7 in bdrv_do_drained_begin at ../block/io.c:366
>> #8 in bdrv_do_drained_begin_quiesce at ../block/io.c:386
>> #9 in bdrv_child_cb_drained_begin at ../block.c:1207
>> #10 in bdrv_parent_drained_begin_single at ../block/io.c:133
>> #11 in bdrv_parent_drained_begin at ../block/io.c:64
>> #12 in bdrv_do_drained_begin at ../block/io.c:364
>> #13 in bdrv_drained_begin at ../block/io.c:393
>> #14 in blk_drain at ../block/block-backend.c:2101
>> #15 in blk_unref at ../block/block-backend.c:544
>> #16 in bdrv_open_inherit at ../block.c:4197
>> #17 in bdrv_open at ../block.c:4286
>> #18 in blk_new_open at ../block/block-backend.c:458
>> #19 in blockdev_init at ../blockdev.c:612
>> #20 in drive_new at ../blockdev.c:1008
>> #21 in drive_init_func at ../system/vl.c:662
>> #22 in qemu_opts_foreach at ../util/qemu-option.c:1148
>> #23 in configure_blockdev at ../system/vl.c:721
>> #24 in qemu_create_early_backends at ../system/vl.c:2076
>> #25 in qemu_init at ../system/vl.c:3778
>> #26 in main at ../system/main.c:71
>>
>> (2) == crashes ==
>> #5 in __assert_fail (assertion="!(bs->open_flags & BDRV_O_INACTIVE)",
>> file="../block/io.c", line=1977
>> #6 in bdrv_co_write_req_prepare at ../block/io.c:1977
>> #7 in bdrv_aligned_pwritev at ../block/io.c:2099
>> #8 in bdrv_co_pwritev_part at ../block/io.c:2316
>> #9 in bdrv_co_pwritev at ../block/io.c:2233
>> #10 in bdrv_co_pwrite at ../include/block/block_int-io.h:77
>> #11 in qed_write_header at ../block/qed.c:128
>> #12 in qed_need_check_timer at ../block/qed.c:305
>> #13 in qed_need_check_timer_entry at ../block/qed.c:319
>>
>> Note that this issue is not exactly the same as what's been reported
>> in Gitlab, but given how easily this reproduces, I imagine it has to
>> be happening in that setup as well.
>>
>> Link: https://gitlab.com/qemu-project/qemu/-/work_items/3515
>> Signed-off-by: Fabiano Rosas <[email protected]>
>> ---
>> CI run: https://gitlab.com/farosas/qemu/-/pipelines/2557314306
>> ---
>> block/qed.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/block/qed.c b/block/qed.c
>> index da23a83d62..ccf32bb4ae 100644
>> --- a/block/qed.c
>> +++ b/block/qed.c
>> @@ -21,6 +21,7 @@
>> #include "qemu/module.h"
>> #include "qemu/option.h"
>> #include "qemu/memalign.h"
>> +#include "system/runstate.h"
>> #include "trace.h"
>> #include "qed.h"
>> #include "system/block-backend.h"
>> @@ -373,6 +374,11 @@ static void bdrv_qed_drain_begin(BlockDriverState *bs)
>> {
>> BDRVQEDState *s = bs->opaque;
>>
>> + /* Nodes are inactive while waiting for an incoming migration. */
>> + if (runstate_check(RUN_STATE_INMIGRATE)) {
>
> If you're interested in whether the node is inactive, this is what you
> should check. RUN_STATE_INMIGRATE is only one of the conditions in which
> it might (but doesn't have to) be true. There is bdrv_is_inactive() to
> check this.
>
> Of course, if we're fixing the root cause (that QED_F_NEED_CHECK and
> s->need_check_timer are set on an inactive image), this hunk might not
> be needed at all.
>
>> + return;
>> + }
>> +
>> /* Fire the timer immediately in order to start doing I/O as soon as the
>> * header is flushed.
>> */
>
> Kevin