Hi,

Am 03.02.26 um 3:24 PM schrieb Jean-Louis Dupond:
> Hi,
> 
> Since some months we were observing disk corruption within the VM when
> enabling backups (which triggers snapshots).
> After a lot of troubleshooting, we were able to track down the commit
> that caused it:
> https://gitlab.com/qemu-project/qemu/-/
> commit/058cfca5645a9ed7cb2bdb77d15f2eacaf343694
> 
> More info in the issue:
> https://gitlab.com/qemu-project/qemu/-/issues/3273
> 
> Now this seems to be caused by a race between disabling the
> dirty_bitmaps and the tracking implemented in the mirror top layer.
> Kevin shared me a possible solution:
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index b344182c747..f76e43f22c1 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1122,6 +1122,9 @@ static int coroutine_fn mirror_run(Job *job, Error 
> **errp)
>       * accessing it.
>       */
>      mirror_top_opaque->job = s;
> +    if (s->copy_mode != MIRROR_COPY_MODE_WRITE_BLOCKING) {
> +        bdrv_disable_dirty_bitmap(s->dirty_bitmap);
> +    }
>  
>      assert(!s->dbi);
>      s->dbi = bdrv_dirty_iter_new(s->dirty_bitmap);
> @@ -2018,7 +2021,9 @@ static BlockJob *mirror_start_job(
>       * The dirty bitmap is set by bdrv_mirror_top_do_write() when not in 
> active
>       * mode.
>       */
> -    bdrv_disable_dirty_bitmap(s->dirty_bitmap);
> +    if (s->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING) {
> +        bdrv_disable_dirty_bitmap(s->dirty_bitmap);
> +    }
>  
>      bdrv_graph_wrlock_drained();
>      ret = block_job_add_bdrv(&s->common, "source", bs, 0,
> 
> 
> Running this for some hours already, and it seems to fix the issue.
> 
> Let's open up the discussion if this is the proper way to fix it, or if
> there are better alternatives :)

Yes, I do think this is the proper solution. I ended up with essentially
the same yesterday (see the Gitlab issue). I moved the disabling
unconditionally, but there should be no need to delay the disabling if
using write-blocking mode. I would suggest moving the comment on top of
the changed hunk in mirror_start_job() along to the new hunk in
mirror_run(). With that, consider the change:

Reviewed-by: Fiona Ebner <[email protected]>

Best Regards,
Fiona


Reply via email to