Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-05 Thread Fiona Ebner
Am 04.06.24 um 17:28 schrieb Kevin Wolf:
> Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
>> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
>>> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
>>>> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
>>>>>
>>>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
>>>>> with a coroutine wrapper so that the graph lock is held for the whole
>>>>> function. Then calling bdrv_co_flush() while iterating the list is safe
>>>>> and doesn't allow concurrent graph modifications.
>>>>
>>>> The second is that iotest 255 ran into an assertion failure upon QMP 
>>>> 'quit':
>>>>
>>>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
>>>>> `!qemu_in_coroutine()' failed.
>>>>
>>>> Looking at the backtrace:
>>>>
>>>>> #5  0x762a90cc3eb2 in __GI___assert_fail
>>>>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
>>>>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
>>>>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
>>>>> at ./assert/assert.c:101
>>>>> #6  0x5afb07585311 in bdrv_graph_wrlock () at 
>>>>> ../block/graph-lock.c:113
>>>>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
>>>>> ../block/block-backend.c:901
>>>>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
>>>>> ../block/block-backend.c:487
>>>>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
>>>>> ../block/block-backend.c:547
>>>>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
>>>>> ../block/block-backend.c:618
>>>>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
>>>>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
>>>>> at block/block-gen.c:1391
>>>>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>>>>
>>>> So I guess calling bdrv_next() is not safe from a coroutine, because
>>>> the function doing the iteration could end up being the last thing to
>>>> have a reference for the BB.
>>>
>>> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
>>> is surprising, because while we hold the graph lock, no reference should
>>> be able to go away - you need the writer lock for that and you won't get
>>> it as long as bdrv_co_flush_all() locks the graph. So whatever had a
>>> reference before the bdrv_next() loop must still have it now. Do you
>>> know where it gets dropped?
>>>
>>
>> AFAICT, yes, it does hold the graph reader lock. The generated code is:
>>
>>> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
>>> {
>>> BdrvFlushAll *s = opaque;
>>>
>>> bdrv_graph_co_rdlock();
>>> s->ret = bdrv_co_flush_all();
>>> bdrv_graph_co_rdunlock();
>>> s->poll_state.in_progress = false;
>>>
>>> aio_wait_kick();
>>> }
>>
>> Apparently, when the mirror job is aborted/exits, which can happen during
>> the polling for bdrv_co_flush_all_entry(), a reference can go away without
>> the write lock being held (at least my breakpoints didn't trigger):
>>
>>> #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
>>> #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
>>> ../block/mirror.c:710
>>> #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
>>> ../block/mirror.c:823
>>> #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
>>> #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) 
>>> at ../job.c:855
>>> #5  0x5cdefb223852 in job_completed_txn_abort_locked 
>>> (job=0x5cdefeb53000) at ../job.c:958
>>> #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
>>> ../job.c:1065
>>> #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
>>> #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
>>> ../util/async.c:171
>>> #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
>>> ../util/async.c:218
>>> #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at 
>>> ../util/aio-posix.c:722
>>> #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at 
>>> ../block/block-gen.h:43
>>> #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
>>> #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, 
>>> send_stop=false)
>>> at ../system/cpus.c:297
>>> #14 0x5cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
>>> #15 0x5cdefae6d892 in qemu_cleanup (status=0) at 
>>> ../system/runstate.c:871
>>> #16 0x5cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
>>> #17 0x5cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at 
>>> ../system/main.c:48
>>
>> Looking at the code in mirror_exit_common(), it doesn't seem to acquire
>> a write lock:
>>
>>> bdrv_graph_rdunlock_main_loop();
>>>
>>> /*
>>>  * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>>>  * inserting target_bs at s->to_replace, where we might not be able to 
>>> get
>>>  * these 

Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-04 Thread Kevin Wolf
Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> > Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> >> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> >>>
> >>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> >>> with a coroutine wrapper so that the graph lock is held for the whole
> >>> function. Then calling bdrv_co_flush() while iterating the list is safe
> >>> and doesn't allow concurrent graph modifications.
> >>
> >> The second is that iotest 255 ran into an assertion failure upon QMP 
> >> 'quit':
> >>
> >>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> >>> `!qemu_in_coroutine()' failed.
> >>
> >> Looking at the backtrace:
> >>
> >>> #5  0x762a90cc3eb2 in __GI___assert_fail
> >>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> >>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> >>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> >>> at ./assert/assert.c:101
> >>> #6  0x5afb07585311 in bdrv_graph_wrlock () at 
> >>> ../block/graph-lock.c:113
> >>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:901
> >>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:487
> >>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> >>> ../block/block-backend.c:547
> >>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> >>> ../block/block-backend.c:618
> >>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> >>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
> >>> at block/block-gen.c:1391
> >>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
> >>
> >> So I guess calling bdrv_next() is not safe from a coroutine, because
> >> the function doing the iteration could end up being the last thing to
> >> have a reference for the BB.
> > 
> > Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> > is surprising, because while we hold the graph lock, no reference should
> > be able to go away - you need the writer lock for that and you won't get
> > it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> > reference before the bdrv_next() loop must still have it now. Do you
> > know where it gets dropped?
> > 
> 
> AFAICT, yes, it does hold the graph reader lock. The generated code is:
> 
> > static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> > {
> > BdrvFlushAll *s = opaque;
> > 
> > bdrv_graph_co_rdlock();
> > s->ret = bdrv_co_flush_all();
> > bdrv_graph_co_rdunlock();
> > s->poll_state.in_progress = false;
> > 
> > aio_wait_kick();
> > }
> 
> Apparently, when the mirror job is aborted/exits, which can happen during
> the polling for bdrv_co_flush_all_entry(), a reference can go away without
> the write lock being held (at least my breakpoints didn't trigger):
> 
> > #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> > #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
> > ../block/mirror.c:710
> > #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
> > ../block/mirror.c:823
> > #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> > #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) 
> > at ../job.c:855
> > #5  0x5cdefb223852 in job_completed_txn_abort_locked 
> > (job=0x5cdefeb53000) at ../job.c:958
> > #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
> > ../job.c:1065
> > #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> > #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
> > ../util/async.c:171
> > #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
> > ../util/async.c:218
> > #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at 
> > ../util/aio-posix.c:722
> > #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at 
> > ../block/block-gen.h:43
> > #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> > #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, 
> > send_stop=false)
> > at ../system/cpus.c:297
> > #14 0x5cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
> > #15 0x5cdefae6d892 in qemu_cleanup (status=0) at 
> > ../system/runstate.c:871
> > #16 0x5cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
> > #17 0x5cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at 
> > ../system/main.c:48
> 
> Looking at the code in mirror_exit_common(), it doesn't seem to acquire
> a write lock:
> 
> > bdrv_graph_rdunlock_main_loop();
> > 
> > /*
> >  * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
> >  * inserting target_bs at s->to_replace, where we might not be able to 
> > get
> >  * these permissions.
> >  */
> > 

Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-04 Thread Fiona Ebner
Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
>> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
>>>
>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
>>> with a coroutine wrapper so that the graph lock is held for the whole
>>> function. Then calling bdrv_co_flush() while iterating the list is safe
>>> and doesn't allow concurrent graph modifications.
>>
>> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
>>
>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
>>> `!qemu_in_coroutine()' failed.
>>
>> Looking at the backtrace:
>>
>>> #5  0x762a90cc3eb2 in __GI___assert_fail
>>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
>>> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
>>> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
>>> at ./assert/assert.c:101
>>> #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
>>> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
>>> ../block/block-backend.c:901
>>> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
>>> ../block/block-backend.c:487
>>> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
>>> ../block/block-backend.c:547
>>> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
>>> ../block/block-backend.c:618
>>> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
>>> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
>>> at block/block-gen.c:1391
>>> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>>
>> So I guess calling bdrv_next() is not safe from a coroutine, because
>> the function doing the iteration could end up being the last thing to
>> have a reference for the BB.
> 
> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> is surprising, because while we hold the graph lock, no reference should
> be able to go away - you need the writer lock for that and you won't get
> it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> reference before the bdrv_next() loop must still have it now. Do you
> know where it gets dropped?
> 

AFAICT, yes, it does hold the graph reader lock. The generated code is:

> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> {
> BdrvFlushAll *s = opaque;
> 
> bdrv_graph_co_rdlock();
> s->ret = bdrv_co_flush_all();
> bdrv_graph_co_rdunlock();
> s->poll_state.in_progress = false;
> 
> aio_wait_kick();
> }

Apparently, when the mirror job is aborted/exits, which can happen during
the polling for bdrv_co_flush_all_entry(), a reference can go away without
the write lock being held (at least my breakpoints didn't trigger):

> #0  blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> #1  0x5cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at 
> ../block/mirror.c:710
> #2  0x5cdefb263575 in mirror_abort (job=0x5cdefeb53000) at 
> ../block/mirror.c:823
> #3  0x5cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> #4  0x5cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) at 
> ../job.c:855
> #5  0x5cdefb223852 in job_completed_txn_abort_locked (job=0x5cdefeb53000) 
> at ../job.c:958
> #6  0x5cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at 
> ../job.c:1065
> #7  0x5cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> #8  0x5cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at 
> ../util/async.c:171
> #9  0x5cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at 
> ../util/async.c:218
> #10 0x5cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at 
> ../util/aio-posix.c:722
> #11 0x5cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at 
> ../block/block-gen.h:43
> #12 0x5cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> #13 0x5cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, 
> send_stop=false)
> at ../system/cpus.c:297
> #14 0x5cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
> #15 0x5cdefae6d892 in qemu_cleanup (status=0) at ../system/runstate.c:871
> #16 0x5cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
> #17 0x5cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at 
> ../system/main.c:48

Looking at the code in mirror_exit_common(), it doesn't seem to acquire
a write lock:

> bdrv_graph_rdunlock_main_loop();
> 
> /*
>  * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>  * inserting target_bs at s->to_replace, where we might not be able to get
>  * these permissions.
>  */
> blk_unref(s->target);
> s->target = NULL;

The write lock is taken in blk_remove_bs() when the refcount drops to 0
and the BB is actually removed:

> bdrv_graph_wrlock();
> bdrv_root_unref_child(root);
> bdrv_graph_wrunlock();

Best Regards,
Fiona




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-03 Thread Kevin Wolf
Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> Hi Kevin,
> 
> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> > Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> >> The old_bs variable in bdrv_next() is currently determined by looking
> >> at the old block backend. However, if the block graph changes before
> >> the next bdrv_next() call, it might be that the associated BDS is not
> >> the same that was referenced previously. In that case, the wrong BDS
> >> is unreferenced, leading to an assertion failure later:
> >>
> >>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
> > 
> > Your change makes sense, but in theory it shouldn't make a difference.
> > The real bug is in the caller: you can't allow graph modifications while
> > iterating the list of nodes. Even if it doesn't crash (like after your
> > patch), you don't have any guarantee that you will have seen every node
> > that exists at the end - and maybe not even that you don't get the
> > same node twice.
> > 
> >> In particular, this can happen in the context of bdrv_flush_all(),
> >> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> >> a graph change (for example with a stream block job [0]).
> > 
> > The whole locking around this case is a bit tricky and would deserve
> > some cleanup.
> > 
> > The basic rule for bdrv_next() callers is that they need to hold the
> > graph reader lock as long as they are iterating the graph, otherwise
> > it's not safe. This implies that the ref/unref pairs in it should never
> > make a difference either - which is important, because at least
> > releasing the last reference is forbidden while holding the graph lock.
> > I intended to remove the ref/unref for bdrv_next(), but I didn't because
> > I realised that the callers need to be audited first that they really
> > obey the rules. You found one that would be problematic.
> > 
> > The thing that bdrv_flush_all() gets wrong is that it promises to follow
> > the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> > something that polls. The compiler can't catch this because bdrv_flush()
> > is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
> > 
> > - If called outside of coroutine context, they are GRAPH_UNLOCKED
> > - If called in coroutine context, they are GRAPH_RDLOCK
> > 
> > We should probably try harder to get rid of the mixed functions, because
> > a synchronous co_wrapper_bdrv_rdlock could actually be marked
> > GRAPH_UNLOCKED in the code and then the compiler could catch this case.
> > 
> > The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> > with a coroutine wrapper so that the graph lock is held for the whole
> > function. Then calling bdrv_co_flush() while iterating the list is safe
> > and doesn't allow concurrent graph modifications.
> 
> I attempted this now, but ran into two issues:
> 
> The first is that I had to add support for a function without arguments
> to the block-coroutine-wrapper.py script. I can send this as an RFC in
> any case if desired.

I think I wouldn't necessarily merge it if we don't have code that makes
use of it, but having it on the list as an RFC can't hurt either way.
Then we can come back to it once we do need it for something.

> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
> 
> > ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> > `!qemu_in_coroutine()' failed.
> 
> Looking at the backtrace:
> 
> > #5  0x762a90cc3eb2 in __GI___assert_fail
> > (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> > "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> > <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> > at ./assert/assert.c:101
> > #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> > #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:901
> > #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:487
> > #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> > ../block/block-backend.c:547
> > #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> > ../block/block-backend.c:618
> > #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> > #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) 
> > at block/block-gen.c:1391
> > #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
> 
> So I guess calling bdrv_next() is not safe from a coroutine, because
> the function doing the iteration could end up being the last thing to
> have a reference for the BB.

Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
is surprising, because while we hold the graph lock, no reference should
be able to go away - you need the writer lock for that and you won't get
it as long as bdrv_co_flush_all() locks the graph. So whatever had a
reference before the bdrv_next() loop must 

Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-06-03 Thread Fiona Ebner
Hi Kevin,

Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
>> The old_bs variable in bdrv_next() is currently determined by looking
>> at the old block backend. However, if the block graph changes before
>> the next bdrv_next() call, it might be that the associated BDS is not
>> the same that was referenced previously. In that case, the wrong BDS
>> is unreferenced, leading to an assertion failure later:
>>
>>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
> 
> Your change makes sense, but in theory it shouldn't make a difference.
> The real bug is in the caller: you can't allow graph modifications while
> iterating the list of nodes. Even if it doesn't crash (like after your
> patch), you don't have any guarantee that you will have seen every node
> that exists at the end - and maybe not even that you don't get the
> same node twice.
> 
>> In particular, this can happen in the context of bdrv_flush_all(),
>> when polling for bdrv_co_flush() in the generated co-wrapper leads to
>> a graph change (for example with a stream block job [0]).
> 
> The whole locking around this case is a bit tricky and would deserve
> some cleanup.
> 
> The basic rule for bdrv_next() callers is that they need to hold the
> graph reader lock as long as they are iterating the graph, otherwise
> it's not safe. This implies that the ref/unref pairs in it should never
> make a difference either - which is important, because at least
> releasing the last reference is forbidden while holding the graph lock.
> I intended to remove the ref/unref for bdrv_next(), but I didn't because
> I realised that the callers need to be audited first that they really
> obey the rules. You found one that would be problematic.
> 
> The thing that bdrv_flush_all() gets wrong is that it promises to follow
> the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> something that polls. The compiler can't catch this because bdrv_flush()
> is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
> 
> - If called outside of coroutine context, they are GRAPH_UNLOCKED
> - If called in coroutine context, they are GRAPH_RDLOCK
> 
> We should probably try harder to get rid of the mixed functions, because
> a synchronous co_wrapper_bdrv_rdlock could actually be marked
> GRAPH_UNLOCKED in the code and then the compiler could catch this case.
> 
> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> with a coroutine wrapper so that the graph lock is held for the whole
> function. Then calling bdrv_co_flush() while iterating the list is safe
> and doesn't allow concurrent graph modifications.

I attempted this now, but ran into two issues:

The first is that I had to add support for a function without arguments
to the block-coroutine-wrapper.py script. I can send this as an RFC in
any case if desired.
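
For reference, the declaration pair such a conversion needs is roughly the
following (an assumption based on the existing co_wrapper annotations, not a
quote from the RFC):

/* Hedged sketch of the declarations (not the actual RFC): the generator
 * emits a synchronous bdrv_flush_all() that wraps bdrv_co_flush_all(). */
int co_wrapper_bdrv_rdlock bdrv_flush_all(void);

int coroutine_fn GRAPH_RDLOCK bdrv_co_flush_all(void);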

The second is that iotest 255 ran into an assertion failure upon QMP 'quit':

> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion 
> `!qemu_in_coroutine()' failed.

Looking at the backtrace:

> #5  0x762a90cc3eb2 in __GI___assert_fail
> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 
> "../block/graph-lock.c", line=113, function=0x5afb07991f20 
> <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> at ./assert/assert.c:101
> #6  0x5afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> #7  0x5afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at 
> ../block/block-backend.c:901
> #8  0x5afb075729a7 in blk_delete (blk=0x5afb0af99420) at 
> ../block/block-backend.c:487
> #9  0x5afb07572d88 in blk_unref (blk=0x5afb0af99420) at 
> ../block/block-backend.c:547
> #10 0x5afb07572fe8 in bdrv_next (it=0x762a852fef00) at 
> ../block/block-backend.c:618
> #11 0x5afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> #12 0x5afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x712c6050) at 
> block/block-gen.c:1391
> #13 0x5afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)

So I guess calling bdrv_next() is not safe from a coroutine, because the
function doing the iteration could end up being the last thing to have a
reference for the BB.

Best Regards,
Fiona




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-26 Thread Kevin Wolf
Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
> 
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.

Your change makes sense, but in theory it shouldn't make a difference.
The real bug is in the caller: you can't allow graph modifications while
iterating the list of nodes. Even if it doesn't crash (like after your
patch), you don't have any guarantee that you will have seen every node
that exists at the end - and maybe not even that you don't get the
same node twice.

> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).

The whole locking around this case is a bit tricky and would deserve
some cleanup.

The basic rule for bdrv_next() callers is that they need to hold the
graph reader lock as long as they are iterating the graph, otherwise
it's not safe. This implies that the ref/unref pairs in it should never
make a difference either - which is important, because at least
releasing the last reference is forbidden while holding the graph lock.
I intended to remove the ref/unref for bdrv_next(), but I didn't because
I realised that the callers need to be audited first that they really
obey the rules. You found one that would be problematic.

The thing that bdrv_flush_all() gets wrong is that it promises to follow
the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
something that polls. The compiler can't catch this because bdrv_flush()
is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:

- If called outside of coroutine context, they are GRAPH_UNLOCKED
- If called in coroutine context, they are GRAPH_RDLOCK
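
Condensed, the problematic bdrv_flush_all() shape looks roughly like this (a
sketch of the pattern, not the verbatim code in block/io.c):

int bdrv_flush_all(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs = NULL;
    int result = 0;

    GLOBAL_STATE_CODE();
    /* Only valid as long as nothing inside the guarded region polls */
    GRAPH_RDLOCK_GUARD_MAINLOOP();

    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        /*
         * Mixed wrapper: called outside coroutine context, this polls the
         * main loop, so job completion BHs can run and modify the graph
         * while we are still iterating.
         */
        int ret = bdrv_flush(bs);
        if (ret < 0 && !result) {
            result = ret;
        }
    }

    return result;
}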

We should probably try harder to get rid of the mixed functions, because
a synchronous co_wrapper_bdrv_rdlock could actually be marked
GRAPH_UNLOCKED in the code and then the compiler could catch this case.

The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
with a coroutine wrapper so that the graph lock is held for the whole
function. Then calling bdrv_co_flush() while iterating the list is safe
and doesn't allow concurrent graph modifications.
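
A sketch of that direction (the exact shape here is an assumption, not a
posted patch; it just follows the existing co_wrapper_bdrv_rdlock convention,
with the generated wrapper taking the reader lock around the whole call):

int coroutine_fn GRAPH_RDLOCK bdrv_co_flush_all(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs = NULL;
    int result = 0;

    /*
     * The generated synchronous wrapper holds bdrv_graph_co_rdlock()
     * around this entire function, so no writer can modify the graph
     * while we iterate with bdrv_next().
     */
    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        int ret = bdrv_co_flush(bs);
        if (ret < 0 && !result) {
            result = ret;
        }
    }

    return result;
}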

Kevin




Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-25 Thread Stefan Hajnoczi
On Fri, Mar 22, 2024 at 10:50:07AM +0100, Fiona Ebner wrote:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
> 
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.
> 
> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).
> 
> A racy reproducer:
> 
> > #!/bin/bash
> > rm -f /tmp/backing.qcow2
> > rm -f /tmp/top.qcow2
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev 
> > qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > < > {"execute": "qmp_capabilities"}
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> > "node0" } }
> > {"execute": "quit"}
> > EOF
> 
> [0]:
> 
> > #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> > #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> > errp=...)
> > #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> > detach_subchain=..., errp=...)
> > #3  bdrv_drop_filter (bs=..., errp=...)
> > #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> > #5  stream_prepare (job=...)
> > #6  job_prepare_locked (job=...)
> > #7  job_txn_apply_locked (fn=..., job=...)
> > #8  job_do_finalize_locked (job=...)
> > #9  job_exit (opaque=...)
> > #10 aio_bh_poll (ctx=...)
> > #11 aio_poll (ctx=..., blocking=...)
> > #12 bdrv_poll_co (s=...)
> > #13 bdrv_flush (bs=...)
> > #14 bdrv_flush_all ()
> > #15 do_vm_stop (state=..., send_stop=...)
> > #16 vm_shutdown ()
> 
> Signed-off-by: Fiona Ebner 
> ---
> 
> No changes in v3.
> New in v2.
> 
>  block/block-backend.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)

Reviewed-by: Stefan Hajnoczi 




[PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes

2024-03-22 Thread Fiona Ebner
The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:

> bdrv_unref: Assertion `bs->refcnt > 0' failed.

In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).

A racy reproducer:

> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev 
> qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> < {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": 
> "node0" } }
> {"execute": "quit"}
> EOF

[0]:

> #0  bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1  bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., 
> errp=...)
> #2  bdrv_replace_node_common (from=..., to=..., auto_skip=..., 
> detach_subchain=..., errp=...)
> #3  bdrv_drop_filter (bs=..., errp=...)
> #4  bdrv_cor_filter_drop (cor_filter_bs=...)
> #5  stream_prepare (job=...)
> #6  job_prepare_locked (job=...)
> #7  job_txn_apply_locked (fn=..., job=...)
> #8  job_do_finalize_locked (job=...)
> #9  job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()

Signed-off-by: Fiona Ebner 
---

No changes in v3.
New in v2.

 block/block-backend.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
     /* Must be called from the main loop */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
 
+    old_bs = it->bs;
+
     /* First, return all root nodes of BlockBackends. In order to avoid
      * returning a BDS twice when multiple BBs refer to it, we only return it
      * if the BB is the first one in the parent list of the BDS. */
     if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
         BlockBackend *old_blk = it->blk;
 
-        old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
         do {
             it->blk = blk_all_next(it->blk);
             bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
         if (bs) {
             bdrv_ref(bs);
             bdrv_unref(old_bs);
+            it->bs = bs;
             return bs;
         }
         it->phase = BDRV_NEXT_MONITOR_OWNED;
-    } else {
-        old_bs = it->bs;
     }
 
     /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.39.2