On 12.11.2015 08:34, Fam Zheng wrote: > On Mon, 11/09 23:39, Max Reitz wrote: >> This patch rewrites bdrv_close_all(): Until now, all root BDSs have been >> force-closed. This is bad because it can lead to cached data not being >> flushed to disk. >> >> Instead, try to make all reference holders relinquish their reference >> voluntarily: >> >> 1. All BlockBackend users are handled by making all BBs simply eject >> their BDS tree. Since a BDS can never be on top of a BB, this will >> not cause any of the issues as seen with the force-closing of BDSs. >> The references will be relinquished and any further access to the BB >> will fail gracefully. >> 2. All BDSs which are owned by the monitor itself (because they do not >> have a BB) are relinquished next. >> 3. Besides BBs and the monitor, block jobs and other BDSs are the only >> things left that can hold a reference to BDSs. After every remaining >> block job has been canceled, there should not be any BDSs left (and >> the loop added here will always terminate (as long as NDEBUG is not >> defined), because either all_bdrv_states will be empty or there will >> not be any block job left to cancel, failing the assertion). >> >> Signed-off-by: Max Reitz <mre...@redhat.com> >> Reviewed-by: Kevin Wolf <kw...@redhat.com> >> --- >> block.c | 45 +++++++++++++++++++++++++++++++++++++-------- >> 1 file changed, 37 insertions(+), 8 deletions(-) >> >> diff --git a/block.c b/block.c >> index b935dff..e3d5aea 100644 >> --- a/block.c >> +++ b/block.c >> @@ -1905,9 +1905,7 @@ static void bdrv_close(BlockDriverState *bs) >> { >> BdrvAioNotifier *ban, *ban_next; >> >> - if (bs->job) { >> - block_job_cancel_sync(bs->job); >> - } >> + assert(!bs->job); >> >> /* Disable I/O limits and drain all pending throttled requests */ >> if (bs->throttle_state) { >> @@ -1971,13 +1969,44 @@ static void bdrv_close(BlockDriverState *bs) >> void bdrv_close_all(void) >> { >> BlockDriverState *bs; >> + AioContext *aio_context; >> + int original_refcount = 0; >> >> - QTAILQ_FOREACH(bs, &bdrv_states, device_list) { >> - AioContext *aio_context = bdrv_get_aio_context(bs); >> + /* Drop references from requests still in flight, such as canceled block >> + * jobs whose AIO context has not been polled yet */ >> + bdrv_drain_all(); >> >> - aio_context_acquire(aio_context); >> - bdrv_close(bs); >> - aio_context_release(aio_context); >> + blockdev_close_all_bdrv_states(); >> + blk_remove_all_bs(); >> + >> + /* Cancel all block jobs */ >> + while (!QTAILQ_EMPTY(&all_bdrv_states)) { >> + QTAILQ_FOREACH(bs, &all_bdrv_states, bs_list) { >> + aio_context = bdrv_get_aio_context(bs); >> + >> + aio_context_acquire(aio_context); >> + if (bs->job) { >> + /* So we can safely query the current refcount */ >> + bdrv_ref(bs); >> + original_refcount = bs->refcnt; >> + >> + block_job_cancel_sync(bs->job); >> + aio_context_release(aio_context); >> + break; >> + } >> + aio_context_release(aio_context); >> + } >> + >> + /* All the remaining BlockDriverStates are referenced directly or >> + * indirectly from block jobs, so there needs to be at least one BDS >> + * directly used by a block job */ >> + assert(bs); >> + >> + /* Wait for the block job to release its reference */ >> + while (bs->refcnt >= original_refcount) { >> + aio_poll(aio_context, true); >> + } >> + bdrv_unref(bs); > > If at this point bs->refcnt is greater than 1, why don't we care where are the > remaining references from?
We do care. A BDS will not be removed from all_bdrv_states until it is deleted (i.e. its refcount becomes 0). Therefore, this loop will continue until all BDSs have been deleted. So where might additional references come from? Since this loop only cares about direct or indirect references from block jobs, that's exactly it: (1) You might have multiple block jobs running on a BDS in the future. Then, you'll cancel them one by one, and after having canceled the first one, the refcount will still be greater than one before the bdrv_unref(). (2) Imagine a BDS A with a parent BDS B. There are block jobs running on both of them. Now, B is referenced by both its block job and by A (indirectly by the block job referencing A). If we cancel the job on B before the one on A, then the refcount on B will still be greater than 1 before bdrv_unref() because it is still referenced by its parent (until the block job on A is canceled, too). The first cannot happen right now, but the second one may, I'm not sure (depending on whether op blockers allow it). And we do make sure that there are no additional references besides: - directly from the monitor (monitor-owned BDSs) - from a BB - from block jobs - (+ everything transitively through the respective BDS tree) If there were additional references, the inner loop would at some point no longer find a BDS with a block job while all_bdrv_states is still not empty. That's what the "assert(bs)" after the inner loop is for. If you can imagine another way where a reference to a BDS may come from, that would be a bug in this patch and we have to make sure to respect that case, too. Max
signature.asc
Description: OpenPGP digital signature