Am 18.09.2018 um 16:12 hat Paolo Bonzini geschrieben:
> On 18/09/2018 13:34, Kevin Wolf wrote:
> >> But then basically the main issue is mirror.c's call to
> >> bdrv_drained_begin/end.  There are no other calls to
> >> bdrv_drained_begin/end inside coroutines IIRC.
> >
> > Coroutine or not doesn't matter. What matters is that you drain inside
> > some (high-level) operation that already needs to be completed itself
> > for drain to return.
> 
> Indeed.  However, are there any calls to bdrv_drained_begin/end that are
> inside inc_in_flight/dec_in_flight, and are not inside coroutines?
> (Sorry if I don't see the high-level issue, I'm more familiar with the
> low-level functioning...).

inc_in_flight is only one thing that signals pending activity.
Essentially, what we're interested in is everything that will make
bdrv_drain_poll() return true.

For a BDS that is in use by a block job, that BDS will be considered
busy as long as the job hasn't paused yet, because otherwise the job can
still issue new requests to the BDS (see child_job_drained_poll()).

A closely related fix in this series is "blockjob: Lie better in
child_job_drained_poll()". Jobs need to make bdrv_drain_poll() return
true for their BDSes even between returning from .run (before John's
recent series called .start) and between .prepare/.commit/.abort (which
used to be the deferred_to_main_loop part) because otherwise drain would
return, but the BH would still be pending and change the graph in the
middle of a drained section, which is obviously a big problem.

This is not purely theoretical, but a crash we observed in practice:
bdrv_reopen() suffered use-after-free crashes because after its initial
bdrv_subtree_drained_begin(), a node was removed from the graph in a
block job completion, but it still was in the ReopenQueue. (IIRC the
reopen belonged to one job and the completion to a second job that
completed at the same time.)

So there is no choice here: The bdrv_subtree_drained_begin() in reopen
_must_ make sure that the block job is really quiescent and no callbacks
for it are pending any more.

At the same time, the completion callbacks of the job (.prepare/.commit/
.abort) use drain themselves because they contain reopens and perform
graph reconfigurations. In this case, we don't want to wait for the
completion callback to finish because that's a deadlock.

Whether we want to wait on the job doesn't depend on where in its code
the job currently is. It only depends on whether the drain is issued by
the job itself or by someone else. This is what makes it hard to return
the right thing in child_job_drained_poll() because I don't see an easy
way to know the caller.

Just setting busy = false temporarily for calling bdrv_drained_begin()
in the job would be wrong: If another thread concurrently drained the
same BDS, it would still have to wait! The job being in the exact same
state, child_job_drained_poll() has to return different values in the
two threads.

> It seems to me that if we fixed the mirror bdrv_(co_)drained_begin
> case, then it would be good enough.

It's only one special case of the problem. If we could solve the problem
for the other job types, this one could be fixed the same way.

Kevin

Reply via email to