Am 23.11.2020 um 11:29 hat Vladimir Sementsov-Ogievskiy geschrieben: > 23.11.2020 13:10, Kevin Wolf wrote: > > Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben: > > > 20.11.2020 20:22, Kevin Wolf wrote: > > > > Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben: > > > > > 20.11.2020 19:36, Kevin Wolf wrote: > > > > > > Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben: > > > > > > > Hi all! > > > > > > > > > > > > > > As Peter recently noted, iotest 30 accidentally fails. > > > > > > > > > > > > > > I found that Qemu crashes due to interleaving of graph-update > > > > > > > operations of parallel mirror and stream block-jobs. > > > > > > > > > > > > I haven't found the time yet to properly look into this or your > > > > > > other > > > > > > thread where you had a similar question, but there is one thing I'm > > > > > > wondering: Why can the nested job even make progress and run its > > > > > > completion handler? > > > > > > > > > > > > When we modify the graph, we should have drained the subtree in > > > > > > question, so in theory while one job finishes and modifies the > > > > > > graph, > > > > > > there should be no way for the other job to make progress and get > > > > > > interleaved - it shouldn't be able to start I/O requests and much > > > > > > less > > > > > > to run its completion handler and modify the graph. > > > > > > > > > > > > Are we missing drained sections somewhere or do they fail to achieve > > > > > > what I think they should achieve? > > > > > > > > > > > > > > > > It all looks like both jobs are reached their finish simultaneously. > > > > > So, all progress is done in both jobs. And they go concurrently to > > > > > completion procedures which interleaves. So, there no more io through > > > > > blk, which is restricted by drained sections. > > > > > > > > They can't be truly simultaneous because they run in the same thread. > > > > During job completion, this is the main thread. > > > > > > No, they not truly simultaneous, but completions may interleave > > > through nested aio_poll loops. > > > > > > > > > > > However as soon as job_is_completed() returns true, it seems we're not > > > > pausing the job any more when one of its nodes gets drained. > > > > > > > > Possibly also relevant: The job->busy = false in job_exit(). The comment > > > > there says it's a lie, but we might deadlock otherwise. > > > > > > > > This problem will probably affect other callers, too, which drain a > > > > subtree and then resonably expect that nobody will modify the graph > > > > until they end the drained section. So I think the problem that we need > > > > to address is that jobs run their completion handlers even though they > > > > are supposed to be paused by a drain. > > > > > > Hmm. I always thought about drained section as about thing that stops > > > IO requests, not other operations.. And we do graph modifications in > > > drained section to avoid in-flight IO requests during graph > > > modification. > > > > Is there any use for an operation that only stops I/O, but doesn't > > prevent graph changes? > > > > I always understood it as a request to have exclusive access to a > > subtree, so that nobody else would touch it. > > > > > > I'm not saying that your graph modification locks are a bad idea, but > > > > they are probably not a complete solution. > > > > > > > > > > Hmm. What do you mean? It's of course not complete, as I didn't > > > protect every graph modification procedure.. But if we do protect all > > > such things and do graph modifications always under this mutex, it > > > should work I think. > > > > What I mean is that not only graph modifications can conflict with each > > other, but most callers of drain_begin/end will probably not be prepared > > for the graph changing under their feet, even if they don't actively > > change the graph themselves. > > > > Understand now.. Right.. Anyway, it looks as we need some kind of > mutex. As the user of drained section of course wants to do graph > modifications and even IO (for example update backing-link in > metadata). The first thing that comes to mind is to protect all > outer-most drained sections by global CoMutex and assert in > drain_begin/drain_end that the mutex is locked. > > Hmm, it also looks like RW-lock, and simple IO is "read" and something > under drained section is "write".
In a way, drain _is_ the implementation of a lock. But as you've shown, it's a buggy implementation. What I was looking at was basically fixing the one instance of a bug while leaving the design as it is. My impression is that you're looking at this more radically and want to rearchitecture the whole drain mechanism so that such bugs would be much less likely to start with. Maybe this is a good idea, but it's probably also a lot more effort. Basically, for making use of more traditional locks, the naive approach would be changing blk/bdrv_inc/dec_in_flight() so that it takes/releases an actual coroutine lock. As you suggest, probably a CoRwLock. I see a few non-trivial questions already for this part: * What about requests for which bs->in_flight is increased more than once? Do we need some sort of recursive lock for them? * How do you know whether you should take a reader or a writer lock? For drains called from coroutine context, maybe you could store the caller that "owns" the drain section in the BDS, but what about non-coroutine drains? What do you do if coroutine A drains and then (directly or indirectly) spawns coroutine B to do some work? * Multiple concurrent requests from the same origin (the drain section owner) shouldn't be serialised, so the CoRwLock needs to be taken once per origin, not once per request. Again, how do we identify origins and where do we store the common lock? * Is it desirable that requests hang in bdrv_inc_in_flight() waiting for the lock to be released? This may be in the middle of another operation that needs to complete before drain_begin can return. I seem to remember that introducing queued_requests in BlockBackend was already non-trivial because of potential deadlocks. We would have to prepare for more of the same in BlockDriverState. The BlockBackend code involves temporarily dropping the in_flight counter change that the request made, but on the BDS level we don't even know which counters we increased how often before reaching bdrv_inc_in_flight(). Do you have a different approach for placing the locks or do you have ideas for how we would find good answers for these problems? Kevin