On 20.11.2020 at 19:19, Vladimir Sementsov-Ogievskiy wrote:
> 20.11.2020 20:22, Kevin Wolf wrote:
> > On 20.11.2020 at 17:43, Vladimir Sementsov-Ogievskiy wrote:
> > > 20.11.2020 19:36, Kevin Wolf wrote:
> > > > On 20.11.2020 at 17:16, Vladimir Sementsov-Ogievskiy wrote:
> > > > > Hi all!
> > > > >
> > > > > As Peter recently noted, iotest 30 occasionally fails.
> > > > >
> > > > > I found that QEMU crashes due to interleaving of graph-update
> > > > > operations of parallel mirror and stream block jobs.
> > > >
> > > > I haven't found the time yet to properly look into this or your other
> > > > thread where you had a similar question, but there is one thing I'm
> > > > wondering: why can the nested job even make progress and run its
> > > > completion handler?
> > > >
> > > > When we modify the graph, we should have drained the subtree in
> > > > question, so in theory, while one job finishes and modifies the graph,
> > > > there should be no way for the other job to make progress and get
> > > > interleaved - it shouldn't be able to start I/O requests, much less
> > > > run its completion handler and modify the graph.
> > > >
> > > > Are we missing drained sections somewhere, or do they fail to achieve
> > > > what I think they should achieve?
> > >
> > > It looks like both jobs reach their finish simultaneously. So all
> > > progress is done in both jobs, and they go concurrently into their
> > > completion procedures, which interleave. There is no more I/O through
> > > blk, which is what the drained sections restrict.
> >
> > They can't be truly simultaneous because they run in the same thread.
> > During job completion, this is the main thread.
>
> No, they are not truly simultaneous, but their completions may
> interleave through nested aio_poll loops.
>
> > However, as soon as job_is_completed() returns true, it seems we're not
> > pausing the job any more when one of its nodes gets drained.
> >
> > Possibly also relevant: the job->busy = false in job_exit(). The comment
> > there says it's a lie, but we might deadlock otherwise.
> >
> > This problem will probably affect other callers, too, which drain a
> > subtree and then reasonably expect that nobody will modify the graph
> > until they end the drained section. So I think the problem that we need
> > to address is that jobs run their completion handlers even though they
> > are supposed to be paused by a drain.
>
> Hmm. I always thought of a drained section as something that stops I/O
> requests, not other operations. And we do graph modifications in a
> drained section to avoid in-flight I/O requests during the graph
> modification.
Is there any use for an operation that only stops I/O but doesn't
prevent graph changes? I always understood it as a request to have
exclusive access to a subtree, so that nobody else would touch it.

> > I'm not saying that your graph modification locks are a bad idea, but
> > they are probably not a complete solution.
>
> Hmm. What do you mean? It's of course not complete, as I didn't
> protect every graph modification procedure. But if we do protect all
> such things and always do graph modifications under this mutex, it
> should work, I think.

What I mean is that not only can graph modifications conflict with
each other, but most callers of drain_begin/end will probably not be
prepared for the graph changing under their feet, even if they don't
actively change the graph themselves.

Kevin
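
For readers following the thread, a minimal sketch of the pattern being
debated: graph modifications are wrapped in a drained section so that no
in-flight I/O (and, on Kevin's reading, no other graph-changing activity)
can interleave with them. bdrv_drained_begin(), bdrv_drained_end() and
bdrv_replace_node() are real QEMU block-layer functions; the wrapper
function itself is hypothetical and simplified.

/* Hypothetical, simplified illustration of a graph change inside a
 * drained section.  The crash discussed above happens when a second
 * job's completion handler also modifies the graph between the
 * drained_begin and drained_end, because a job for which
 * job_is_completed() already returns true is no longer paused by
 * the drain.
 */
#include "qemu/osdep.h"
#include "block/block.h"

static void example_graph_change(BlockDriverState *from,
                                 BlockDriverState *to,
                                 Error **errp)
{
    /* Quiesce and wait for in-flight requests on the affected node. */
    bdrv_drained_begin(from);

    /* Reroute the parents of 'from' to 'to'. */
    bdrv_replace_node(from, to, errp);

    bdrv_drained_end(from);
}

Whether the drained section is only a barrier against I/O (Vladimir's
reading) or also against concurrent graph changes by other parties
(Kevin's reading) is exactly the disagreement above.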