23.11.2020 14:10, Kevin Wolf wrote:
Am 23.11.2020 um 11:29 hat Vladimir Sementsov-Ogievskiy geschrieben:
23.11.2020 13:10, Kevin Wolf wrote:
Am 20.11.2020 um 19:19 hat Vladimir Sementsov-Ogievskiy geschrieben:
20.11.2020 20:22, Kevin Wolf wrote:
Am 20.11.2020 um 17:43 hat Vladimir Sementsov-Ogievskiy geschrieben:
20.11.2020 19:36, Kevin Wolf wrote:
Am 20.11.2020 um 17:16 hat Vladimir Sementsov-Ogievskiy geschrieben:
Hi all!

As Peter recently noted, iotest 30 accidentally fails.

I found that Qemu crashes due to interleaving of graph-update
operations of parallel mirror and stream block-jobs.

I haven't found the time yet to properly look into this or your other
thread where you had a similar question, but there is one thing I'm
wondering: Why can the nested job even make progress and run its
completion handler?

When we modify the graph, we should have drained the subtree in
question, so in theory while one job finishes and modifies the graph,
there should be no way for the other job to make progress and get
interleaved - it shouldn't be able to start I/O requests and much less
to run its completion handler and modify the graph.

Are we missing drained sections somewhere or do they fail to achieve
what I think they should achieve?


It all looks like both jobs are reached their finish simultaneously.
So, all progress is done in both jobs. And they go concurrently to
completion procedures which interleaves. So, there no more io through
blk, which is restricted by drained sections.

They can't be truly simultaneous because they run in the same thread.
During job completion, this is the main thread.

No, they not truly simultaneous, but completions may interleave
through nested aio_poll loops.


However as soon as job_is_completed() returns true, it seems we're not
pausing the job any more when one of its nodes gets drained.

Possibly also relevant: The job->busy = false in job_exit(). The comment
there says it's a lie, but we might deadlock otherwise.

This problem will probably affect other callers, too, which drain a
subtree and then resonably expect that nobody will modify the graph
until they end the drained section. So I think the problem that we need
to address is that jobs run their completion handlers even though they
are supposed to be paused by a drain.

Hmm. I always thought about drained section as about thing that stops
IO requests, not other operations.. And we do graph modifications in
drained section to avoid in-flight IO requests during graph
modification.

Is there any use for an operation that only stops I/O, but doesn't
prevent graph changes?

I always understood it as a request to have exclusive access to a
subtree, so that nobody else would touch it.

I'm not saying that your graph modification locks are a bad idea, but
they are probably not a complete solution.


Hmm. What do you mean? It's of course not complete, as I didn't
protect every graph modification procedure.. But if we do protect all
such things and do graph modifications always under this mutex, it
should work I think.

What I mean is that not only graph modifications can conflict with each
other, but most callers of drain_begin/end will probably not be prepared
for the graph changing under their feet, even if they don't actively
change the graph themselves.


Understand now.. Right.. Anyway, it looks as we need some kind of
mutex. As the user of drained section of course wants to do graph
modifications and even IO (for example update backing-link in
metadata). The first thing that comes to mind is to protect all
outer-most drained sections by global CoMutex and assert in
drain_begin/drain_end that the mutex is locked.

Hmm, it also looks like RW-lock, and simple IO is "read" and something
under drained section is "write".

In a way, drain _is_ the implementation of a lock. But as you've shown,
it's a buggy implementation.

What I was looking at was basically fixing the one instance of a bug
while leaving the design as it is.

My impression is that you're looking at this more radically and want to
rearchitecture the whole drain mechanism so that such bugs would be much
less likely to start with. Maybe this is a good idea, but it's probably
also a lot more effort.

Basically, for making use of more traditional locks, the naive approach
would be changing blk/bdrv_inc/dec_in_flight() so that it takes/releases
an actual coroutine lock. As you suggest, probably a CoRwLock.

I see a few non-trivial questions already for this part:

* What about requests for which bs->in_flight is increased more than
   once? Do we need some sort of recursive lock for them?

Is there reasonable example? I'd avoid recursive locks if possible, it doesn't 
make things simpler.


* How do you know whether you should take a reader or a writer lock? For
   drains called from coroutine context, maybe you could store the caller
   that "owns" the drain section in the BDS, but what about non-coroutine
   drains?

Intuitively, readers are all IO requests and writers are drained sections.

Why should we store drained-section owner somewhere?

Ah, we need to do "read" when holding write lock.. So we need a possibility for 
readers to operate without taking lock, when write lock is taken by current[*] coroutine.

And I don't see the way to make synchronization without moving everything to 
coroutine.
In non coroutine we'll dead lock on nested aio_poll loop which will do nested 
drain of another job. In coroutine we can wait on mutex more efficiently.


   What do you do if coroutine A drains and then (directly or indirectly)
   spawns coroutine B to do some work?

And that means that [*] is not current coroutine but may be some another 
coroutine started indirectly by lock owner..

So, just RW lock is not enough.. We need something like RW lock but with a 
possibility to call read operations while write lock is taken (by owner of 
write lock)..


* Multiple concurrent requests from the same origin (the drain section
   owner) shouldn't be serialised, so the CoRwLock needs to be taken once
   per origin, not once per request. Again, how do we identify origins
   and where do we store the common lock?

This makes things complex. Why not to take lock per request? IO requests will 
take read lock and go in parallel.


* Is it desirable that requests hang in bdrv_inc_in_flight() waiting for
   the lock to be released? This may be in the middle of another
   operation that needs to complete before drain_begin can return.

drain_begin should take write lock. This automatically means than all read 
locks are released..


   I seem to remember that introducing queued_requests in BlockBackend
   was already non-trivial because of potential deadlocks. We would have
   to prepare for more of the same in BlockDriverState.

   The BlockBackend code involves temporarily dropping the in_flight
   counter change that the request made, but on the BDS level we don't
   even know which counters we increased how often before reaching
   bdrv_inc_in_flight().

Do you have a different approach for placing the locks or do you have
ideas for how we would find good answers for these problems?


No, I don't have good complete architecture in-mind.. I was interested in this 
because I'm preparing a series to refactor permissions-update, and it's quite 
close. But still a lot simpler than all these problems. So, I'd start from my 
already prepared series anyway.

Do you think it worth protecting by global lock some job handlers, like I 
propose in patch 05? It's of course better to move generic job_*() functions to 
coroutine, to not make same thing for each job. It will not solve the whole 
problem but at least fix 030 iotest.. I don't remember real bugs around this 
and don't think that users really run parallel stream and commit jobs. On the 
other hand the fix should not hurt.

--
Best regards,
Vladimir

Reply via email to