On 4/2/20 7:10 PM, Kevin Wolf wrote:
> On 02.04.2020 at 18:47, Kevin Wolf wrote:
>> So I think this is the bug: Calling blk_wait_while_drained() from
>> anywhere between blk_inc_in_flight() and blk_dec_in_flight() is wrong
>> because it will deadlock the drain operation.
>>
>> blk_aio_read/write_entry() take care of this and drop their reference
>> around blk_wait_while_drained(). But if we hit the race condition that
>> drain hasn't yet started there, but it has when we get to
>> blk_co_preadv() or blk_co_pwritev_part(), then we're in a buggy code
>> path.
> 
> With the following patch, it seems to survive for now. I'll give it some
> more testing tomorrow (also qemu-iotests to check that I didn't
> accidentally break something else.)
> 

So I only followed the discussion loosely, but tried a simple reproducer to
make sure the issue is independent of any artifacts of Dietmar's setup.

Before the patch I always got a hang before reaching the fifth drive-backup
+ block-job-cancel cycle. With your patch applied I have had no hang so far,
currently at >885 cycles (and yes, I confirmed that stress -d 5 was really
running).
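
For anyone who wants to repeat this, a drive-backup + block-job-cancel loop of
this kind can be sketched roughly as below. This is only an illustration, not
the exact script used here: the QMP socket path, the device name and the
backup target are placeholders, and the helper names are made up for the
sketch. Only the QMP commands themselves (qmp_capabilities, drive-backup,
block-job-cancel) are real.

```python
import json
import socket


def qmp_command(name, **arguments):
    """Serialize one QMP command as a single JSON line."""
    msg = {"execute": name}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg) + "\n"


def backup_cancel_cycle(device, target):
    """Return the two QMP commands that make up one cycle."""
    return [
        qmp_command("drive-backup", device=device, target=target, sync="full"),
        qmp_command("block-job-cancel", device=device),
    ]


def read_reply(stream):
    """Skip asynchronous events until a command reply arrives."""
    while True:
        line = stream.readline()
        if not line:
            raise ConnectionError("QMP socket closed")
        msg = json.loads(line)
        if "return" in msg or "error" in msg:
            return msg


def run_cycles(sock_path, device, target, cycles):
    """Run the cycle repeatedly; a hang shows up as a reply that never comes."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        stream = sock.makefile("rw")
        stream.readline()  # QMP greeting banner
        stream.write(qmp_command("qmp_capabilities"))
        stream.flush()
        read_reply(stream)
        for i in range(cycles):
            for cmd in backup_cancel_cycle(device, target):
                stream.write(cmd)
                stream.flush()
                read_reply(stream)
            print("cycle", i + 1, "done")


if __name__ == "__main__":
    # Placeholder values; adjust to the VM under test.
    run_cycles("/tmp/qmp.sock", "drive0", "/tmp/backup.img", 1000)
```

The guest should be generating disk load (e.g. stress -d 5) while the loop
runs, otherwise the race is unlikely to be hit.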

So, FWIW, the patch definitely fixes the issue, or at least the symptoms,
here. I cannot comment on its correctness or the like at all, as I'm
currently missing too much background.

cheers,
Thomas

