On 6/29/17 2:46 PM, Sargun Dhillon wrote:
> On Thu, Jun 29, 2017 at 11:42 AM, Jeff Mahoney <je...@suse.com> wrote:
>> On 6/28/17 6:02 PM, Sargun Dhillon wrote:
>>> On Wed, Jun 28, 2017 at 2:55 PM, Jeff Mahoney <je...@suse.com> wrote:
>>>> On 6/27/17 5:12 PM, Jeff Mahoney wrote:
>>>>> On 6/13/17 9:05 PM, Sargun Dhillon wrote:
>>>>>> On Thu, Jun 8, 2017 at 11:34 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>>>>>> I have a deadlock caught in the wild between two processes --
>>>>>>> btrfs-cleaner and a userspace process (Docker). Here, you can see both
>>>>>>> of the backtraces. btrfs-cleaner is trying to get a lock on
>>>>>>> ffff9859d360caf0, which is owned by Docker's PID. Docker, on the other
>>>>>>> hand, is trying to get a lock on ffff9859dc0f0578, which is owned by
>>>>>>> btrfs-cleaner's PID.
>>>>>>>
>>>>>>> This is on vanilla 4.11.3 without much workload. The background
>>>>>>> workload was basically starting and stopping Docker with a medium
>>>>>>> sized image like ubuntu:latest with sleep 5. So, snapshot creation,
>>>>>>> destruction. And there's some stuff that's logging to btrfs.
>>>>>
>>>>> Hi Sargun -
>>>>>
>>>>> We hit this bug in testing last week.  I have a patch that I've written
>>>>> up and have run under your reproducer for a while.  So far it hasn't
>>>>> hit.  I'll post it shortly and CC you.  It does depend lightly on the
>>>>> rbtree code, though.  Since we'll want this fix for -stable, I'll write
>>>>> up a version for that too.
>>>>
>>>> After thinking about it a bit more, I think my patch just happens to
>>>> make it less likely to hit but would ultimately degrade into a livelock
>>>> where it was a deadlock previously.  I was just trylocking and
>>>> requeuing, so both threads are allowed to do other work and maybe even
>>>> finish but ultimately if there's a true deadlock it'll hit anyway.
>>>>
>>>> -Jeff
>>>>
>>> Does it make sense to spend the time on making it so that
>>> btrfs-cleaner has abortable operations, and the ability to abort if
>>> the root deletion either takes too long, or if it receives a signal?
>>> Although, such a case may result in a livelock, to me it seems like a
>>> lot less bad than deadlocking.
>>
>>
>> For now, reverting:
>>
>> commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7
>> Author: Qu Wenruo <quwen...@cn.fujitsu.com>
>> Date:   Wed Feb 15 10:43:03 2017 +0800
>>
>>     btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
>>
>> ... should do the trick.
>>
>> -Jeff
>>
> I thought it was this as well, but we still saw lock-ups even after
> reverting this change on 4.11. They were rarer, but we still saw
> issues with locked up btrfs-transactions. It may have been due to a
> different issue. If you want, I can try to revert this, and run a
> workload on it to see where the exact lock-up is?

Yeah, I'd be interested in those results.

-Jeff


-- 
Jeff Mahoney
SUSE Labs
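
The trylock-and-requeue behaviour discussed in the thread can be illustrated with a toy model. The sketch below is not btrfs code and is not the patch mentioned above; it is a single-threaded simulation, with hypothetical names (`run`, the lock names "A" and "B", the worker labels), of two workers that take two locks in opposite order. It assumes a worst-case lockstep schedule, which a real scheduler would rarely produce, so it only shows why trylocking and requeuing can turn a deadlock into a livelock rather than remove it. The second call uses a consistent lock order, a general technique for avoiding this kind of inversion, not something the thread proposes.

```python
# Toy model only: hypothetical names, no relation to the actual btrfs locking code.

def run(order_1, order_2, max_rounds=10):
    """Step two workers in lockstep using trylock-and-requeue.

    Each round, every pending worker trylocks its first lock, then every
    worker that got its first lock trylocks its second. A worker holding
    both is counted as finished. At the end of the round every worker
    drops whatever it holds, and unfinished workers are requeued.
    Returns the set of workers that finished within max_rounds.
    """
    owner = {}  # lock name -> worker currently holding it
    orders = {"cleaner": order_1, "docker": order_2}
    done = set()

    def trylock(lock, worker):
        # Non-blocking acquire: fail immediately if someone holds the lock.
        if lock in owner:
            return False
        owner[lock] = worker
        return True

    def unlock_all(worker):
        for lock in [l for l, w in owner.items() if w == worker]:
            del owner[lock]

    for _ in range(max_rounds):
        pending = [w for w in orders if w not in done]
        if not pending:
            break
        # Phase 1: everyone tries their first lock.
        got_first = {w: trylock(orders[w][0], w) for w in pending}
        # Phase 2: workers holding their first lock try the second.
        for w in pending:
            if got_first[w] and trylock(orders[w][1], w):
                done.add(w)
        # Back off: release everything, unfinished workers retry next round.
        for w in pending:
            unlock_all(w)
    return done


# Opposite lock order: nobody blocks, but nobody ever finishes (livelock).
print(run(("A", "B"), ("B", "A")))
# Same lock order: both workers finish within two rounds.
print(run(("A", "B"), ("A", "B")))
```

With opposite orders the workers keep taking one lock each, failing on the other, and backing off, so the returned set stays empty no matter how large `max_rounds` is. Outside this lockstep assumption the retries usually drift apart and one side gets through, which matches the observation that such a change makes the problem rarer without eliminating it.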