On Thu, Jun 29, 2017 at 11:49 AM, Jeff Mahoney <je...@suse.com> wrote:
> On 6/29/17 2:46 PM, Sargun Dhillon wrote:
>> On Thu, Jun 29, 2017 at 11:42 AM, Jeff Mahoney <je...@suse.com> wrote:
>>> On 6/28/17 6:02 PM, Sargun Dhillon wrote:
>>>> On Wed, Jun 28, 2017 at 2:55 PM, Jeff Mahoney <je...@suse.com> wrote:
>>>>> On 6/27/17 5:12 PM, Jeff Mahoney wrote:
>>>>>> On 6/13/17 9:05 PM, Sargun Dhillon wrote:
>>>>>>> On Thu, Jun 8, 2017 at 11:34 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>>>>>>> I have a deadlock caught in the wild between two processes --
>>>>>>>> btrfs-cleaner, and a userspace process (Docker). Here, you can see both
>>>>>>>> of the backtraces. btrfs-cleaner is trying to get a lock on
>>>>>>>> ffff9859d360caf0, which is owned by Docker's PID. Docker, on the other
>>>>>>>> hand, is trying to get a lock on ffff9859dc0f0578, which is owned by
>>>>>>>> btrfs-cleaner's PID.
>>>>>>>>
>>>>>>>> This is on vanilla 4.11.3 without much workload. The background
>>>>>>>> workload was basically starting and stopping Docker with a medium
>>>>>>>> sized image like ubuntu:latest with sleep 5. So, snapshot creation,
>>>>>>>> destruction. And there's some stuff that's logging to btrfs.
>>>>>>
>>>>>> Hi Sargun -
>>>>>>
>>>>>> We hit this bug in testing last week.  I have a patch that I've written
>>>>>> up and have run under your reproducer for a while.  So far it hasn't
>>>>>> hit.  I'll post it shortly and CC you.  It does depend lightly on the
>>>>>> rbtree code, though.  Since we'll want this fix for -stable, I'll write
>>>>>> up a version for that too.
>>>>>
>>>>> After thinking about it a bit more, I think my patch just happens to
>>>>> make it less likely to hit but would ultimately degrade into a livelock
>>>>> where it was a deadlock previously.  I was just trylocking and
>>>>> requeuing, so both threads are allowed to do other work and maybe even
>>>>> finish but ultimately if there's a true deadlock it'll hit anyway.
>>>>>
>>>>> -Jeff
>>>>>
>>>> Does it make sense to spend the time on making it so that
>>>> btrfs-cleaner has abortable operations, and the ability to abort if
>>>> the root deletion either takes too long, or if it receives a signal?
>>>> Although, such a case may result in a livelock, to me it seems like a
>>>> lot less bad than deadlocking.
>>>
>>>
>>> For now, reverting:
>>>
>>> commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7
>>> Author: Qu Wenruo <quwen...@cn.fujitsu.com>
>>> Date:   Wed Feb 15 10:43:03 2017 +0800
>>>
>>>     btrfs: qgroup: Move half of the qgroup accounting time out of commit trans
>>>
>>> ... should do the trick.
>>>
>>> -Jeff
>>>
>> I thought it was this as well, but we still saw lock-ups even after
>> reverting this change on 4.11. They were rarer, but we still saw
>> issues with locked up btrfs-transactions. It may have been due to a
>> different issue. If you want, I can try to revert this and run a
>> workload on it to see where the exact lock-up is.
>
> Yeah, I'd be interested in those results.
>
> -Jeff
>
>
> --
> Jeff Mahoney
> SUSE Labs
>
Thanks Jeff,
Upon further analysis, it looks like rolling this back fixed the
btrfs-cleaner lock up, but we're now seeing a different hard lockup,
where num_writers on the current transaction gets stuck at 2.
