On 6/29/17 2:46 PM, Sargun Dhillon wrote:
> On Thu, Jun 29, 2017 at 11:42 AM, Jeff Mahoney <je...@suse.com> wrote:
>> On 6/28/17 6:02 PM, Sargun Dhillon wrote:
>>> On Wed, Jun 28, 2017 at 2:55 PM, Jeff Mahoney <je...@suse.com> wrote:
>>>> On 6/27/17 5:12 PM, Jeff Mahoney wrote:
>>>>> On 6/13/17 9:05 PM, Sargun Dhillon wrote:
>>>>>> On Thu, Jun 8, 2017 at 11:34 AM, Sargun Dhillon <sar...@sargun.me> wrote:
>>>>>>> I have a deadlock caught in the wild between two processes --
>>>>>>> btrfs-cleaner, and userspace process (Docker). Here, you can see both
>>>>>>> of the backtraces. btrfs-cleaner is trying to get a lock on
>>>>>>> ffff9859d360caf0, which is owned by Docker's pid. Docker on the other
>>>>>>> hand is trying to get a lock on ffff9859dc0f0578, which is owned by
>>>>>>> btrfs-cleaner's Pid.
>>>>>>>
>>>>>>> This is on vanilla 4.11.3 without much workload. The background
>>>>>>> workload was basically starting and stopping Docker with a medium
>>>>>>> sized image like ubuntu:latest with sleep 5. So, snapshot creation,
>>>>>>> destruction. And there's some stuff that's logging to btrfs.
>>>>>
>>>>> Hi Sargun -
>>>>>
>>>>> We hit this bug in testing last week. I have a patch that I've written
>>>>> up and have run under your reproducer for a while. So far it hasn't
>>>>> hit. I'll post it shortly and CC you. It does depend lightly on the
>>>>> rbtree code, though. Since we'll want this fix for -stable, I'll write
>>>>> up a version for that too.
>>>>
>>>> After thinking about it a bit more, I think my patch just happens to
>>>> make it less likely to hit but would ultimately degrade into a livelock
>>>> where it was a deadlock previously. I was just trylocking and
>>>> requeuing, so both threads are allowed to do other work and maybe even
>>>> finish but ultimately if there's a true deadlock it'll hit anyway.
>>>>
>>>> -Jeff
>>>>
>>> Does it make sense to spend the time on making it so that
>>> btrfs-cleaner has abortable operations, and the ability to abort if
>>> the root deletion either takes too long, or if it receives a signal?
>>> Although, such a case may result in a livelock, to me it seems like a
>>> lot less bad than deadlocking.
>>
>>
>> For now, reverting:
>>
>> commit fb235dc06fac9eaa4408ade9c8b20d45d63c89b7
>> Author: Qu Wenruo <quwen...@cn.fujitsu.com>
>> Date: Wed Feb 15 10:43:03 2017 +0800
>>
>> btrfs: qgroup: Move half of the qgroup accounting time out of commit
>> trans
>>
>> ... should do the trick.
>>
>> -Jeff
>>
> I thought it was this as well, but we still saw lock-ups even after
> reverting this change on 4.11. They were rarer, but we still saw
> issues with locked up btrfs-transactions. It may have been due to a
> different issue. If you want. I can try to revert this, and run a
> workload on it to see where the exact lock-up is?

Yeah, I'd be interested in those results.

-Jeff

-- 
Jeff Mahoney
SUSE Labs