On Wed, Jun 18, 2014 at 8:41 PM, Marc Dionne <marc.c.dio...@gmail.com> wrote:
> On Wed, Jun 18, 2014 at 8:08 PM, Waiman Long <waiman.l...@hp.com> wrote:
>> On 06/18/2014 08:03 PM, Marc Dionne wrote:
>>>
>>> On Wed, Jun 18, 2014 at 7:53 PM, Chris Mason<c...@fb.com>  wrote:
>>>>
>>>> On 06/18/2014 07:30 PM, Waiman Long wrote:
>>>>>
>>>>> On 06/18/2014 07:27 PM, Chris Mason wrote:
>>>>>>
>>>>>> On 06/18/2014 07:19 PM, Waiman Long wrote:
>>>>>>>
>>>>>>> On 06/18/2014 07:10 PM, Josef Bacik wrote:
>>>>>>>>
>>>>>>>> On 06/18/2014 03:47 PM, Waiman Long wrote:
>>>>>>>>>
>>>>>>>>> On 06/18/2014 06:27 PM, Josef Bacik wrote:
>>>>>>>>>>
>>>>>>>>>> On 06/18/2014 03:17 PM, Waiman Long wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 06/18/2014 04:57 PM, Marc Dionne wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I've been seeing very reproducible soft lockups with 3.16-rc1,
>>>>>>>>>>>> similar to what is reported here:
>>>>>>>>>>>>
>>>>>>>>>>>> http://marc.info/?l=linux-btrfs&m=140290088532203&w=2
>>>>>>>>>>>>
>>>>>>>>>>>> along with the occasional hard lockup, making it impossible to
>>>>>>>>>>>> complete a parallel build on a btrfs filesystem for the package
>>>>>>>>>>>> I work on.  This was working fine just a few days before rc1.
>>>>>>>>>>>>
>>>>>>>>>>>> Bisecting brought me to the following commit:
>>>>>>>>>>>>
>>>>>>>>>>>>      commit bd01ec1a13f9a327950c8e3080096446c7804753
>>>>>>>>>>>>      Author: Waiman Long<waiman.l...@hp.com>
>>>>>>>>>>>>      Date:   Mon Feb 3 13:18:57 2014 +0100
>>>>>>>>>>>>
>>>>>>>>>>>>          x86, locking/rwlocks: Enable qrwlocks on x86
>>>>>>>>>>>>
>>>>>>>>>>>> And sure enough, if I revert that commit on top of current
>>>>>>>>>>>> mainline, I'm unable to reproduce the soft lockups and hangs.
>>>>>>>>>>>>
>>>>>>>>>>>> Marc
>>>>>>>>>>>
>>>>>>>>>>> The queue rwlock is fair. As a result, recursive read_lock is not
>>>>>>>>>>> allowed unless the task is in an interrupt context. Doing a
>>>>>>>>>>> recursive read_lock will hang the process when a write_lock
>>>>>>>>>>> happens somewhere in between. Are recursive read_locks being done
>>>>>>>>>>> in the btrfs code?
>>>>>>>>>>>
>>>>>>>>>> We walk down a tree and read lock each node as we walk down, is
>>>>>>>>>> that what you mean?  Or do you mean read_lock multiple times on the
>>>>>>>>>> same lock in the same process, because we definitely don't do that.
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Josef
>>>>>>>>>
>>>>>>>>> I meant recursively read_lock the same lock in a process.
>>>>>>>>
>>>>>>>> I take it back, we do actually do this in some cases.  Thanks,
>>>>>>>>
>>>>>>>> Josef
>>>>>>>
>>>>>>> This is what I thought when I looked at the locking code in btrfs. The
>>>>>>> unlock code doesn't clear the lock_owner pid, which may cause
>>>>>>> lock_nested to be set incorrectly.
>>>>>>>
>>>>>>> Anyway, are you going to do something about it?
>>>>>>
>>>>>> Thanks for reporting this, we shouldn't actually be taking the lock
>>>>>> recursively.  Could you please try with lockdep enabled?  If the
>>>>>> problem goes away with lockdep on, I think I know what's causing it.
>>>>>> Otherwise, lockdep should clue us in.
>>>>>>
>>>>>> -chris
>>>>>
>>>>> I am not sure if lockdep will report a recursive read_lock, as this
>>>>> was allowed in the past. If not, we certainly need to add that
>>>>> capability to it.
>>>>>
>>>>> One more thing, I saw a comment in the btrfs tree locking code about
>>>>> taking a read lock after taking a write (partial?) lock. That is not
>>>>> possible even with the old rwlock code.
>>>>
>>>> With lockdep on, the clear_path_blocking function you're hitting
>>>> softlockups in is different.  Fujitsu hit a similar problem during
>>>> quota rescans, and it goes away with lockdep on.  I'm trying to nail
>>>> down where we went wrong, but please try lockdep on.
>>>>
>>>> -chris
>>>
>>> With lockdep on I'm unable to reproduce the lockups, and there are no
>>> lockdep warnings.
>>>
>>> Marc
>>
>>
>> Enabling lockdep may change the lock timing enough to make the problem
>> hard to reproduce. Anyway, could you try applying the following patch to
>> see if it shows any warning?
>>
>> -Longman
>>
>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>> index d24e433..b6c9f2e 100644
>> --- a/kernel/locking/lockdep.c
>> +++ b/kernel/locking/lockdep.c
>> @@ -1766,12 +1766,22 @@ check_deadlock(struct task_struct *curr, struct held_loc
>>                 if (hlock_class(prev) != hlock_class(next))
>>                         continue;
>>
>> +#ifdef CONFIG_QUEUE_RWLOCK
>> +               /*
>> +                * Queue rwlock only allows read-after-read recursion of the
>> +                * same lock class when the latter read is in an interrupt
>> +                * context.
>> +                */
>> +               if ((read == 2) && prev->read && in_interrupt())
>> +                       return 2;
>> +#else
>>                 /*
>>                  * Allow read-after-read recursion of the same
>>                  * lock class (i.e. read_lock(lock)+read_lock(lock)):
>>                  */
>>                 if ((read == 2) && prev->read)
>>                         return 2;
>> +#endif
>>
>>                 /*
>>                  * We're holding the nest_lock, which serializes this lock's
>> @@ -1852,8 +1862,10 @@ check_prev_add(struct task_struct *curr, struct held_lock
>>          * write-lock never takes any other locks, then the reads are
>>          * equivalent to a NOP.
>>          */
>> +#ifndef CONFIG_QUEUE_RWLOCK
>>         if (next->read == 2 || prev->read == 2)
>>                 return 1;
>> +#endif
>>         /*
>>          * Is the <prev> -> <next> dependency already present?
>>          *
>
> I still don't see any warnings with this patch added.  Also tried
> along with removing a couple of ifdefs on CONFIG_DEBUG_LOCK_ALLOC in
> btrfs/ctree.c - still unable to generate any warnings or lockups.
>
> Marc

And for an additional data point, just removing those
CONFIG_DEBUG_LOCK_ALLOC ifdefs looks like it's sufficient to prevent
the symptoms when lockdep is not enabled.

Marc
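
For reference, the hang described earlier in the thread (a nested read_lock
on a fair rwlock with a write_lock arriving in between) can be modelled in
user space. This is only a rough sketch under simplifying assumptions: it is
not the kernel's qrwlock code, and toy_rwlock, toy_read_lock, toy_write_lock
and nested_read_hangs are hypothetical names made up for illustration. The
helpers never block; they just report whether the caller would have to wait.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not the kernel's qrwlock code. */
struct toy_rwlock {
	int  readers;         /* tasks currently holding the lock for read */
	bool writer_held;     /* a writer currently owns the lock */
	int  writers_queued;  /* writers waiting for the readers to drain */
	bool fair;            /* true: queued writers hold back new readers */
};

/* Returns true if the read lock is granted, false if the caller would wait. */
static bool toy_read_lock(struct toy_rwlock *l, bool in_irq)
{
	if (l->writer_held)
		return false;
	/* Fair lock: process-context readers queue behind a waiting writer. */
	if (l->fair && l->writers_queued > 0 && !in_irq)
		return false;
	l->readers++;
	return true;
}

/* Returns true if the write lock is granted, false if the writer queues. */
static bool toy_write_lock(struct toy_rwlock *l)
{
	if (l->writer_held || l->readers > 0) {
		l->writers_queued++;
		return false;
	}
	l->writer_held = true;
	return true;
}

/*
 * Task A read-locks, task B asks for the write lock, then task A
 * read-locks the same lock again. Returns true if A would hang.
 */
static bool nested_read_hangs(bool fair, bool in_irq)
{
	struct toy_rwlock l = { .fair = fair };
	bool granted;

	granted = toy_read_lock(&l, false);   /* A: outer read_lock */
	assert(granted);
	granted = toy_write_lock(&l);         /* B: waits for A to unlock */
	assert(!granted);

	/* A: nested read_lock; if refused, A waits on B and B waits on A. */
	return !toy_read_lock(&l, in_irq);
}
```

In this model nested_read_hangs(true, false) is true (fair lock, process
context), while nested_read_hangs(false, false) and
nested_read_hangs(true, true) are false, which corresponds to the old unfair
rwlock and to the interrupt-context exception respectively.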