Here is a fresh snapshot hang example:

Here kworker/u24:0 is blocked waiting on btrfs_tree_read_lock.
Again the call underneath is btrfs_search_slot, reached after going
through cache_block_group.
The other two tasks are blocked at the same place in cache_block_group, so I
can confirm it is hanging the same way as with umount.
Here is the struct extent_buffer passed to btrfs_tree_read_lock for
kworker/u24:0:
struct extent_buffer {
...
  lock_owner = 23286,
  write_locks = {
    counter = 1
  },
  read_locks = {
    counter = 0
  },
  blocking_writers = {
    counter = 1
  },
..
}
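
For context, here is roughly what btrfs_tree_read_lock() does with those
counters in 4.4-era fs/btrfs/locking.c (a paraphrased sketch from my reading
of the code, not a verbatim copy from the exact kernel running here):

/* Paraphrased sketch of btrfs_tree_read_lock(), 4.4-era locking.c. */
void btrfs_tree_read_lock(struct extent_buffer *eb)
{
again:
        read_lock(&eb->lock);
        if (atomic_read(&eb->blocking_writers) &&
            current->pid == eb->lock_owner) {
                /* The same task already holds the blocking write lock,
                 * so a nested read lock is allowed. */
                eb->lock_nested = 1;
                read_unlock(&eb->lock);
                return;
        }
        if (atomic_read(&eb->blocking_writers)) {
                /* Another task owns a blocking write lock - our case:
                 * blocking_writers == 1, lock_owner == 23286 - so the
                 * reader sleeps until the writer drops it. */
                read_unlock(&eb->lock);
                wait_event(eb->read_lock_wq,
                           atomic_read(&eb->blocking_writers) == 0);
                goto again;
        }
        atomic_inc(&eb->read_locks);
        atomic_inc(&eb->spinning_readers);
}

With blocking_writers = 1 and lock_owner = 23286 belonging to a different
task, kworker/u24:0 ends up sleeping in that wait_event(), which matches the
schedule() in its backtrace.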
So again the ioctl has taken the lock needed by its own underlying operation.
The btrfs process (pid 23286) is going to perform a write, so it can't skip
locking, and it needs not just read locking but write locking.
The fact that it owns the write lock is fine in itself: skip_locking in its
btrfs_path struct is zero, so up to this point everything is properly locked.
The problem arises later, if the free space cache isn't available yet and we
end up in cache_block_group. The wait at the beginning of that function looks
like this:
while (cache->cached == BTRFS_CACHE_FAST) {
                struct btrfs_caching_control *ctl;

                ctl = cache->caching_ctl;
                atomic_inc(&ctl->count);
                prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
                spin_unlock(&cache->lock);

>                schedule();

                finish_wait(&ctl->wait, &wait);
                put_caching_control(ctl);
                spin_lock(&cache->lock);
        }
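
The other two tasks in this dump (btrfs-transacti, pid 23285, and btrfs, pid
23286) are sleeping exactly in this loop. For reference, the wakeup they are
waiting for comes from the task that set BTRFS_CACHE_FAST and is doing the
actual cache load, roughly like this (again a paraphrased sketch of 4.4-era
cache_block_group(), so take the exact shape with a grain of salt):

        /* Further down in cache_block_group(): the task that won the
         * race sets BTRFS_CACHE_FAST and loads the free space cache. */
        cache->cached = BTRFS_CACHE_FAST;
        spin_unlock(&cache->lock);

        /* This is the part that walks the tree via btrfs_search_slot()
         * and can block on btrfs_tree_read_lock(). */
        ret = load_free_space_cache(fs_info, cache);

        spin_lock(&cache->lock);
        if (ret == 1)
                cache->cached = BTRFS_CACHE_FINISHED;
        else
                cache->cached = BTRFS_CACHE_NO; /* or _STARTED, depending
                                                   on load_cache_only */
        spin_unlock(&cache->lock);

        /* Only now does the state leave BTRFS_CACHE_FAST and the
         * waiters in the loop above get woken. */
        wake_up(&caching_ctl->wait);

So whoever sleeps in that loop depends on the cache-loading task getting
through btrfs_search_slot(), i.e. on tree read locks - which is exactly the
dependency Jeff described below.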
Even if a task does not hang on that initial wait (as with kworker/u24:0 in
the current case), it seems there is a high chance of hanging later in
btrfs_tree_read_lock, even with absolutely no concurrent work going on, via
the call chain:
cache_block_group
 ->load_free_space_cache
   ->lookup_free_space_inode
     ->__lookup_free_space_inode
       ->btrfs_search_slot
         ->btrfs_read_lock_root_node
           ->btrfs_tree_read_lock
since we are already holding the write lock.
So it looks like whenever cache_block_group is called from find_free_extent
underneath an outer btrfs_search_slot, we can deadlock as well if the free
space data isn't cached.
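
For reference, this is roughly the root-node locking decision in
btrfs_search_slot() that the nested search runs into; the free space cache
lookup uses a plain path with skip_locking == 0, so it takes the read-locked
branch (again paraphrased from 4.4-era ctree.c, not copied from the running
kernel):

        /* Paraphrased sketch from btrfs_search_slot(), non-commit-root
         * case, 4.4-era ctree.c. */
        if (p->skip_locking) {
                /* The caller claims it already holds the locks it needs. */
                b = btrfs_root_node(root);
        } else {
                /* The normal case - and the one the free space cache
                 * lookup hits - takes the tree read lock on the root
                 * node, i.e. goes through btrfs_tree_read_lock(). */
                b = btrfs_read_lock_root_node(root);
        }

So with skip_locking at zero, the inner search goes for read locks while the
outer search may still be holding write locks on the same tree, which is the
self-deadlock scenario described above.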

It looks scary. I hope that I've missed something.

Best regards,
Angel

On Thu, Aug 10, 2017 at 1:30 AM, Angel Shtilianov <ker...@kyup.com> wrote:
> Hi Jeff,
> That's exactly what I am observing in the backtraces.
> umount reaches cache_block_group passing through btrfs_search_slot,
> which looks like it got the write lock (to be confirmed tomorrow when I'm
> in the office), and all the other btrfs worker threads are fighting for the
> btrfs_tree_read_lock.
> Umount is hanging on the cache_block_group wait, all others are
> hanging on the read_lock. So caching seems to be the bottleneck,
> blocking all others. Good catch! I'll take a look at the commit
> tomorrow morning (with large cup of coffee of course).
>
> By the way, looking through the logs I can see one more pattern -
> umount after drop snapshot reaches a similar state. The common element is
> that drop snapshot will also call search slot, take the read lock, drop it
> and take the write lock. At the same time there is an umount causing
> fs_sync to run concurrently. I've seen at least three such hangs today
> and I've started copying the corresponding crashdumps to my
> machine. Tomorrow I'll take a look at them. Again the common element is
> that umount reaches the same place - cache_block_group - and
> hangs. Tomorrow I'll take a look at the crash dump for the workers'
> state.
>
> Thank you,
> Angel
>
> On Wed, Aug 9, 2017 at 6:42 PM, Jeff Mahoney <je...@suse.com> wrote:
>> On 8/8/17 7:30 AM, Angel Shtilianov wrote:
>>> crash> bt -f 31625
>>> PID: 31625  TASK: ffff88046a833400  CPU: 7   COMMAND: "btrfs-transacti"
>>> wants to acquire struct extent_buffer ffff88000460aca0 lock, whose
>>> lock_owner is 27574.
>>>
>>> here is pid 27574:
>>> PID: 27574  TASK: ffff88038b469a00  CPU: 4   COMMAND: "kworker/u32:9"
>>> which is also is trying to acquire eb lock ffff8802598b6200, and here
>>> the owner is 31696.
>>>
>>> 31696 is
>>> PID: 31696  TASK: ffff88044b59ce00  CPU: 5   COMMAND: "umount"
>>>
>>> So definitely here is a kind of deadlock.
>>> umount holds the lock the workers need in order to complete, and waits
>>> for them to complete.
>>> Lockdep wouldn't complain about that.
>>> I am still about to investigate what has previously triggered/disabled 
>>> lockdep.
>>> I have to obtain the log from the machine, but I need some time to get it.
>>>
>>> Jeff, you were right.
>>> Could you help demystifying how we ended up here?
>>
>> Hi Angel -
>>
>> It looks like a regression introduced by 291c7d2f5, but that's a very
>> old commit.  As that commit says, it's a rare occurrence to hit that
>> wait, and that's probably why we haven't seen this issue sooner.
>>
>> There's potential for this to happen whenever two threads are modifying
>> the tree at once and one needs to find a free extent.  I'll need to
>> think a bit on how to fix it.
>>
>> -Jeff
>>
>>> Best regards,
>>> Angel
>>>
>>> On Mon, Aug 7, 2017 at 9:10 PM, Jeff Mahoney <je...@suse.com> wrote:
>>>> On 8/7/17 1:19 PM, Jeff Mahoney wrote:
>>>>> On 8/7/17 10:12 AM, Angel Shtilianov wrote:
>>>>>> Hi there,
>>>>>> I'm investigating sporadic hanging during btrfs umount. The FS is
>>>>>> contained in a loop mounted file.
>>>>>> I have no reproduction scenario and the issue may happen once a day or
>>>>>> once a month. It is rare, but frustrating.
>>>>>> I have a crashdump (the server was manually crashed and a crashdump
>>>>>> collected), so I could take a look through the data structures.
>>>>>> What happens is that umount gets into D state and the kernel
>>>>>> complains about hung tasks. We are using kernel 4.4.y. The actual
>>>>>> backtrace is from 4.4.70, but this happens with all the 4.4 kernels
>>>>>> I've used (4.4.30 through 4.4.70).
>>>>>> Tasks like:
>>>>>> INFO: task kworker/u32:9:27574 blocked for more than 120 seconds.
>>>>>> INFO: task kworker/u32:12:27575 blocked for more than 120 seconds.
>>>>>> INFO: task btrfs-transacti:31625 blocked for more than 120 seconds.
>>>>>> are getting blocked waiting for btrfs_tree_read_lock, which is owned
>>>>>> by task umount:31696 (which is also blocked for more than 120 seconds),
>>>>>> according to the lock debugging info.
>>>>>>
>>>>>> umount is hung in "cache_block_group", see the '>' mark:
>>>>>>        while (cache->cached == BTRFS_CACHE_FAST) {
>>>>>>                 struct btrfs_caching_control *ctl;
>>>>>>
>>>>>>                 ctl = cache->caching_ctl;
>>>>>>                 atomic_inc(&ctl->count);
>>>>>>                 prepare_to_wait(&ctl->wait, &wait, TASK_UNINTERRUPTIBLE);
>>>>>>                 spin_unlock(&cache->lock);
>>>>>>
>>>>>>>                schedule();
>>>>>>
>>>>>>                 finish_wait(&ctl->wait, &wait);
>>>>>>                 put_caching_control(ctl);
>>>>>>                 spin_lock(&cache->lock);
>>>>>>         }
>>>>>>
>>>>>> The complete backtraces could be found in the attached log.
>>>>>>
>>>>>> Do you have any ideas ?
>>>>>
>>>>> Hi Angel -
>>>>>
>>>>> In your log, it says lockdep is disabled.  What tripped it earlier?
>>>>> Lockdep really should be catching locking deadlocks in situations like
>>>>> this, if that's really the underlying cause.
>>>>
>>>> Actually, I'm not sure if lockdep would catch this one.  Here's my
>>>> hypothesis:
>>>>
>>>> kworker/u32:9 is waiting on a read lock while reading the free space
>>>> cache, which means it owns the cache->cached value and will issue the
>>>> wakeup when it completes.
>>>>
>>>> umount is waiting for the wakeup from kworker/u32:9 but is holding
>>>> some tree locks in write mode.
>>>>
>>>> If kworker/u32:9 is waiting on the locks that umount holds, we have a
>>>> deadlock.
>>>>
>>>> Can you dump the extent buffer that kworker/u32:9 is waiting on?  Part
>>>> of that will contain the PID of the holder, and if it matches umount, we
>>>> found the cause.
>>>>
>>>> -Jeff
>>>>
>>>> --
>>>> Jeff Mahoney
>>>> SUSE Labs
>>>>
>>>
>>
>>
>> --
>> Jeff Mahoney
>> SUSE Labs
>>
[2003789.023934] sysrq: SysRq : Show Blocked State
[2003789.024159]   task                        PC stack   pid father
[2003789.024613] kworker/u24:0   D ffff88037d363470     0 17905      2 0x00000000
[2003789.025027] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[2003789.025378]  ffff88037d363470 ffff88026bbd1850 ffff88037d363438 ffffffff00000000
[2003789.025857]  ffff88046bf78000 ffff88000543b400 ffff88037d364000 ffff88026bbd1850
[2003789.026341]  ffff88000543b400 ffff88000543b400 ffff88037d3634b0 ffff88037d363488
[2003789.026817] Call Trace:
[2003789.026992]  [<ffffffff816b41ac>] schedule+0x3c/0x90
[2003789.027190]  [<ffffffffa0ac30b5>] btrfs_tree_read_lock+0xe5/0x140 [btrfs]
[2003789.027399]  [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.027594]  [<ffffffffa0a59584>] btrfs_read_lock_root_node+0x34/0x50 [btrfs]
[2003789.030864]  [<ffffffffa0a5e4db>] btrfs_search_slot+0x70b/0x9d0 [btrfs]
[2003789.031046]  [<ffffffff810812d2>] ? __might_sleep+0x52/0xb0
[2003789.031244]  [<ffffffffa0a7a3bf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[2003789.031439]  [<ffffffffa0a925e7>] btrfs_iget+0xd7/0x6a0 [btrfs]
[2003789.031629]  [<ffffffffa0acb1df>] __lookup_free_space_inode+0xdf/0x150 [btrfs]
[2003789.031925]  [<ffffffffa0accf0b>] lookup_free_space_inode+0x6b/0xe0 [btrfs]
[2003789.032109]  [<ffffffffa0ace4ca>] load_free_space_cache+0x7a/0x1b0 [btrfs]
[2003789.032300]  [<ffffffffa0a637b7>] cache_block_group+0x1b7/0x3d0 [btrfs]
[2003789.032474]  [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.032656]  [<ffffffffa0a72c9f>] find_free_extent+0x7bf/0xf40 [btrfs]
[2003789.032838]  [<ffffffffa0a73487>] btrfs_reserve_extent+0x67/0x150 [btrfs]
[2003789.033020]  [<ffffffffa0a73669>] btrfs_alloc_tree_block+0xf9/0x480 [btrfs]
[2003789.033210]  [<ffffffffa0a59f0c>] __btrfs_cow_block+0x12c/0x5a0 [btrfs]
[2003789.033399]  [<ffffffffa0a5a516>] btrfs_cow_block+0x116/0x2a0 [btrfs]
[2003789.033586]  [<ffffffffa0a5dfa6>] btrfs_search_slot+0x1d6/0x9d0 [btrfs]
[2003789.033774]  [<ffffffffa0a5ff71>] btrfs_insert_empty_items+0x71/0xc0 [btrfs]
[2003789.034075]  [<ffffffff811b69d3>] ? kmem_cache_alloc+0x203/0x2f0
[2003789.034267]  [<ffffffffa0a6001e>] btrfs_insert_item+0x5e/0xd0 [btrfs]
[2003789.034458]  [<ffffffffa0a6fa91>] btrfs_create_pending_block_groups+0x111/0x210 [btrfs]
[2003789.034764]  [<ffffffffa0a86814>] __btrfs_end_transaction+0x94/0x340 [btrfs]
[2003789.035070]  [<ffffffffa0a86ad0>] btrfs_end_transaction+0x10/0x20 [btrfs]
[2003789.035267]  [<ffffffffa0a70f20>] flush_space+0xf0/0x4b0 [btrfs]
[2003789.035460]  [<ffffffffa0a7140a>] ? btrfs_async_reclaim_metadata_space+0x12a/0x1c0 [btrfs]
[2003789.035767]  [<ffffffffa0a7140a>] ? btrfs_async_reclaim_metadata_space+0x12a/0x1c0 [btrfs]
[2003789.036072]  [<ffffffffa0a71470>] ? btrfs_async_reclaim_metadata_space+0x190/0x1c0 [btrfs]
[2003789.036384]  [<ffffffffa0a71402>] btrfs_async_reclaim_metadata_space+0x122/0x1c0 [btrfs]
[2003789.036680]  [<ffffffff81073766>] process_one_work+0x1e6/0x6b0
[2003789.036855]  [<ffffffff810736dd>] ? process_one_work+0x15d/0x6b0
[2003789.037028]  [<ffffffff81073c99>] worker_thread+0x69/0x480
[2003789.037214]  [<ffffffff81073c30>] ? process_one_work+0x6b0/0x6b0
[2003789.037396]  [<ffffffff8107a37d>] kthread+0x11d/0x140
[2003789.037573]  [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.037751]  [<ffffffff816ba8ff>] ret_from_fork+0x3f/0x70
[2003789.037925]  [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.038117] btrfs-transacti D ffff88039dcfb6a8     0 23285      2 0x00000000
[2003789.038513]  ffff88039dcfb6a8 ffff88046a26d4b0 ffff88039dcfb670 0000000000000000
[2003789.038991]  ffff88046bf59a00 ffff880003ac3400 ffff88039dcfc000 ffff880312a1c040
[2003789.039476]  ffff880126a6ba00 ffff88046a26d4b0 ffff88046a26d400 ffff88039dcfb6c0
[2003789.039953] Call Trace:
[2003789.040124]  [<ffffffff816b41ac>] schedule+0x3c/0x90
[2003789.040316]  [<ffffffffa0a63736>] cache_block_group+0x136/0x3d0 [btrfs]
[2003789.040495]  [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.040681]  [<ffffffffa0a72c9f>] find_free_extent+0x7bf/0xf40 [btrfs]
[2003789.040868]  [<ffffffffa0a73487>] btrfs_reserve_extent+0x67/0x150 [btrfs]
[2003789.041055]  [<ffffffffa0a73669>] btrfs_alloc_tree_block+0xf9/0x480 [btrfs]
[2003789.041252]  [<ffffffffa0a59f0c>] __btrfs_cow_block+0x12c/0x5a0 [btrfs]
[2003789.041441]  [<ffffffffa0a5a516>] btrfs_cow_block+0x116/0x2a0 [btrfs]
[2003789.041627]  [<ffffffffa0a5dfa6>] btrfs_search_slot+0x1d6/0x9d0 [btrfs]
[2003789.041813]  [<ffffffffa0a65175>] lookup_inline_extent_backref+0xc5/0x5b0 [btrfs]
[2003789.042107]  [<ffffffff810812d2>] ? __might_sleep+0x52/0xb0
[2003789.042301]  [<ffffffffa0a5916a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
[2003789.042492]  [<ffffffffa0a67701>] __btrfs_free_extent.isra.67+0xf1/0xcb0 [btrfs]
[2003789.042800]  [<ffffffffa0ad385d>] ? btrfs_merge_delayed_refs+0x8d/0x630 [btrfs]
[2003789.043102]  [<ffffffffa0a6b982>] ? __btrfs_run_delayed_refs+0x192/0x1630 [btrfs]
[2003789.043416]  [<ffffffffa0a6c0c0>] __btrfs_run_delayed_refs+0x8d0/0x1630 [btrfs]
[2003789.043715]  [<ffffffff810d4ea4>] ? lock_timer_base.isra.35+0x54/0x70
[2003789.043906]  [<ffffffffa0a842eb>] ? join_transaction.isra.13+0x2b/0x440 [btrfs]
[2003789.044218]  [<ffffffffa0a6fc12>] btrfs_run_delayed_refs+0x82/0x2c0 [btrfs]
[2003789.044412]  [<ffffffffa0a855f3>] btrfs_commit_transaction+0x43/0xb10 [btrfs]
[2003789.044719]  [<ffffffffa0a809ac>] transaction_kthread+0x22c/0x290 [btrfs]
[2003789.044907]  [<ffffffffa0a80780>] ? btrfs_cleanup_transaction+0x650/0x650 [btrfs]
[2003789.045210]  [<ffffffff8107a37d>] kthread+0x11d/0x140
[2003789.045385]  [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.045560]  [<ffffffff816ba8ff>] ret_from_fork+0x3f/0x70
[2003789.045735]  [<ffffffff8107a260>] ? kthread_create_on_node+0x220/0x220
[2003789.045910] btrfs           D ffff88026b22b880     0 23286  23202 0x00000000
[2003789.046291]  ffff88026b22b880 ffff88046a26d4b0 ffff88026b22b848 0000000000000000
[2003789.046767]  ffff88046bf5b400 ffff8800378e3400 ffff88026b22c000 ffff880312a1c040
[2003789.047241]  ffff88046572b400 ffff88046a26d4b0 ffff88046a26d400 ffff88026b22b898
[2003789.047720] Call Trace:
[2003789.047890]  [<ffffffff816b41ac>] schedule+0x3c/0x90
[2003789.048076]  [<ffffffffa0a63736>] cache_block_group+0x136/0x3d0 [btrfs]
[2003789.048260]  [<ffffffff810a3390>] ? wait_woken+0xb0/0xb0
[2003789.048449]  [<ffffffffa0a72c9f>] find_free_extent+0x7bf/0xf40 [btrfs]
[2003789.048638]  [<ffffffffa0a73487>] btrfs_reserve_extent+0x67/0x150 [btrfs]
[2003789.048827]  [<ffffffffa0a73669>] btrfs_alloc_tree_block+0xf9/0x480 [btrfs]
[2003789.049015]  [<ffffffffa0a59f0c>] __btrfs_cow_block+0x12c/0x5a0 [btrfs]
[2003789.049211]  [<ffffffffa0a5a516>] btrfs_cow_block+0x116/0x2a0 [btrfs]
[2003789.049414]  [<ffffffffa0a5dfa6>] btrfs_search_slot+0x1d6/0x9d0 [btrfs]
[2003789.049601]  [<ffffffffa0a5ff71>] btrfs_insert_empty_items+0x71/0xc0 [btrfs]
[2003789.049909]  [<ffffffffa0ac3758>] btrfs_insert_orphan_item+0x58/0x80 [btrfs]
[2003789.050227]  [<ffffffffa0abae48>] btrfs_ioctl_snap_destroy+0x4b8/0x7e0 [btrfs]
[2003789.050542]  [<ffffffffa0ac0e89>] btrfs_ioctl+0x899/0x27e0 [btrfs]
[2003789.050720]  [<ffffffff816b9ba7>] ? _raw_spin_unlock+0x27/0x40
[2003789.050897]  [<ffffffff8119068d>] ? handle_mm_fault+0xdbd/0x1b80
[2003789.051073]  [<ffffffff8118f926>] ? handle_mm_fault+0x56/0x1b80
[2003789.051257]  [<ffffffff811ea6ef>] do_vfs_ioctl+0x30f/0x580
[2003789.051434]  [<ffffffff810aa26f>] ? up_read+0x1f/0x40
[2003789.051605]  [<ffffffff811f6796>] ? __fget_light+0x66/0x90
[2003789.051775]  [<ffffffff811ea9d9>] SyS_ioctl+0x79/0x90
[2003789.051948]  [<ffffffff816ba55b>] entry_SYSCALL_64_fastpath+0x16/0x73
[2003848.780864] sysrq: SysRq : Trigger a crash
[2003848.781093] BUG: unable to handle kernel NULL pointer dereference at           (null)
