Re: Circular locking dependency with pblk

2017-10-06 Thread Javier González
> On 6 Oct 2017, at 01.36, Dave Chinner wrote:
> 
> On Thu, Oct 05, 2017 at 12:53:50PM +0200, Javier González wrote:
>> Hi,
>> 
>> lockdep is reporting a circular dependency when using XFS and pblk,
>> which I am a bit confused about.
>> 
>> This happens when XFS sends a number of nested reads and (at least) one
>> of them partially hits pblk's cache. In this case, pblk will retrieve
>> the cached LBAs and form a new bio, which is submitted _synchronously_
>> to the media using struct completion. The original bio is then populated
>> with the read data.
>> 
>> What lockdep complains about is that the unlocking operation in
>> complete() has a circular dependency with inode->i_rwsem when they both
>> happen on the same core, which is different from the core that issued
>> wait_for_completion_io_timeout() and is waiting for the partial read.
>> However, AFAIU complete() happens in interrupt context, so this should
>> not be a problem.
>> 
>> The only thing I can think of is that XFS is making an assumption when
>> submitting direct I/O that there will be no synchronous I/O coming from
>> anybody underneath.
>> 
>> Any thoughts?
> 
> The new lockdep completion tracking that was included in 4.14 can
> throw false positives because it doesn't (yet) have the capability
> to understand the multiply layered completion contexts we have in
> the storage stack.
> 
> Peter, here's another for you to ponder
> 
>> TRACE:
>> [ 8557.938336] ==
>> [ 8557.944513] WARNING: possible circular locking dependency detected
>> [ 8557.950688] 4.14.0-rc2--509811f1f2df #726 Not tainted
>> [ 8557.955739] --
>> [ 8557.961919] rocksdb:bg11/19129 is trying to acquire lock:
>> [ 8557.967319]  (>i_rwsem){}, at: [] 
>> xfs_ilock+0x1a1/0x230
>> [ 8557.975063]
>>   but now in release context of a crosslock acquired at the 
>> following:
>> [ 8557.983921]  ((complete)#5){+.+.}, at: [] 
>> pblk_fill_partial_read_bio+0x22a/0x520
>> [ 8557.993318]
>>   which lock already depends on the new lock.
>> 
>> [ 8558.001490]
>>   the existing dependency chain (in reverse order) is:
>> [ 8558.008966]
>>   -> #1 ((complete)#5){+.+.}:
>> [ 8558.014715]__lock_acquire+0x1186/0x11d0
>> [ 8558.019245]lock_acquire+0xed/0x1e0
>> [ 8558.023337]wait_for_completion_io_timeout+0x65/0x180
>> [ 8558.028989]pblk_fill_partial_read_bio+0x22a/0x520
>> [ 8558.034375]pblk_submit_read+0x158/0x3a0
>> [ 8558.038900]pblk_make_rq+0xc1/0x110
>> [ 8558.042992]generic_make_request+0x11d/0x3c0
>> [ 8558.047862]submit_bio+0x73/0x140
>> [ 8558.051781]iomap_dio_actor+0x1f8/0x3a0
>> [ 8558.056216]iomap_apply+0xa1/0x110
>> [ 8558.060220]iomap_dio_rw+0x26f/0x3c0
>> [ 8558.064398]xfs_file_dio_aio_read+0x88/0x1c0
>> [ 8558.069269]xfs_file_read_iter+0xab/0xc0
>> [ 8558.073794]__vfs_read+0xef/0x160
>> [ 8558.077709]vfs_read+0xa8/0x150
>> [ 8558.081454]SyS_pread64+0x98/0xc0
>> [ 8558.085372]entry_SYSCALL_64_fastpath+0x23/0xc2
> 
> This is completely normal, blocked on a bio completion with a
> filesystem lock held. More commonly we see this sort of thing
> through submit_bio_wait().

Yes, we use the same construct in pblk.
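
For reference, this is roughly the shape of that construct on the partial-read
path (a simplified sketch following the symbols in the trace; the wrapper name,
field layout and error handling are illustrative, not the exact pblk code):

static void pblk_end_io_sync(struct nvm_rq *rqd)
{
        struct completion *waiting = rqd->private;

        /* May run from the device interrupt, or - as discussed further
         * down the thread - inline from the submission path. */
        complete(waiting);
}

/* Illustrative wrapper; in pblk this logic sits in pblk_fill_partial_read_bio() */
static int pblk_read_cache_miss_sync(struct pblk *pblk, struct nvm_rq *rqd)
{
        DECLARE_COMPLETION_ONSTACK(wait);

        rqd->end_io = pblk_end_io_sync;
        rqd->private = &wait;

        /* Read the sectors that missed the write cache... */
        if (pblk_submit_io(pblk, rqd))
                return -EIO;

        /* ...and block until the device completes them.  The caller (the XFS
         * direct I/O path) is still holding inode->i_rwsem at this point. */
        if (!wait_for_completion_io_timeout(&wait,
                                msecs_to_jiffies(PBLK_COMMAND_TIMEOUT_MS)))
                return -ETIME;

        return 0;
}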

> 
>> [ 8558.125636]  Possible unsafe locking scenario by crosslock:
>> 
>> [ 8558.132672]CPU0CPU1
>> [ 8558.137196]
>> [ 8558.141720]   lock(>i_rwsem);
>> [ 8558.145291]   lock((complete)#5);
>> [ 8558.149035]lock(>i_rwsem);
>> [ 8558.155118]unlock((complete)#5);
>> [ 8558.161550]
>>*** DEADLOCK ***
>> 
>> [ 8558.167461] 4 locks held by rocksdb:bg11/19129:
>> [ 8558.171984]  #0:  (>i_rwsem){}, at: [] 
>> xfs_ilock+0x1a1/0x230
>> [ 8558.180148]  #1:  (rcu_read_lock){}, at: [] 
>> __blk_mq_run_hw_queue+0xec/0x1d0
>> [ 8558.189091]  #2:  (&(>q_lock)->rlock){-.-.}, at: 
>> [] nvme_queue_rq+0x14c/0x9a0
>> [ 8558.198384]  #3:  (>wait#17){-.-.}, at: [] 
>> complete+0x1d/0x60
>> [ 8558.205948]
>>   stack backtrace:
>> [ 8558.210300] CPU: 6 PID: 19129 Comm: rocksdb:bg11 Not tainted 
>> 4.14.0-rc2--509811f1f2df #726
>> [ 8558.218550] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0a 
>> 06/23/2016
>> [ 8558.225675] Call Trace:
>> [ 8558.228123]  dump_stack+0x85/0xc7
>> [ 8558.231438]  print_circular_bug+0x226/0x330
>> [ 8558.235616]  ? print_bfs_bug+0x40/0x40
>> [ 8558.239358]  check_prev_add+0x432/0x840
>> [ 8558.243190]  ? __lock_acquire+0x736/0x11d0
>> [ 8558.247282]  lock_commit_crosslock+0x3d3/0x5d0
>> [ 8558.251718]  ? lock_commit_crosslock+0x3d3/0x5d0
>> [ 8558.256328]  complete+0x29/0x60
>> [ 8558.259469]  pblk_end_io_sync+0x12/0x20
>> [ 8558.263297]  nvm_end_io+0x2b/0x30

Re: Circular locking dependency with pblk

2017-10-05 Thread Dave Chinner
On Thu, Oct 05, 2017 at 12:53:50PM +0200, Javier González wrote:
> Hi,
> 
> lockdep is reporting a circular dependency when using XFS and pblk,
> which I am a bit confused about.
> 
> This happens when XFS sends a number of nested reads and (at least) one
> of them partially hits pblk's cache. In this case, pblk will retrieve
> the cached LBAs and form a new bio, which is submitted _synchronously_
> to the media using struct completion. The original bio is then populated
> with the read data.
> 
> What lockdep complains about is that the unlocking operation in
> complete() has a circular dependency with inode->i_rwsem when they both
> happen on the same core, which is different from the core that issued
> wait_for_completion_io_timeout() and is waiting for the partial read.
> However, AFAIU complete() happens in interrupt context, so this should
> not be a problem.
> 
> The only thing I can think of is that XFS is making an assumption when
> submitting direct I/O that there will be no synchronous I/O coming from
> anybody underneath.
> 
> Any thoughts?

The new lockdep completion tracking that was included in 4.14 can
throw false positives because it doesn't (yet) have the capability
to understand the multiply layered completion contexts we have in
the storage stack.
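
(Background, for anyone not following the crossrelease work: the new tracking
models a completion as a lock-like object that the waiter "acquires" and the
eventual complete() caller "releases". Conceptually - this is a sketch of the
pattern lockdep records here, not code from XFS or pblk:

        DECLARE_COMPLETION_ONSTACK(wait);

        down_read(&inode->i_rwsem);               /* lock(&inode->i_rwsem)          */
        ...
        wait_for_completion_io_timeout(&wait, t); /* "acquires" (complete)#5, so an
                                                     i_rwsem -> completion
                                                     dependency is recorded         */
        ...
        up_read(&inode->i_rwsem);

        /* later, in whatever context ends the I/O: */
        complete(&wait);                          /* "releases" (complete)#5; lockdep
                                                     checks the locks held in *this*
                                                     context against that chain     */

so whichever context ends up calling complete() gets checked against locks held
by waiters, and a deep storage stack can trip this without a real deadlock.)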

Peter, here's another for you to ponder

> TRACE:
> [ 8557.938336] ==
> [ 8557.944513] WARNING: possible circular locking dependency detected
> [ 8557.950688] 4.14.0-rc2--509811f1f2df #726 Not tainted
> [ 8557.955739] --
> [ 8557.961919] rocksdb:bg11/19129 is trying to acquire lock:
> [ 8557.967319]  (>i_rwsem){}, at: [] 
> xfs_ilock+0x1a1/0x230
> [ 8557.975063]
>but now in release context of a crosslock acquired at the 
> following:
> [ 8557.983921]  ((complete)#5){+.+.}, at: [] 
> pblk_fill_partial_read_bio+0x22a/0x520
> [ 8557.993318]
>which lock already depends on the new lock.
> 
> [ 8558.001490]
>the existing dependency chain (in reverse order) is:
> [ 8558.008966]
>-> #1 ((complete)#5){+.+.}:
> [ 8558.014715]__lock_acquire+0x1186/0x11d0
> [ 8558.019245]lock_acquire+0xed/0x1e0
> [ 8558.023337]wait_for_completion_io_timeout+0x65/0x180
> [ 8558.028989]pblk_fill_partial_read_bio+0x22a/0x520
> [ 8558.034375]pblk_submit_read+0x158/0x3a0
> [ 8558.038900]pblk_make_rq+0xc1/0x110
> [ 8558.042992]generic_make_request+0x11d/0x3c0
> [ 8558.047862]submit_bio+0x73/0x140
> [ 8558.051781]iomap_dio_actor+0x1f8/0x3a0
> [ 8558.056216]iomap_apply+0xa1/0x110
> [ 8558.060220]iomap_dio_rw+0x26f/0x3c0
> [ 8558.064398]xfs_file_dio_aio_read+0x88/0x1c0
> [ 8558.069269]xfs_file_read_iter+0xab/0xc0
> [ 8558.073794]__vfs_read+0xef/0x160
> [ 8558.077709]vfs_read+0xa8/0x150
> [ 8558.081454]SyS_pread64+0x98/0xc0
> [ 8558.085372]entry_SYSCALL_64_fastpath+0x23/0xc2

This is completely normal, blocked on a bio completion with a
filesystem lock held. More commonly we see this sort of thing
through submit_bio_wait().
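
(submit_bio_wait() is the same completion dance, roughly - paraphrased from
block/bio.c rather than quoted verbatim:

static void submit_bio_wait_endio(struct bio *bio)
{
        complete(bio->bi_private);      /* wake up the submitter */
}

int submit_bio_wait(struct bio *bio)
{
        DECLARE_COMPLETION_ONSTACK(done);

        bio->bi_private = &done;
        bio->bi_end_io = submit_bio_wait_endio;
        submit_bio(bio);
        wait_for_completion_io(&done);  /* sleeps with whatever filesystem
                                           locks the caller is holding */

        return blk_status_to_errno(bio->bi_status);
}

so completions waited on under filesystem locks are all over the place.)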

> [ 8558.125636]  Possible unsafe locking scenario by crosslock:
> 
> [ 8558.132672]CPU0CPU1
> [ 8558.137196]
> [ 8558.141720]   lock(>i_rwsem);
> [ 8558.145291]   lock((complete)#5);
> [ 8558.149035]lock(>i_rwsem);
> [ 8558.155118]unlock((complete)#5);
> [ 8558.161550]
> *** DEADLOCK ***
> 
> [ 8558.167461] 4 locks held by rocksdb:bg11/19129:
> [ 8558.171984]  #0:  (>i_rwsem){}, at: [] 
> xfs_ilock+0x1a1/0x230
> [ 8558.180148]  #1:  (rcu_read_lock){}, at: [] 
> __blk_mq_run_hw_queue+0xec/0x1d0
> [ 8558.189091]  #2:  (&(>q_lock)->rlock){-.-.}, at: 
> [] nvme_queue_rq+0x14c/0x9a0
> [ 8558.198384]  #3:  (>wait#17){-.-.}, at: [] 
> complete+0x1d/0x60
> [ 8558.205948]
>stack backtrace:
> [ 8558.210300] CPU: 6 PID: 19129 Comm: rocksdb:bg11 Not tainted 
> 4.14.0-rc2--509811f1f2df #726
> [ 8558.218550] Hardware name: Supermicro X10SRA-F/X10SRA-F, BIOS 2.0a 
> 06/23/2016
> [ 8558.225675] Call Trace:
> [ 8558.228123]  dump_stack+0x85/0xc7
> [ 8558.231438]  print_circular_bug+0x226/0x330
> [ 8558.235616]  ? print_bfs_bug+0x40/0x40
> [ 8558.239358]  check_prev_add+0x432/0x840
> [ 8558.243190]  ? __lock_acquire+0x736/0x11d0
> [ 8558.247282]  lock_commit_crosslock+0x3d3/0x5d0
> [ 8558.251718]  ? lock_commit_crosslock+0x3d3/0x5d0
> [ 8558.256328]  complete+0x29/0x60
> [ 8558.259469]  pblk_end_io_sync+0x12/0x20
> [ 8558.263297]  nvm_end_io+0x2b/0x30
> [ 8558.266607]  nvme_nvm_end_io+0x2e/0x50

And this is where lockdep gets confused. Up until this point, it's
only ever seen the bio completion come from /outside/ the XFS
i_rwsem context (i.e. async completion from interrupts). With a
synchronous 

Re: Circular locking dependency with pblk

2017-10-05 Thread Javier González
> On 5 Oct 2017, at 18.24, Jens Axboe wrote:
> 
> On 10/05/2017 04:53 AM, Javier González wrote:
>> Hi,
>> 
>> lockdep is reporting a circular dependency when using XFS and pblk,
>> which I am a bit confused about.
>> 
>> This happens when XFS sends a number of nested reads and (at least) one
>> of them partially hits pblk's cache. In this case, pblk will retrieve
>> the cached LBAs and form a new bio, which is submitted _synchronously_
>> to the media using struct completion. The original bio is then populated
>> with the read data.
>> 
>> What lockdep complains about is that the unlocking operation in
>> complete() has a circular dependency with inode->i_rwsem when they both
>> happen on the same core, which is different from the core that issued
>> wait_for_completion_io_timeout() and is waiting for the partial read.
>> However, AFAIU complete() happens in interrupt context, so this should
>> not be a problem.
> 
>> But the very trace you are posting shows the completion being done
> inline, since we catch it at submission time:
> 
>> [ 8558.256328]  complete+0x29/0x60
>> [ 8558.259469]  pblk_end_io_sync+0x12/0x20
>> [ 8558.263297]  nvm_end_io+0x2b/0x30
>> [ 8558.266607]  nvme_nvm_end_io+0x2e/0x50
>> [ 8558.270351]  blk_mq_end_request+0x3e/0x70
>> [ 8558.274360]  nvme_complete_rq+0x1c/0xd0
>> [ 8558.278194]  nvme_pci_complete_rq+0x7b/0x130
>> [ 8558.282459]  __blk_mq_complete_request+0xa3/0x160
>> [ 8558.287156]  blk_mq_complete_request+0x16/0x20
>> [ 8558.291592]  nvme_process_cq+0xf8/0x1e0
>> [ 8558.295424]  nvme_queue_rq+0x16e/0x9a0
>> [ 8558.299172]  blk_mq_dispatch_rq_list+0x19e/0x330
>> [ 8558.303787]  ? blk_mq_flush_busy_ctxs+0x91/0x130
>> [ 8558.308400]  blk_mq_sched_dispatch_requests+0x19d/0x1d0
>> [ 8558.313617]  __blk_mq_run_hw_queue+0x12e/0x1d0
>> [ 8558.318053]  __blk_mq_delay_run_hw_queue+0xb9/0xd0
>> [ 8558.322837]  blk_mq_run_hw_queue+0x14/0x20
>> [ 8558.326928]  blk_mq_sched_insert_request+0xa4/0x180
>> [ 8558.331797]  blk_execute_rq_nowait+0x72/0xf0
>> [ 8558.336061]  nvme_nvm_submit_io+0xd9/0x130
>> [ 8558.340151]  nvm_submit_io+0x3c/0x70
>> [ 8558.343723]  pblk_submit_io+0x1b/0x20
>> [ 8558.347379]  pblk_submit_read+0x1ec/0x3a0
> 
> [snip]
> 
> This happens since we call nvme_process_cq() after submitting IO,
> just in case there's something we can complete.
> 

Hmm. It's still interesting that the FS is allowed to take the
rw_semaphore before we get to fully complete the read bio in pblk. I'll
look into it tomorrow.

Also, is it normal that we switch cores when calling nvme_process_cq()
on the submission path?

Javier


Re: Circular locking dependency with pblk

2017-10-05 Thread Jens Axboe
On 10/05/2017 04:53 AM, Javier González wrote:
> Hi,
> 
> lockdep is reporting a circular dependency when using XFS and pblk,
> which I am a bit confused about.
> 
> This happens when XFS sends a number of nested reads and (at least) one
> of them partially hits pblk's cache. In this case, pblk will retrieve
> the cached LBAs and form a new bio, which is submitted _synchronously_
> to the media using struct completion. The original bio is then populated
> with the read data.
> 
> What lockdep complains about is that the unlocking operation in
> complete() has a circular dependency with inode->i_rwsem when they both
> happen on the same core, which is different from the core that issued
> wait_for_completion_io_timeout() and is waiting for the partial read.
> However, AFAIU complete() happens in interrupt context, so this should
> not be a problem.

But the very trace you are posting shows the completion being done
inline, since we catch it at submission time:

> [ 8558.256328]  complete+0x29/0x60
> [ 8558.259469]  pblk_end_io_sync+0x12/0x20
> [ 8558.263297]  nvm_end_io+0x2b/0x30
> [ 8558.266607]  nvme_nvm_end_io+0x2e/0x50
> [ 8558.270351]  blk_mq_end_request+0x3e/0x70
> [ 8558.274360]  nvme_complete_rq+0x1c/0xd0
> [ 8558.278194]  nvme_pci_complete_rq+0x7b/0x130
> [ 8558.282459]  __blk_mq_complete_request+0xa3/0x160
> [ 8558.287156]  blk_mq_complete_request+0x16/0x20
> [ 8558.291592]  nvme_process_cq+0xf8/0x1e0
> [ 8558.295424]  nvme_queue_rq+0x16e/0x9a0
> [ 8558.299172]  blk_mq_dispatch_rq_list+0x19e/0x330
> [ 8558.303787]  ? blk_mq_flush_busy_ctxs+0x91/0x130
> [ 8558.308400]  blk_mq_sched_dispatch_requests+0x19d/0x1d0
> [ 8558.313617]  __blk_mq_run_hw_queue+0x12e/0x1d0
> [ 8558.318053]  __blk_mq_delay_run_hw_queue+0xb9/0xd0
> [ 8558.322837]  blk_mq_run_hw_queue+0x14/0x20
> [ 8558.326928]  blk_mq_sched_insert_request+0xa4/0x180
> [ 8558.331797]  blk_execute_rq_nowait+0x72/0xf0
> [ 8558.336061]  nvme_nvm_submit_io+0xd9/0x130
> [ 8558.340151]  nvm_submit_io+0x3c/0x70
> [ 8558.343723]  pblk_submit_io+0x1b/0x20
> [ 8558.347379]  pblk_submit_read+0x1ec/0x3a0

[snip]

This happens since we call nvme_process_cq() after submitting IO,
just in case there's something we can complete.
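
So, reading your trace in caller order, the completion of an earlier pblk sync
read can run in the submitting task itself, which is still holding i_rwsem from
the direct read (the call names below are taken straight from the trace;
everything around them is elided):

/* on the task doing the XFS direct read, i.e. still under inode->i_rwsem */
nvme_queue_rq()
  -> nvme_process_cq()                       /* reap the CQ right after submit */
    -> blk_mq_complete_request()
      -> __blk_mq_complete_request()
        -> nvme_pci_complete_rq() -> nvme_complete_rq() -> blk_mq_end_request()
          -> nvme_nvm_end_io() -> nvm_end_io() -> pblk_end_io_sync()
            -> complete(&wait)               /* the "release" side of the
                                                crosslock, now running under
                                                i_rwsem                        */

That inline path is what lockdep is flagging against the i_rwsem held by the
waiter.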

-- 
Jens Axboe