On Wed, Aug 23, 2017 at 11:31:18AM +0900, Byungchul Park wrote: > > > IIUC, that's a problem because XFS does this all over the place. > > > Let me pull the trace apart in reverse order to explain why XFS is > > > going to throw endless false positives on this: > > > > Yeah, and I agree that's not right or desired. Just trying to figure out > > how to go about fixing that. > > > > Because, with possible exception for the single threaded workqueues, > > there is no actual deadlock there unless its the very same work > > instance. Its the classic instance vs class thing. > > Currently, we do the following in process_one_work(), > > lockdep_map_acquire for a workqueue > lockdep_map_acquire for a work > > But IMHO it should be, > > lockdep_map_acquire for a pair of workqueue and work.
IMHO, we should remove workqueue's lockdep_map or work's one, and make the remaining consider the other, so that it works as if lockdep_map_acquire was used with a pair of workqueue and work, if it's needed. > Right? > > > And splitting up work classes is going to be a nightmare too. > > > > > > [ 39.159708] -> #0 ((&ioend->io_work)){+.+.}: > > > > [ 39.166126] process_one_work+0x244/0x6b0 > > > > [ 39.171184] worker_thread+0x48/0x3f0 > > > > [ 39.175845] kthread+0x147/0x180 > > > > [ 39.180020] ret_from_fork+0x2a/0x40 > > > > [ 39.184593] 0xffffffffffffffff > > > > > > That's a data IO completion, which will take an inode lock if we > > > have to run a transaction to update inode size or convert an > > > unwritten extent. > > > > > > > [ 39.100611] -> #1 (&xfs_nondir_ilock_class){++++}: > > > > [ 39.107612] __lock_acquire+0x10a1/0x1100 > > > > [ 39.112660] lock_acquire+0xea/0x1f0 > > > > [ 39.117224] down_write_nested+0x26/0x60 > > > > [ 39.122184] xfs_ilock+0x166/0x220 > > > > [ 39.126563] __xfs_setfilesize+0x30/0x200 > > > > [ 39.131612] xfs_setfilesize_ioend+0x7f/0xb0 > > > > [ 39.136958] xfs_end_io+0x49/0xf0 > > > > [ 39.141240] process_one_work+0x273/0x6b0 > > > > [ 39.146288] worker_thread+0x48/0x3f0 > > > > [ 39.150960] kthread+0x147/0x180 > > > > [ 39.155146] ret_from_fork+0x2a/0x40 > > > > > > That's the data IO completion locking the inode inside a transaction > > > to update the inode size inside a workqueue. > > > > > > > > > > [ 38.962397] -> #2 ((complete)&bp->b_iowait){+.+.}: > > > > [ 38.969401] __lock_acquire+0x10a1/0x1100 > > > > [ 38.974460] lock_acquire+0xea/0x1f0 > > > > [ 38.979036] wait_for_completion+0x3b/0x130 > > > > [ 38.984285] xfs_buf_submit_wait+0xb2/0x590 > > > > [ 38.989535] _xfs_buf_read+0x23/0x30 > > > > [ 38.994108] xfs_buf_read_map+0x14f/0x320 > > > > [ 38.999169] xfs_trans_read_buf_map+0x170/0x5d0 > > > > [ 39.004812] xfs_read_agf+0x97/0x1d0 > > > > [ 39.009386] xfs_alloc_read_agf+0x60/0x240 > > > > [ 39.014541] xfs_alloc_fix_freelist+0x32a/0x3d0 > > > > [ 39.020180] xfs_free_extent_fix_freelist+0x6b/0xa0 > > > > [ 39.026206] xfs_free_extent+0x48/0x120 > > > > [ 39.031068] xfs_trans_free_extent+0x57/0x200 > > > > [ 39.036512] xfs_extent_free_finish_item+0x26/0x40 > > > > [ 39.042442] xfs_defer_finish+0x174/0x770 > > > > [ 39.047494] xfs_itruncate_extents+0x126/0x470 > > > > [ 39.053035] xfs_setattr_size+0x2a1/0x310 > > > > [ 39.058093] xfs_vn_setattr_size+0x57/0x160 > > > > [ 39.063342] xfs_vn_setattr+0x50/0x70 > > > > [ 39.068015] notify_change+0x2ee/0x420 > > > > [ 39.072785] do_truncate+0x5d/0x90 > > > > [ 39.077165] path_openat+0xab2/0xc00 > > > > [ 39.081737] do_filp_open+0x8a/0xf0 > > > > [ 39.086213] do_sys_open+0x123/0x200 > > > > [ 39.090787] SyS_open+0x1e/0x20 > > > > [ 39.094876] entry_SYSCALL_64_fastpath+0x23/0xc2 > > > > > > And that's waiting for a metadata read IO to complete during a > > > truncate transaction. This has no direct connection to the inode at > > > all - it's a allocation group header that we have to lock to do > > > block allocation. The completion it is waiting on doesn't even run > > > through the same workqueue as the ioends - ioends go through > > > mp->m_data_workqueue or mp->m_unwritten_workqueue, metadata buffers > > > go through mp->m_buf_workqueue or mp->m_log_workqueue. > > > > > > So from that perspective, an ioend blocked on an inode lock should > > > not ever prevent metadata buffer completions from being run. Hence > > > I'd call this a false positive at best, though I think it really > > > indicates a bug in the new lockdep code because it isn't > > > discriminating between different workqueue contexts properly. > > > > Agreed. The interaction with workqueues is buggered. > > I think original uses of lockdep_map were already wrong. I mean it's I meant 'uses in workqueue'. > not a problem of newly introduced code. > > > > Even if I ignore the fact that buffer completions are run on > > > different workqueues, there seems to be a bigger problem with this > > > sort of completion checking. > > > > > > That is, the trace looks plausible because we are definitely hold an > > > inode locked deep inside a truncate operation where the completion > > > if flagged. Indeed, some transactions that would flag like this > > > could be holding up to 5 inodes locked and have tens of other > > > metadata objects locked. There are potentially tens (maybe even > > > hundreds) of different paths into this IO wait point, and all have > > > different combinations of objects locked when it triggers. So > > > there's massive scope for potential deadlocks.... > > > > > > .... and so we must have some way of avoiding this whole class of > > > problems that lockdep is unaware of. > > > > > > So I did the below little hack, which basically wipes the entire lock > > history when we start a work and thereby disregards/looses the > > dependency on the work 'lock'. > > We should not do the following. Why should we give up valuable > dependencies? > > > > > It makes my test box able to boot and build a kernel on XFS, so while I > > see what you're saying (I think), it doesn't appear to instantly show. > > > > Should I run xfstests or something to further verify things are OK? Does > > that need a scratch partition (I keep forgetting how to run that stuff > > :/). > > > > --- > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > index 66011c9f5df3..de91cdce9460 100644 > > --- a/kernel/locking/lockdep.c > > +++ b/kernel/locking/lockdep.c > > @@ -4756,10 +4756,14 @@ void crossrelease_hist_start(enum xhlock_context_t > > c) > > { > > struct task_struct *cur = current; > > > > - if (cur->xhlocks) { > > - cur->xhlock_idx_hist[c] = cur->xhlock_idx; > > - cur->hist_id_save[c] = cur->hist_id; > > - } > > + if (!cur->xhlocks) > > + return; > > + > > + if (c == XHLOCK_PROC) > > + invalidate_xhlock(&xhlock(cur->xhlock_idx)); > > + > > + cur->xhlock_idx_hist[c] = cur->xhlock_idx; > > + cur->hist_id_save[c] = cur->hist_id; > > } > > > > void crossrelease_hist_end(enum xhlock_context_t c)