On Mon, Aug 21, 2017 at 05:46:00PM +0200, Peter Zijlstra wrote:
>
> > Booting the very latest -tip on my test machine gets me the below splat.
>
> Dave, TJ, FYI, lockdep grew annotations for completions; it remembers
> which locks were taken before we complete() and checks none of those are
> held while we wait_for_completion().
>
> That is, something like:
>
> 	mutex_lock(&A);
> 	mutex_unlock(&A);
> 	complete(&C);
>
> 	mutex_lock(&A);
> 	wait_for_completion(&C);
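Spelling that out as a standalone user-space analogue (a minimal
sketch with made-up names; a pthread mutex and condvar stand in for
the kernel mutex and completion, nothing here is from the splat), the
hazard those annotations guard against is:

	/* completion-vs-lock inversion, user-space sketch */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
	static pthread_mutex_t C_lock = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t  C_cond = PTHREAD_COND_INITIALIZER;
	static int C_done;

	static void complete_C(void)
	{
		pthread_mutex_lock(&C_lock);
		C_done = 1;
		pthread_cond_signal(&C_cond);
		pthread_mutex_unlock(&C_lock);
	}

	static void wait_for_completion_C(void)
	{
		pthread_mutex_lock(&C_lock);
		while (!C_done)
			pthread_cond_wait(&C_cond, &C_lock);
		pthread_mutex_unlock(&C_lock);
	}

	/* the "completer" needs A before it can signal C */
	static void *completer(void *arg)
	{
		pthread_mutex_lock(&A);
		pthread_mutex_unlock(&A);
		complete_C();
		return NULL;
	}

	int main(void)
	{
		pthread_t t;

		pthread_create(&t, NULL, completer, NULL);

		/*
		 * Waiter holds A across the wait; if we got A first,
		 * the completer can never reach complete_C() and both
		 * threads block forever.
		 */
		pthread_mutex_lock(&A);
		wait_for_completion_C();
		pthread_mutex_unlock(&A);

		pthread_join(t, NULL);
		printf("got lucky with the race this time\n");
		return 0;
	}

Whether that actually hangs depends on which thread wins the race for
A, which is exactly the kind of order-dependent deadlock lockdep is
there to report before it ever fires.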
Ok, that seems reasonable...

> Is now considered a problem. Note that in order for C to link to A it
> needs to have observed the complete() thread acquiring it after a
> wait_for_completion(), something like:
>
> 	wait_for_completion(&C)
> 	mutex_lock(&A);
> 	mutex_unlock(&A);
> 	complete(&C);

Sure.

> That is, only locks observed taken between C's wait_for_completion() and
> its complete() are considered.
>
> Now given the above observance rule and the fact that the below report
> is from the complete, the thing that happened appears to be:
>
> 	lockdep_map_acquire(&work->lockdep_map)
> 	down_write(&A)
>
> 	down_write(&A)
> 	wait_for_completion(&C)
>
> 	lockdep_map_acquire(&work_lockdep_map);
> 	complete(&C)
>
> Which lockdep then puked over because both sides saw the same work
> class.

Let me get this straight, first. If we:

	1. take a lock in a workqueue context; and
	2. in a separate context, hold that lock while we wait for a
	   completion; and
	3. run the completion from the original workqueue where we
	   /once/ held that lock

lockdep will barf?

IIUC, that's a problem because XFS does this all over the place. Let me
pull the trace apart in reverse order to explain why XFS is going to
throw endless false positives on this:

> [   39.159708] -> #0 ((&ioend->io_work)){+.+.}:
> [   39.166126]        process_one_work+0x244/0x6b0
> [   39.171184]        worker_thread+0x48/0x3f0
> [   39.175845]        kthread+0x147/0x180
> [   39.180020]        ret_from_fork+0x2a/0x40
> [   39.184593]        0xffffffffffffffff

That's a data IO completion, which will take an inode lock if we have
to run a transaction to update the inode size or convert an unwritten
extent.

> [   39.100611] -> #1 (&xfs_nondir_ilock_class){++++}:
> [   39.107612]        __lock_acquire+0x10a1/0x1100
> [   39.112660]        lock_acquire+0xea/0x1f0
> [   39.117224]        down_write_nested+0x26/0x60
> [   39.122184]        xfs_ilock+0x166/0x220
> [   39.126563]        __xfs_setfilesize+0x30/0x200
> [   39.131612]        xfs_setfilesize_ioend+0x7f/0xb0
> [   39.136958]        xfs_end_io+0x49/0xf0
> [   39.141240]        process_one_work+0x273/0x6b0
> [   39.146288]        worker_thread+0x48/0x3f0
> [   39.150960]        kthread+0x147/0x180
> [   39.155146]        ret_from_fork+0x2a/0x40

That's the data IO completion locking the inode inside a transaction
to update the inode size inside a workqueue.

> [   38.962397] -> #2 ((complete)&bp->b_iowait){+.+.}:
> [   38.969401]        __lock_acquire+0x10a1/0x1100
> [   38.974460]        lock_acquire+0xea/0x1f0
> [   38.979036]        wait_for_completion+0x3b/0x130
> [   38.984285]        xfs_buf_submit_wait+0xb2/0x590
> [   38.989535]        _xfs_buf_read+0x23/0x30
> [   38.994108]        xfs_buf_read_map+0x14f/0x320
> [   38.999169]        xfs_trans_read_buf_map+0x170/0x5d0
> [   39.004812]        xfs_read_agf+0x97/0x1d0
> [   39.009386]        xfs_alloc_read_agf+0x60/0x240
> [   39.014541]        xfs_alloc_fix_freelist+0x32a/0x3d0
> [   39.020180]        xfs_free_extent_fix_freelist+0x6b/0xa0
> [   39.026206]        xfs_free_extent+0x48/0x120
> [   39.031068]        xfs_trans_free_extent+0x57/0x200
> [   39.036512]        xfs_extent_free_finish_item+0x26/0x40
> [   39.042442]        xfs_defer_finish+0x174/0x770
> [   39.047494]        xfs_itruncate_extents+0x126/0x470
> [   39.053035]        xfs_setattr_size+0x2a1/0x310
> [   39.058093]        xfs_vn_setattr_size+0x57/0x160
> [   39.063342]        xfs_vn_setattr+0x50/0x70
> [   39.068015]        notify_change+0x2ee/0x420
> [   39.072785]        do_truncate+0x5d/0x90
> [   39.077165]        path_openat+0xab2/0xc00
> [   39.081737]        do_filp_open+0x8a/0xf0
> [   39.086213]        do_sys_open+0x123/0x200
> [   39.090787]        SyS_open+0x1e/0x20
> [   39.094876]        entry_SYSCALL_64_fastpath+0x23/0xc2

And that's waiting for a metadata read IO to complete during a
truncate transaction.
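Drawing the cycle lockdep has built out of those three stacks (my
reading of the report, using the lock class names from the splat):

	(&ioend->io_work) --> &xfs_nondir_ilock_class --> (complete)&bp->b_iowait
	        ^                                                     |
	        |                                                     |
	        +---- completion issued from the same work class -----+

i.e. the ioend work takes the ilock (#1), the ilock is held while the
truncate waits on the buffer completion (#2), and that completion is
delivered from what lockdep considers the same work class (#0). It is
the bottom edge that closes the loop, and it only exists because of
how the work items have been classified.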
The buffer being waited on there has no direct connection to the inode
at all - it's an allocation group header that we have to lock to do
block allocation. The completion it is waiting on doesn't even run
through the same workqueue as the ioends - ioends go through
mp->m_data_workqueue or mp->m_unwritten_workqueue, metadata buffers go
through mp->m_buf_workqueue or mp->m_log_workqueue.

So from that perspective, an ioend blocked on an inode lock should
never prevent metadata buffer completions from being run. Hence I'd
call this a false positive at best, though I think it really indicates
a bug in the new lockdep code because it isn't discriminating between
different workqueue contexts properly.

Even if I ignore the fact that buffer completions are run on different
workqueues, there seems to be a bigger problem with this sort of
completion checking. That is, the trace looks plausible because we
definitely hold an inode locked deep inside a truncate operation at
the point where the completion is flagged. Indeed, some transactions
that would flag like this could be holding up to 5 inodes locked and
have tens of other metadata objects locked. There are potentially tens
(maybe even hundreds) of different paths into this IO wait point, and
all have different combinations of objects locked when it triggers. So
there's massive scope for potential deadlocks....

.... and so we must have some way of avoiding this whole class of
problems that lockdep is unaware of. Indeed, I don't think we can get
stuck here, because of the IO locking rules we have specifically to
avoid this sort of issue.

Having IO in flight and running completions during a truncate
operation on the same inode is a Really Bad Thing to be doing. It's
actually a corruption and/or data exposure vector, neither of which
makes users happy, and so filesystems tend to have designed-in
mechanisms for avoiding these problems.

In the case of XFS, it's the data IO serialisation model that prevents
such problems. i.e. the inode locks we hold at this point in the
truncate process (i.e. the XFS_IOLOCK, a.k.a. i_rwsem) prevent new IO
from being started, and we don't start the truncate until we've waited
for all in-progress IO to complete. Hence while the truncate runs and
blocks on metadata IO completions, no data IO can be in progress on
that inode, so there are no completions being run on that inode in
workqueues.

And therefore the IO completion deadlock path reported by lockdep
cannot actually be executed during a truncate, and so it's a false
positive.

Back to the drawing board, I guess....

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com