On Tue, Dec 22, 2020 at 12:51:27AM +0800, Liangyan wrote:
> This is the race scenario based on call trace we captured which cause the
> dentry leak.
>
>
> CPU 0                                       CPU 1
> ovl_set_redirect                            lookup_fast
>   ovl_get_redirect                            __d_lookup
>     dget_dlock
>       //no lock protection here                 spin_lock(&dentry->d_lock)
>       dentry->d_lockref.count++                 dentry->d_lockref.count++
>
>
> If we use dget_parent instead, we may have this race.
>
>
> CPU 0                                       CPU 1
> ovl_set_redirect                            lookup_fast
>   ovl_get_redirect                            __d_lookup
>     dget_parent
>       raw_seqcount_begin(&dentry->d_seq)        spin_lock(&dentry->d_lock)
>       lockref_get_not_zero(&ret->d_lockref)     dentry->d_lockref.count++
And?  lockref_get_not_zero() will observe ->d_lock held and fall back to
taking it.  The whole point of lockref is that the counter and the spinlock
sit next to each other.  The fast path in lockref_get_not_zero() is a cmpxchg
on both at once, and it is taken only if ->d_lock is *NOT* locked; the slow
path does spin_lock() around the manipulations of ->count.  Note that
->d_lock is simply ->d_lockref.lock; ->d_seq has nothing to do with the
whole thing.

The race in mainline is real; if you can observe anything of that sort with
dget_parent(), we have a much worse problem.  Consider dget() vs.
lookup_fast() - no overlayfs weirdness in sight, and the same kind of
concurrent access.  Again, lockref primitives can be safely mixed with other
threads doing operations on ->count while holding ->lock.
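For reference, a compressed userspace sketch of that pattern - this is NOT
lib/lockref.c; the field layout, helper names and the little-endian packing
of {lock, count} into one 64-bit word are simplifications for illustration:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct lockref {
	union {
		_Atomic uint64_t lock_count;	/* both halves, cmpxchg'ed together */
		struct {
			_Atomic uint32_t lock;	/* toy spinlock: 0 == unlocked */
			_Atomic uint32_t count;	/* assumes little-endian layout */
		};
	};
};

static void lr_lock(struct lockref *lr)
{
	uint32_t expected = 0;
	while (!atomic_compare_exchange_weak(&lr->lock, &expected, 1))
		expected = 0;			/* spin until we take the lock */
}

static void lr_unlock(struct lockref *lr)
{
	atomic_store(&lr->lock, 0);
}

/* Increment ->count unless it is zero; returns true on success. */
static bool lockref_get_not_zero(struct lockref *lr)
{
	uint64_t old = atomic_load(&lr->lock_count);

	/*
	 * Fast path: a single cmpxchg over {lock, count}.  It can only
	 * succeed while the lock half is 0, i.e. while nobody holds
	 * ->lock; a holder touching ->count changes lock_count, so a
	 * concurrent fast path fails and retries or falls back.
	 */
	for (int retries = 0; retries < 16; retries++) {
		uint32_t lock  = (uint32_t)old;
		uint32_t count = (uint32_t)(old >> 32);

		if (lock != 0)
			break;			/* locked: go to the slow path */
		if (count == 0)
			return false;

		uint64_t new = (uint64_t)(count + 1) << 32;	/* lock half stays 0 */
		if (atomic_compare_exchange_weak(&lr->lock_count, &old, new))
			return true;
		/* failed cmpxchg refreshed 'old'; retry with the new value */
	}

	/* Slow path: plain "lock, touch count, unlock". */
	lr_lock(lr);
	bool ok = atomic_load(&lr->count) != 0;
	if (ok)
		atomic_fetch_add(&lr->count, 1);
	lr_unlock(lr);
	return ok;
}

The point to take away: the fast-path cmpxchg can only succeed while the
lock word is zero, and any ->count++ done while holding the lock changes
lock_count out from under a concurrent fast path, which then fails and
retries.  That is exactly why lockref primitives and direct manipulation of
->count under ->lock can be freely mixed.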