Hi Jan, thanks a lot for explaining the problem. Please see my comment below.
On 01/16/2014 06:02 AM, Jan Kara wrote:
> On Thu 16-01-14 07:35:58, Goldwyn Rodrigues wrote:
>> On 01/15/2014 08:47 PM, Jan Kara wrote:
>>> On Wed 15-01-14 17:17:55, Goldwyn Rodrigues wrote:
>>>> On 01/15/2014 09:53 AM, Jan Kara wrote:
>>>>> On Wed 15-01-14 09:32:39, Goldwyn Rodrigues wrote:
>>>>>> On 01/15/2014 07:33 AM, Jan Kara wrote:
>>>>>>> On Wed 15-01-14 06:51:51, Goldwyn Rodrigues wrote:
>>>>>>>> The following patches are reverted in this patch because they
>>>>>>>> caused a regression in unlink() calls.
>>>>>>>>
>>>>>>>> ea455f8ab68338ba69f5d3362b342c115bea8e13 - ocfs2: Push out dropping
>>>>>>>> of dentry lock to ocfs2_wq
>>>>>>>> f7b1aa69be138ad9d7d3f31fa56f4c9407f56b6a - ocfs2: Fix deadlock on
>>>>>>>> umount
>>>>>>>> 5fd131893793567c361ae64cbeb28a2a753bbe35 - ocfs2: Don't oops in
>>>>>>>> ocfs2_kill_sb on a failed mount
>>>>>>>>
>>>>>>>> The regression is caused because these patches delay the iput() in
>>>>>>>> the case of dentry unlocks. This also delays the unlocking of the
>>>>>>>> open lockres. The open lock resource is required to test whether the
>>>>>>>> inode can be wiped from disk or not. When the deleting node does not
>>>>>>>> get the open lock, it marks the inode as orphaned (even though it is
>>>>>>>> not in use by another node/process) and causes a journal checkpoint.
>>>>>>>> This delays operations following the inode eviction. It also moves
>>>>>>>> the inode to the orphan directory, which causes further I/O and a
>>>>>>>> lot of unnecessary orphans.
>>>>>>>>
>>>>>>>> As for the fix, I tried reproducing the problem by using quotas and
>>>>>>>> enabling lockdep, but the issue could not be reproduced.
>>>>>>> So I was thinking about this for a while trying to remember... What it
>>>>>>> really boils down to is whether it is OK to call ocfs2_delete_inode()
>>>>>>> from the DLM downconvert thread. Bear in mind that ocfs2_delete_inode()
>>>>>>> takes all kinds of interesting cluster locks, meaning that the
>>>>>>> downconvert thread can block waiting to acquire some cluster lock. That
>>>>>>> seems like asking for trouble to me (i.e., what if the node holding the
>>>>>>> lock we need from the downconvert thread needs a lock from us first?)
>>>>>>> but maybe it is OK these days.
>>>>>>>
>>>>>> The only lock it tries to take is the "inode lock" resource, which
>>>>>> seems to be fine to take.
>>>>> Why is it obviously fine? Some other node might be holding the inode
>>>>> lock and still be writing to the inode. If that needs a cluster lock for
>>>>> block allocation and that allocation cluster lock is currently held by
>>>>> our node, we are screwed. Aren't we?
>>>>>
>>>> No. If another thread is using the allocation cluster lock, that implies
>>>> we already have the inode lock. Cluster locks are also taken in a strict
>>>> order to avoid ABBA locking.
>>> I agree with what you wrote above but it's not exactly what I'm talking
>>> about. What seems to be possible is:
>>> NODE 1                                  NODE2
>>> holds dentry lock for 'foo'
>>> holds inode lock for GLOBAL_BITMAP_SYSTEM_INODE
>>>                                         dquot_initialize(bar)
>>>                                           ocfs2_dquot_acquire()
>>>                                             ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
>>>                                             ...
>>> downconvert thread (triggered from another
>>> node or a different process from NODE2)
>>>   ocfs2_dentry_post_unlock()
>>>     ...
>>>     iput(foo)
>>>       ocfs2_evict_inode(foo)
>>>         ocfs2_clear_inode(foo)
>>>           dquot_drop(inode)
>>>             ...
>>>             ocfs2_inode_lock(USER_QUOTA_SYSTEM_INODE)
>>>              - blocks
>>>                                             finds we need more space in
>>>                                             quota file
>>>                                             ...
>>>                                             ocfs2_extend_no_holes()
>>>                                               ocfs2_inode_lock(GLOBAL_BITMAP_SYSTEM_INODE)
>>>                                                 - deadlocks waiting for
>>>                                                   downconvert thread
>>>
>> Thanks. This explains the situation much better. What's stopping
>> NODE1 from releasing the GLOBAL_BITMAP_SYSTEM_INODE? (to break the
>> hold-and-wait condition)
> The fact that to release the lock, it has to be downconverted. And there
> is only one downconvert thread (ocfs2dc) and that is busy waiting for
> the USER_QUOTA_SYSTEM_INODE lock...

Yes, that indeed seems to be a problem; thanks a lot for explaining :). I do
not know much about the quota code, but I wonder if we can fix this problem
by enforcing a lock ordering. Could NODE2 first check whether it needs space
before taking the lock on USER_QUOTA_SYSTEM_INODE? If it does, it could
acquire GLOBAL_BITMAP_SYSTEM_INODE before acquiring the lock on
USER_QUOTA_SYSTEM_INODE. (Basically, we would enforce that
GLOBAL_BITMAP_SYSTEM_INODE cannot be taken while a node already holds the
USER_QUOTA_SYSTEM_INODE lock.) A rough sketch of what I mean is at the end
of this mail.

>
>>> However the positive part of this is that the race I was originally afraid
>>> of - when ocfs2_delete_inode() would be triggered from
>>> ocfs2_dentry_post_unlock() - isn't possible because ocfs2 seems to keep
>>> the invariant that a node can cache the dentry for 'foo' without holding
>>> the inode lock for 'foo' only if foo->i_nlink > 0. Thus iput() from the
>>> downconvert thread should never go into ocfs2_delete_inode() (however it
>>> would be nice to assert this in ocfs2_dentry_post_unlock() for peace of
>>> mind).
>> No, ocfs2_dentry_convert_worker sets OCFS2_INODE_MAYBE_ORPHANED. So
>> ocfs2_delete_inode() is called even if i_nlink > 0.
> Ah, too bad :(. Do you know why we do that? I have to admit inode
> deletion locking was always one of the trickier parts of ocfs2 locking and
> I'm not sure I've fully appreciated all its details.
>
>>> Since the deadlock seems to be quota specific, it should be possible to
>>> postpone just the quota processing to the workqueue. It isn't completely
>>> trivial because we still have to clean up the inode quota pointers, but it
>>> should be doable. I'll try to have a look at this tomorrow.
>> Thanks! We need to work out a different way to fix this, so that open
>> locks are not delayed and unlink performance does not suffer.
> Sure, I'm all for fixing this deadlock in a better way.
>
> Honza
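
To make the ordering idea above concrete, here is a minimal, purely
illustrative userspace sketch. It is not ocfs2 code: the two pthread mutexes
and quota_file_needs_extension() are stand-ins I made up for the cluster
locks on GLOBAL_BITMAP_SYSTEM_INODE and USER_QUOTA_SYSTEM_INODE and for the
"do we need to grow the quota file" test. The only point is the ordering:
when an extension may be needed, take the bitmap lock before the quota lock,
so we never request the bitmap lock while already holding the quota lock.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Stand-ins for the two cluster locks involved in the deadlock.
 * Ordering rule in this sketch: bitmap (outer) before quota (inner).
 */
static pthread_mutex_t global_bitmap_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t user_quota_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical check: will acquiring this dquot require growing the
 * quota file (and therefore a block allocation from the global bitmap)?
 */
static bool quota_file_needs_extension(void)
{
	return true;	/* pretend we have to allocate new quota blocks */
}

static void dquot_acquire_sketch(void)
{
	bool need_bitmap = quota_file_needs_extension();

	/*
	 * If an extension may be needed, take the bitmap lock first, so we
	 * never ask for it while already holding the quota lock. This is
	 * what breaks the ABBA cycle with the downconvert thread.
	 */
	if (need_bitmap)
		pthread_mutex_lock(&global_bitmap_lock);

	pthread_mutex_lock(&user_quota_lock);
	/* ... update the quota file, extending it if need_bitmap ... */
	pthread_mutex_unlock(&user_quota_lock);

	if (need_bitmap)
		pthread_mutex_unlock(&global_bitmap_lock);
}

int main(void)
{
	dquot_acquire_sketch();
	printf("quota update done with bitmap-before-quota ordering\n");
	return 0;
}

Of course, in the real code the "do we need to extend" check might be racy
without the quota lock held, so it would probably have to be conservative
(take the bitmap lock whenever an extension is possible) or retry after
dropping both locks. I do not know the quota code well enough to say which
is feasible.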