Hi Joseph,

On Wed, Dec 09, 2015 at 02:07:56PM +0800, joseph wrote:
> On 2015/12/8 16:26, Eric Ren wrote:
> > Hi Joseph,
> > On Tue, Dec 08, 2015 at 02:15:17PM +0800, joseph wrote:
> >> On 2015/12/8 12:51, Eric Ren wrote:
> >>> Hi,
> >>>
> >>> On Tue, Dec 08, 2015 at 11:55:18AM +0800, joseph wrote:
> >>>> Hi Gang,
> >>>> Eric and I have discussed this case before.
> >>>> Using NONBLOCK here is because there is a lock inversion between inode
> >>>> lock and page lock. You can refer to the comments of
> >>>> ocfs2_inode_lock_with_page for details.
> >>>> Actually I have found that NONBLOCK mode is only used in lock inversion
> >>>> cases.
> >>> Yes, those comments help ;-) Let me try to explain it; please correct me
> >>> if anything is wrong.
> >>>
> >>> On node 1, when calling ocfs2_inode_lock_with_page(), thread A (likely
> >>> doing a read) has locked this page. Before thread A steps into
> >>> ocfs2_inode_lock_full(), thread B on node 2 (doing a write) requests an
> >>> EX lock on the same inode. If lockres->l_level is PR, the lockres needs
> >>> to be downconverted PR->NL on node 1, i.e.
> >>> ocfs2_blocking_ast()->ocfs2_wake_downconvert_thread().
> >>>
> >>> On node 1, the ocfs2dc thread then proceeds like this:
> >>> ocfs2_downconvert_thread()
> >>>   ocfs2_downconvert_thread_do_work()
> >>>     ocfs2_process_blocked_lock()
> >>>       ocfs2_unblock_lock()
> >>>         lockres->l_ops->downconvert_worker(lockres, blocking)
> >>>           ocfs2_data_convert_worker() {
> >>>               ...
> >>> 3557          if (blocking == DLM_LOCK_EX) {
> >>> 3558                  truncate_inode_pages(mapping, 0);
> >>> 3559          } else {
> >>> 3560                  /* We only need to wait on the I/O if we're not also
> >>> 3561                   * truncating pages because truncate_inode_pages waits
> >>> 3562                   * for us above. We don't truncate pages if we're
> >>> 3563                   * blocking anything < EXMODE because we want to keep
> >>> 3564                   * them around in that case. */
> >>> 3565                  filemap_fdatawait(mapping);
> >>> 3566          }
> >>>               ...
> >>>           }
> >>> and we can see truncate_inode_pages()
> >>>   truncate_inode_pages_range()
> >>>     trylock_page()/lock_page()/find_lock_page()
> >>>
> >>> If thread A called ocfs2_inode_lock_full() in BLOCK mode, there is a
> >>> window in which the OCFS2_LOCK_BLOCKED flag and ->l_blocking = EX have
> >>> already been set by the BAST, i.e.
> >>> ocfs2_blocking_ast()->ocfs2_generic_handle_bast(), so thread A may be
> >>> blocked.
> >>>
> >>> Now ocfs2dc wants the page lock, but thread A holds the page lock and
> >>> has been blocked because of thread B. Deadlock occurs, right? Again,
> >>> please correct me if anything is wrong.
> >> Yes, you are right.
> >>>
> >>> So the author had to use the NONBLOCK way here, which creates the
> >>> livelock problem. But I cannot understand very clearly how the livelock
> >>> comes up. Joseph or anyone else, could you please elaborate on it?
> >> Firstly it takes the inode lock in NONBLOCK mode; if it can't be
> >> fulfilled right away it returns -EAGAIN, then unlocks the page and lets
> >> the downconvert go on. This is to break the deadlock.
> >> But we do have to take the page lock and the inode lock to safely
> >> accomplish the request. So it has to return to the VFS and retry with
> >> the page lock. Here, taking the inode lock and then unlocking it
> >> immediately in BLOCK mode is meant
> > Thanks! But unfortunately in our case, it often takes a long time
> > waiting here for the inode lock, for some reason I don't know yet. This
> > is likely why ocfs2_readpage() takes so long.
> One time takes long time or retry many times?
One time takes a long time; that's observed by ftrace ;-)
> The former may be because the downconvert takes too much time, or there
> are many requests ahead of this node. And the latter may be that this
> node still loses the race when it retries.
Thanks very much for your info!
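By the way, to convince myself of the inversion above, I put together a tiny
userspace analogy -- just pthread mutexes standing in for the page lock and
the cluster inode lock, nothing ocfs2-specific, and the thread names are made
up. The point is only the shape of the problem: trying the second lock the way
OCFS2_LOCK_NONBLOCK does, and backing off on failure, avoids the AB-BA
deadlock:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t page_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

/* "thread A": takes the page lock first, then wants the inode lock */
static void *reader(void *arg)
{
        pthread_mutex_lock(&page_lock);
        while (pthread_mutex_trylock(&inode_lock)) {    /* NONBLOCK attempt */
                /* -EAGAIN path: drop the page lock so "ocfs2dc" can make
                 * progress, then "return to the VFS" and retry. */
                pthread_mutex_unlock(&page_lock);
                usleep(1000);
                pthread_mutex_lock(&page_lock);
        }
        puts("reader: got page lock + inode lock");
        pthread_mutex_unlock(&inode_lock);
        pthread_mutex_unlock(&page_lock);
        return NULL;
}

/* "ocfs2dc": busy with the inode lock, then needs the page lock
 * (think truncate_inode_pages()). */
static void *downconvert(void *arg)
{
        pthread_mutex_lock(&inode_lock);
        usleep(5000);
        pthread_mutex_lock(&page_lock);
        puts("downconvert: got the page lock, downconvert can finish");
        pthread_mutex_unlock(&page_lock);
        pthread_mutex_unlock(&inode_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, dc;

        pthread_create(&dc, NULL, downconvert, NULL);
        pthread_create(&a, NULL, reader, NULL);
        pthread_join(dc, NULL);
        pthread_join(a, NULL);
        return 0;
}

If reader() used a plain pthread_mutex_lock() for the second lock, the two
threads could wait on each other forever, which is exactly the inversion
ocfs2_inode_lock_with_page() works around.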
> >
> > I really want to know why we lock & unlock the inode here? I'm glad you
> > mentioned the reason, but is it the only one?
> >> to cache the inode and increase the chance of the next retry. But if
> > Any reason to cache the inode? And how can this increase the chance of
> > the retry? It's really interesting... Please elaborate more ;-)
> IMO, this is because the unlock is not truly an unlock. This node's retry
> is somewhat a 'local' operation, while another node has to send messages,
> so this node has a better chance to win the race.
Sounds reasonable, thanks!
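In case it helps others reading along, here is how I picture the "unlock is
not truly unlock" part. This is only a toy model with made-up names, not the
real dlmglue structures: the lockres keeps its granted level cached after the
last local holder drops it, so a later local request at a compatible level is
granted without any cluster messaging, while a remote node still has to send
a message (BAST) to pull the level down:

#include <stdio.h>

enum level { NL, PR, EX };

struct lockres {
        enum level level;       /* level currently granted to this node */
        int holders;            /* local users of the lock */
};

/* Take the lock at 'want'; pretend an upconvert needs cluster messaging. */
static void lock(struct lockres *l, enum level want)
{
        if (l->level < want) {
                printf("  upconvert %d -> %d (cluster message, may block)\n",
                       l->level, want);
                l->level = want;
        } else {
                printf("  level %d already cached, granted locally\n",
                       l->level);
        }
        l->holders++;
}

/* Drop our hold but keep the granted level cached on this node. */
static void unlock(struct lockres *l)
{
        l->holders--;
        printf("  unlock: holders=%d, level stays at %d\n",
               l->holders, l->level);
}

/* A remote node wants EX: downconvert once nobody here holds the lock. */
static void bast(struct lockres *l)
{
        if (l->holders == 0) {
                printf("  BAST: downconvert %d -> NL for the remote node\n",
                       l->level);
                l->level = NL;
        }
}

int main(void)
{
        struct lockres inode = { .level = NL, .holders = 0 };

        puts("blocking lock + immediate unlock after -EAGAIN:");
        lock(&inode, PR);
        unlock(&inode);

        puts("retry from ->readpage, the level is still cached:");
        lock(&inode, PR);       /* granted locally, no messaging */
        unlock(&inode);

        puts("another node asks for EX:");
        bast(&inode);
        return 0;
}

So the blocking lock/unlock after -EAGAIN is not wasted work: it pulls the
level back to this node, and as long as no BAST arrives before the retry, the
retry is the 'local' operation Joseph describes.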
Eric
>
> Thanks,
> Joseph
> >
> > Thanks,
> > Eric
> >> another node successfully takes the inode lock again immediately after
> >> this node unlocks the inode, the flow you described above happens again
> >> and this node still can't be fulfilled. And so forth.
> >>
> >>>
> >>> Thanks,
> >>> Eric
> >>>
> >>>>
> >>>> Thanks,
> >>>> Joseph
> >>>>
> >>>> On 2015/12/8 11:21, Gang He wrote:
> >>>>> Hello Guys,
> >>>>>
> >>>>> There is an issue from a customer, who is complaining that a buffered
> >>>>> read sometimes takes too much time (1 - 10 seconds) when reading and
> >>>>> writing the same file from different nodes concurrently.
> >>>>> Using the demo code from the customer, we can also reproduce this
> >>>>> issue at home (running the test program on a SLES11SP4 OCFS2
> >>>>> cluster); actually the issue can also be reproduced on openSUSE 13.2
> >>>>> (newer code), but in direct-io mode the issue disappears.
> >>>>> Based on my investigation, the root cause is that buffered I/O uses
> >>>>> the cluster lock differently from direct I/O; I do not know why
> >>>>> buffered I/O uses the cluster lock this way.
> >>>>> The code details are as below. In aops.c:
> >>>>> 281 static int ocfs2_readpage(struct file *file, struct page *page)
> >>>>> 282 {
> >>>>> 283         struct inode *inode = page->mapping->host;
> >>>>> 284         struct ocfs2_inode_info *oi = OCFS2_I(inode);
> >>>>> 285         loff_t start = (loff_t)page->index << PAGE_CACHE_SHIFT;
> >>>>> 286         int ret, unlock = 1;
> >>>>> 287
> >>>>> 288         trace_ocfs2_readpage((unsigned long long)oi->ip_blkno,
> >>>>> 289                              (page ? page->index : 0));
> >>>>> 290
> >>>>> 291         ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page);  <<== this line
> >>>>> 292         if (ret != 0) {
> >>>>> 293                 if (ret == AOP_TRUNCATED_PAGE)
> >>>>> 294                         unlock = 0;
> >>>>> 295                 mlog_errno(ret);
> >>>>> 296                 goto out;
> >>>>> 297         }
> >>>>>
> >>>>> In dlmglue.c:
> >>>>> 2442 int ocfs2_inode_lock_with_page(struct inode *inode,
> >>>>> 2443                                struct buffer_head **ret_bh,
> >>>>> 2444                                int ex,
> >>>>> 2445                                struct page *page)
> >>>>> 2446 {
> >>>>> 2447         int ret;
> >>>>> 2448
> >>>>> 2449         ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
> >>>>>              <<== here, why use NONBLOCK mode to get the cluster lock?
> >>>>>              This way will let the reading I/O get starvation.
> >>>>> 2450         if (ret == -EAGAIN) {
> >>>>> 2451                 unlock_page(page);
> >>>>> 2452                 if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
> >>>>> 2453                         ocfs2_inode_unlock(inode, ex);
> >>>>> 2454                 ret = AOP_TRUNCATED_PAGE;
> >>>>> 2455         }
> >>>>> 2456
> >>>>> 2457         return ret;
> >>>>> 2458 }
> >>>>>
> >>>>> If you know the background behind this code, please tell us why it
> >>>>> does not take the lock in a blocking way when reading a page, so that
> >>>>> the reading I/O would get the page fairly when there is concurrent
> >>>>> writing I/O from the other node.
> >>>>> Second, I tried to change that line from OCFS2_LOCK_NONBLOCK to 0
> >>>>> (switching to the blocking way); then the reading I/O is no longer
> >>>>> blocked for too long (which would address the customer's complaint),
> >>>>> but a new problem arises: sometimes the reading I/O and writing I/O
> >>>>> run into a deadlock (why a deadlock? I am looking into it).
> >>>>>
> >>>>> Thanks
> >>>>> Gang
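P.S. Pulling the pieces of this thread together, the quoted
ocfs2_inode_lock_with_page() can be read like this. The code is exactly what
Gang quoted above; the comments are only my summary of the discussion, not
anything from the source:

int ocfs2_inode_lock_with_page(struct inode *inode,
                               struct buffer_head **ret_bh,
                               int ex,
                               struct page *page)
{
        int ret;

        /* We already hold the page lock, so we must not block waiting for
         * the cluster lock: ocfs2dc may need this very page to finish a
         * downconvert (truncate_inode_pages() takes the page lock). */
        ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
        if (ret == -EAGAIN) {
                /* Back off: drop the page lock so the downconvert can run. */
                unlock_page(page);
                /* Take the inode lock in blocking mode and drop it again.
                 * This waits for the downconvert to finish and leaves the
                 * DLM lock cached on this node, so the retry below is mostly
                 * a local operation and has a better chance to win the race
                 * against other nodes. */
                if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
                        ocfs2_inode_unlock(inode, ex);
                /* Ask the VFS to re-take the page lock and call ->readpage
                 * again. */
                ret = AOP_TRUNCATED_PAGE;
        }

        return ret;
}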
_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel