Hi all,

I was just wondering if any of you had a chance to validate the hypothesis and the proposed fix.
Thanks,
Rahul

-----Original Message-----
From: Srivastava, Rahul
Sent: Tuesday, August 02, 2005 8:32 AM
To: 'Marcelo Tosatti'; 'Ernie Petrides'; 'Larry Woodman'
Subject: RE: oops in 2.4.25 prune_icache() called from kswapd

Hi,

Thanks for reviewing the mail. I was wondering whether the following
change in clear_inode() would close the race window. In clear_inode(),
replace the line:

	inode->i_state = I_CLEAR;

with this piece of code:

*****
	spin_lock(&inode_lock);
	while (inode->i_state & I_LOCK) {
		spin_unlock(&inode_lock);
		__wait_on_inode(inode);
		spin_lock(&inode_lock);
	}
	inode->i_state = I_CLEAR;
	spin_unlock(&inode_lock);
*****

I believe the race is between __sync_one() and iput()/clear_inode()
(as Albert also suggested), which goes as follows:

**************** race condition *******************************************

engine 0:
	Calls iput() and takes inode_lock. iput() removes the inode
	from the i_list and releases inode_lock.

engine 1:
	Takes inode_lock and calls __sync_one().

engine 0:
	Calls clear_inode() and gets past the call to wait_on_inode(),
	which checks whether I_LOCK is set.
	/* From this point onwards, clear_inode() and the remainder of
	   iput() do not care about I_LOCK or inode_lock. */

engine 1:
	Sets I_LOCK.

engine 0:
	Sets i_state = I_CLEAR.
	iput() calls destroy_inode().
	kmem_cache_free() returns the inode to the free list of the
	inode cache.

engine 1:
	Goes ahead and inserts the freed inode into one of the three
	possible lists.

And we end up with a corrupted inode on the inode list.

Your thoughts, please.

Thanks,
Rahul

-----Original Message-----
From: Marcelo Tosatti [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 01, 2005 6:08 AM
To: Srivastava, Rahul; Ernie Petrides; Larry Woodman
Subject: Re: oops in 2.4.25 prune_icache() called from kswapd

On Thu, Jul 28, 2005 at 05:57:43PM -0500, Srivastava, Rahul wrote:
> Hi Marcelo,
>
> Thanks a lot for your detailed response.
>
> However, I still don't see how adding I_CLEAR in __refile_inode()
> avoids the race mentioned below.
>
> I understand that the I_CLEAR check is added to ensure that
> __refile_inode() doesn't insert the inode into one of the four lists
> and instead just returns (in this particular scenario). Please
> correct me if I am wrong in my assumption.
>
> Now if we look at the code without the I_CLEAR flag (unpatched code),
> we are checking for the I_FREEING flag in __refile_inode(). And if an
> inode is marked for freeing (i.e., I_FREEING is set), the current
> code also returns without adding it to one of the four lists.
>
> And there is no scenario, in the code, where we can have an inode
> with the I_CLEAR flag set but I_FREEING unset.

I can't disagree: the I_FREEING flag will always be set when an attempt
is made to set I_CLEAR (there are asserts to guarantee that). So I also
can't understand the addition of the I_CLEAR check in __refile_inode()
and its purpose.

Larry, can you clarify for us?

> In a nutshell: if I_CLEAR is set, that means I_FREEING will also be
> set. And since we are already checking for I_FREEING in
> __refile_inode(), wouldn't checking for I_CLEAR be a kind of
> duplication?
>
> Hope I have not misinterpreted the complete story.
>
> Thanks,
> Rahul
>
>
> -----Original Message-----
> From: Marcelo Tosatti [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, July 28, 2005 5:38 AM
> To: Srivastava, Rahul
> Subject: Re: oops in 2.4.25 prune_icache() called from kswapd
>
>
> On Thu, Jul 28, 2005 at 10:39:12AM -0500, Srivastava, Rahul wrote:
> > Hi Marcelo,
> >
> > I was looking at your fix in inode.c and need a clarification.
> >
> > In the patch you have added the I_CLEAR flag. What is confusing me
> > is that I_CLEAR is set only in clear_inode(). And in this same
> > function we have:
> >
> > **********
> > 	if (!(inode->i_state & I_FREEING))
> > 		BUG();
> > **********
> >
> > So we are setting I_CLEAR only if I_FREEING is set.
> > If that is the case, shouldn't just a check for I_FREEING be
> > enough in __refile_inode()?
>
> The problem is that it is a race between two processors.
>
> > Basically, I am not able to make out the significance of adding
> > I_CLEAR in __refile_inode():
> >
> > +	if (inode->i_state & (I_FREEING|I_CLEAR))
>
> That will keep __refile_inode() from putting a freed inode onto
> the lists...
>
> > Thanks for your time and sorry to bother you,
> > Rahul
>
> No problem.
>
> The description goes like this.
>
> The first scenario is:
>
> 1.) cpu0 is in __sync_one(), just about to call __refile_inode()
> after taking the inode_lock and clearing I_LOCK:
>
> 	spin_lock(&inode_lock);
> 	inode->i_state &= ~I_LOCK;
> 	if (!(inode->i_state & I_FREEING))
> 		__refile_inode(inode);
> 	wake_up(&inode->i_wait);
>
> 2.) cpu1 is in dispose_list(), where it has dropped the inode_lock
> and calls clear_inode(). It doesn't block, because I_LOCK is clear,
> so it sets the inode state:
>
> 	void clear_inode(struct inode *inode)
> 	{
> 		...
> 		wait_on_inode(inode);
> 		...
> 		inode->i_state = I_CLEAR;
> 		...
> 	}
>
> 3.) cpu0 calls __refile_inode(), which places the inode on one of
> the four possible inode lists:
>
> 	static inline void __refile_inode(struct inode *inode)
> 	{
> 		struct list_head *to;
>
> 		if (inode->i_state & I_DIRTY)
> 			to = &inode->i_sb->s_dirty;
> 		else if (atomic_read(&inode->i_count))
> 			to = &inode_in_use;
> 		else if (inode->i_data.nrpages)
> 			to = &inode_unused_pagecache;
> 		else
> 			to = &inode_unused;
>
> 		list_del(&inode->i_list);
> 		list_add(&inode->i_list, to);
> 	}
>
> 4.) cpu1 returns from clear_inode() and then calls destroy_inode(),
> which kmem_cache_free()s it:
>
> 	static void destroy_inode(struct inode *inode)
> 	{
> 		if (inode->i_sb->s_op->destroy_inode)
> 			inode->i_sb->s_op->destroy_inode(inode);
> 		else
> 			kmem_cache_free(inode_cachep, inode);
> 	}
>
> 5.) At this point we have an inode that has been kmem_cache_free()'d
> and is also sitting on one of the lists determined by
> __refile_inode(). That can't be good!
Also, the code looks the same in RHEL4.


> > ============================
> >
> > FYI
> >
> > commit cc54d1333e409f714aa9c7db63f7f9ed07cc57a9
> > tree f301f581dd4389028f8b2588940d456904e552f1
> > parent 2e8f68c45925123d33d476ce369b570bd989dd9a
> > author Larry Woodman <[EMAIL PROTECTED]> Fri, 15 Jul 2005 11:32:08 -0400
> > committer Marcelo Tosatti <[EMAIL PROTECTED]> Tue, 26 Jul 2005 07:52:46 -0300
> >
> >     [PATCH] workaround inode cache (prune_icache/__refile_inode) SMP races
> >
> >     Over the past couple of weeks we have seen two races in the
> >     inode cache code. The first is between dispose_list() and
> >     __refile_inode(), and the second is between prune_icache() and
> >     truncate_inodes(). I posted both of these patches but wanted to
> >     make sure they got properly reviewed and included in RHEL3-U6.
> >
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -297,7 +297,7 @@ static inline void __refile_inode(struct
> >  {
> >  	struct list_head *to;
> >
> > -	if (inode->i_state & I_FREEING)
> > +	if (inode->i_state & (I_FREEING|I_CLEAR))
> >  		return;
> >  	if (list_empty(&inode->i_hash))
> >  		return;
> > @@ -634,7 +634,9 @@ void clear_inode(struct inode *inode)
> >  		cdput(inode->i_cdev);
> >  		inode->i_cdev = NULL;
> >  	}
> > +	spin_lock(&inode_lock);
> >  	inode->i_state = I_CLEAR;
> > +	spin_unlock(&inode_lock);
> >  }
> >
> >  /*
> >
> >
> > On Sun, Jun 19, 2005 at 11:07:44PM +0000, Chris Caputo wrote:
> > > My basic repro method was:
> > >
> > > --
> > > 0) start irqbalance
> > > 1) run loop_dbench, which is the following dbench script (it uses
> > >    client_plain.txt):
> > >
> > >    #!/bin/sh
> > >
> > >    while [ 1 ]
> > >    do
> > >        date
> > >        dbench 2
> > >    done
> > >
> > > 2) wait for oops
> > > --
> > >
> > > I think I was using dbench-2.1:
> > >
> > >    http://samba.org/ftp/tridge/dbench/dbench-2.1.tar.gz
> > >
> > > In my case irqbalance was key. If I didn't run it I never got the
> > > problem.
> > > I think irqbalance just did a good job of exacerbating the race
> > > condition in some way.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html