Hi All,

I was just wondering if any of you guys had a chance to validate the
hypothesis and the proposed fix. 

Thanks,
Rahul


-----Original Message-----
From: Srivastava, Rahul 
Sent: Tuesday, August 02, 2005 8:32 AM
To: 'Marcelo Tosatti'; 'Ernie Petrides'; 'Larry Woodman'
Subject: RE: oops in 2.4.25 prune_icache() called from kswapd


Hi,

Thanks for reviewing the mail. I was wondering whether the changes below in
clear_inode() would close the race window:

In clear_inode(), replace the line:

inode->i_state = I_CLEAR;

with the following piece of code:

*****
spin_lock(&inode_lock);
while (inode->i_state & I_LOCK) {
	spin_unlock(&inode_lock);
	__wait_on_inode(inode);
	spin_lock(&inode_lock);
}
inode->i_state = I_CLEAR;
spin_unlock(&inode_lock);
*****
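
For clarity, here is a minimal sketch of how the tail of clear_inode() would
look with that change applied. The surrounding i_cdev/cdput() lines are taken
from the diff quoted further down in this mail, so treat the exact context as
illustrative rather than a verbatim patch:

*****
	if (inode->i_cdev) {
		cdput(inode->i_cdev);
		inode->i_cdev = NULL;
	}
	/*
	 * Proposed change: take inode_lock and wait out any concurrent
	 * __sync_one() that still holds I_LOCK before marking the inode
	 * I_CLEAR, so it cannot be refiled after this point.
	 */
	spin_lock(&inode_lock);
	while (inode->i_state & I_LOCK) {
		spin_unlock(&inode_lock);
		__wait_on_inode(inode);
		spin_lock(&inode_lock);
	}
	inode->i_state = I_CLEAR;
	spin_unlock(&inode_lock);
}
*****

The point of rechecking I_LOCK under inode_lock is that the I_CLEAR
assignment can then no longer overlap a __sync_one() that has already set
I_LOCK for this inode.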

I feel the race is between "__sync_one()" and "iput()/clear_inode()"
(also suggested by Albert) which is as follows:

**************** race condition ********************************************

engine 0: calls iput() and locks inode_lock. iput() removes the inode
          from the i_list and unlocks inode_lock.

engine 1: grabs inode_lock and calls __sync_one().

engine 0: calls clear_inode() and gets past the call to "wait_on_inode()",
          which looks if I_LOCK is set.
          /* From this point onwards clear_inode() and the remainder of
             iput() do not care about I_LOCK or inode_lock. */

engine 1: sets I_LOCK.

engine 0: sets i_state = I_CLEAR.
          iput() calls destroy_inode().
          kmem_cache_free() returns the inode to the free list of the
          inode cache.

engine 1: goes ahead and inserts the freed inode into one of the three
          possible lists.

****************************************************************************


And we end up with a corrupted inode on the inode list.
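
For reference, the reason engine 0 can get past that check at all is that
wait_on_inode() is just an unlocked test of I_LOCK. If I am reading the 2.4
headers right, it is simply:

*****
/* include/linux/fs.h (2.4) -- quoted from memory, please double-check */
static inline void wait_on_inode(struct inode *inode)
{
	if (inode->i_state & I_LOCK)
		__wait_on_inode(inode);
}
*****

so nothing stops engine 1 from setting I_LOCK just after this test returns.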

Your thoughts please.

Thanks,
Rahul




-----Original Message-----
From: Marcelo Tosatti [mailto:[EMAIL PROTECTED]
Sent: Monday, August 01, 2005 6:08 AM
To: Srivastava, Rahul; Ernie Petrides; Larry Woodman
Subject: Re: oops in 2.4.25 prune_icache() called from kswapd


On Thu, Jul 28, 2005 at 05:57:43PM -0500, Srivastava, Rahul wrote:
> Hi Marcelo,
>
> Thanks a lot for your detailed response.
>
> However, I still don't see how adding I_CLEAR in __refile_inode avoids
> the race mentioned below.
>
> I understand that the I_CLEAR flag addition is done to ensure that
> __refile_inode() doesn't insert the inode in one of the four lists and
> should just return (in this particular scenario). Please correct me if
> I am wrong in my assumption.
>
> Now if we look at the code without the I_CLEAR flag (unpatched code), we
> are checking for the I_FREEING flag in __refile_inode(). And if an inode
> is marked for freeing (i.e., I_FREEING is set), the current code also
> returns without adding it to one of the four lists.
>
> And there is no scenario in the code wherein we can have an inode with
> the I_CLEAR flag set but I_FREEING unset.

I can't disagree: the I_FREEING flag will always be set when an attempt
is made to set I_CLEAR (there are asserts to guarantee that).
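
For reference, the assert in question appears to be this check in
clear_inode(), quoted later in this thread:

*****
	if (!(inode->i_state & I_FREEING))
		BUG();
*****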

So, I also can't understand the addition of the I_CLEAR check in
__refile_inode() and its purpose.

Larry, can you clarify for us?

> In a nutshell: if I_CLEAR is set, that means I_FREEING will also be set.
> And since we are already checking for I_FREEING in __refile_inode,
> wouldn't checking for I_CLEAR be a kind of duplication?
>
> Hope I have not misinterpreted the complete story.
>
> Thanks,
> Rahul
>
>
> -----Original Message-----
> From: Marcelo Tosatti [mailto:[EMAIL PROTECTED]
> Sent: Thursday, July 28, 2005 5:38 AM
> To: Srivastava, Rahul
> Subject: Re: oops in 2.4.25 prune_icache() called from kswapd
>
>
> On Thu, Jul 28, 2005 at 10:39:12AM -0500, Srivastava, Rahul wrote:
> > Hi Marcelo,
> >
> > I was seeing your fix in inode.c and need a clarification.
> >
> > In the patch you have added the I_CLEAR flag. What is confusing me is
> > that the "I_CLEAR" flag is set only in "clear_inode()".  And in this
> > same function we have:
> >
> > **********
> >         if (!(inode->i_state & I_FREEING))
> >                 BUG();
> > **********
> >
> > So we are setting I_CLEAR only if I_FREEING is set. If that is the
> > case, shouldn't just a check for I_FREEING be enough in
> > __refile_inode()?
>
> The problem is that it is a race between two processors.
>
> > Basically, I am not able to make out the significance of adding
> > I_CLEAR in __refile_inode().
> >
> > +       if (inode->i_state & (I_FREEING|I_CLEAR))
>
> That will prevent __refile_inode() from putting a freed inode onto
> the lists...
>
> > Thanks for your time and sorry to bother you,
> > Rahul
>
> No problem.
>
> The description goes like this:
>
> The first scenario is:
>
> 1.) cpu0 is in __sync_one() just about to call __refile_inode() after
> taking the inode_lock and clearing I_LOCK.
>
> spin_lock(&inode_lock);
> inode->i_state &= ~I_LOCK;
> if (!(inode->i_state & I_FREEING))
> 	__refile_inode(inode);
> wake_up(&inode->i_wait);
>
> 2.) cpu1 is in [dispose_list()] where it has dropped the inode_lock
> and calls clear_inode(). It doesn't block because I_LOCK is clear, so it
> sets the inode state.
>
> void clear_inode(struct inode *inode)
> {
> 	...
> 	wait_on_inode(inode);
> 	...
> 	inode->i_state = I_CLEAR;
> 	...
> }
>
> 3.) cpu0 calls __refile_inode(), which places it on one of the four
> possible inode lists:
>
> static inline void __refile_inode(struct inode *inode)
> {
> 	if (inode->i_state & I_DIRTY)
> 		to = &inode->i_sb->s_dirty;
> 	else if (atomic_read(&inode->i_count))
> 		to = &inode_in_use;
> 	else if (inode->i_data.nrpages)
> 		to = &inode_unused_pagecache;
> 	else
> 		to = &inode_unused;
>
> 	list_del(&inode->i_list);
> 	list_add(&inode->i_list, to);
> }
>
> 4.) cpu1 returns from clear_inode() then calls destroy_inode() which
> kmem_cache_free()s it.
>
> static void destroy_inode(struct inode *inode)
> {
>
> 	if (inode->i_sb->s_op->destroy_inode)
> 		inode->i_sb->s_op->destroy_inode(inode);
> 	else
> 		kmem_cache_free(inode_cachep, inode);
> }
>
> 5.) at this point we have an inode that has been kmem_cache_free()'d
> but is also sitting on one of the lists determined by __refile_inode();
> that can't be good!!! Also, the code looks the same in RHEL4.
>
>
> >
> >
> >
> >
> > ============================
> >
> > FYI
> >
> > commit cc54d1333e409f714aa9c7db63f7f9ed07cc57a9
> > tree f301f581dd4389028f8b2588940d456904e552f1
> > parent 2e8f68c45925123d33d476ce369b570bd989dd9a
> > author Larry Woodman <[EMAIL PROTECTED]> Fri, 15 Jul 2005 11:32:08
> > -0400 committer Marcelo Tosatti <[EMAIL PROTECTED]> Tue, 26 Jul 2005
> > 07:52:46 -0300
> >
> >     [PATCH] workaround inode cache (prune_icache/__refile_inode) SMP
> >     races
> >
> >     Over the past couple of weeks we have seen two races in the inode
> >     cache code. The first is between [dispose_list()] and
> >     __refile_inode() and the second is between prune_icache() and
> >     truncate_inodes(). I posted both of these patches but wanted to
> >     make sure they got properly reviewed and included in RHEL3-U6.
> >
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -297,7 +297,7 @@ static inline void __refile_inode(struct
> >  {
> >         struct list_head *to;
> >
> > -       if (inode->i_state & I_FREEING)
> > +       if (inode->i_state & (I_FREEING|I_CLEAR))
> >                 return;
> >         if (list_empty(&inode->i_hash))
> >                 return;
> > @@ -634,7 +634,9 @@ void clear_inode(struct inode *inode)
> >                 cdput(inode->i_cdev);
> >                 inode->i_cdev = NULL;
> >         }
> > +       spin_lock(&inode_lock);
> >         inode->i_state = I_CLEAR;
> > +       spin_unlock(&inode_lock);
> >  }
> >
> >  /*
> >
> >
> > On Sun, Jun 19, 2005 at 11:07:44PM +0000, Chris Caputo wrote:
> > > My basic repro method was:
> > >
> > > --
> > > 0) start irqbalance
> > > 1) run loop_dbench, which is the following dbench script which uses
> > >    client_plain.txt:
> > >
> > >    #!/bin/sh
> > >
> > >    while [ 1 ]
> > >    do
> > >         date
> > >         dbench 2
> > >    done
> > >
> > > 2) wait for oops
> > > --
> > >
> > > I think I was using dbench-2.1:
> > >
> > >   http://samba.org/ftp/tridge/dbench/dbench-2.1.tar.gz
> > >
> > > In my case irqbalance was key.  If I didn't run it I never got the
> > > problem.  I think irqbalance just did a good job of exacerbating a
> > > race condition in some way.