Re: [patch 0/2] i_version update
On Thu, 2007-05-31 at 10:33 +1000, David Chinner wrote: > On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote: > > On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: > > > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > > > > Hi, > > > > > > > > This is an update of the i_version patch. > > > > The i_version field is a 64bit counter that is set on every inode > > > > creation and that is incremented every time the inode data is modified > > > > (similarly to the "ctime" time-stamp). > > > > > > My understanding (please correct me if I'm wrong) is that the > > > requirements are much more rigourous than simply incrementing an in > > > memory counter on every change. i.e. the this counter has to > > > survive server crashes intact so clients never see the counter go > > > backwards. That means version number changes need to be journalled > > > along with the operation that caused the change of the version > > > number. > > > > > Yeah, the i_version is the in memeory counter. From the patch it looks > > like the counter is being updated inside ext4_mark_iloc_dirty(), so it > > is being journalled and being flush to on-disk ext4 inode structure > > immediately (On-disk ext4 inode structure is being modified/expanded to > > store the counter in the first patch). > > Ok, that catches most things (I missed that), but the version number still > needs to change on file data changes, right? So if we are overwriting the > file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't > get the callout and so the version number doesn't change or get logged. In > that case, the version number is not doing what it needs to do, right? > Hmm, maybe I missed something... but looking at the code again, in the case of overwrite (file date updated),it seems the ctime/mtime is being updated and the inode is being dirtied, so the version number is being updated. vfs_write()->.. ->__generic_file_aio_write_nolock() ->file_update_time() ->mark_inode_dirty_sync() ->__mark_inode_dirty(I_DIRTY_SYNC) ->ext4_dirty_inode() ->ext4_mark_inode_dirty() Regards, Mingming - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Thu, 2007-05-31 at 10:01 +1000, Neil Brown wrote: > This will provide a change number that normally changes only when the > file changes and doesn't require any extra storage on disk. > The change number will change inappropriately only when the inode has > fallen out of cache and is being reload, which is either after a crash > (hopefully rare) of when a file hasn't been used for a while, implying > that it is unlikely that any client has it in cache. It will also change inappropriately if the server is under heavy load and needs to reclaim memory by tossing out inodes that are cached and still in use by the clients. That change will trigger clients to invalidate their caches and to refetch the data from the server, further cranking up the load. Not an ideal solution... Trond - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote: > On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: > > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > > > Hi, > > > > > > This is an update of the i_version patch. > > > The i_version field is a 64bit counter that is set on every inode > > > creation and that is incremented every time the inode data is modified > > > (similarly to the "ctime" time-stamp). > > > > My understanding (please correct me if I'm wrong) is that the > > requirements are much more rigourous than simply incrementing an in > > memory counter on every change. i.e. the this counter has to > > survive server crashes intact so clients never see the counter go > > backwards. That means version number changes need to be journalled > > along with the operation that caused the change of the version > > number. > > > Yeah, the i_version is the in memeory counter. From the patch it looks > like the counter is being updated inside ext4_mark_iloc_dirty(), so it > is being journalled and being flush to on-disk ext4 inode structure > immediately (On-disk ext4 inode structure is being modified/expanded to > store the counter in the first patch). Ok, that catches most things (I missed that), but the version number still needs to change on file data changes, right? So if we are overwriting the file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't get the callout and so the version number doesn't change or get logged. In that case, the version number is not doing what it needs to do, right? Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Wednesday May 30, [EMAIL PROTECTED] wrote: > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > > > The aim is to fulfill a NFSv4 requirement for rfc3530: > > "5.5. Mandatory Attributes - Definitions > > Name# DataType Access Description > > ___ > > change 3 uint64 READ A value created by the > > server that the client can use to determine if file > > data, directory contents or attributes of the object > > > File data writes are included in this list of things that need to > increment the version field. Hence to fulfill the crash requirement, > that implies server data writes either need to be synchronous or > journalled... I think that would be excessive. The important property if the 'change' number is: If the 'change' number is still the same, then the data and metadata etc is still the same. The converse of this (if the data+metadata is still then same then the 'change' is still the same) is valuable but less critical. Having the 'change' change when it doesn't need to will defeat client-side caching and so will reduce performance but not correctness. So after a crash, I think it is perfectly acceptable to assign a change number that is known to be different to anything previously supplied if there is any doubt about recent change history. e.g. suppose we had a filesystem with 1-second resolution mtime, and an in-memory 'change' counter that was incremented on every change. When we load an inode from storage, we initialise the counter to -1: if the mtime is earlier than current_seconds current_nanoseconds: if the mtime is equal to current_seconds. We arrange that when the ctime changes, the change number is reset to 0. Then when the 'change' number of an inode is required, we use the bottom 32bits of the 'change' counter and the 32bits of the mtime. This will provide a change number that normally changes only when the file changes and doesn't require any extra storage on disk. The change number will change inappropriately only when the inode has fallen out of cache and is being reload, which is either after a crash (hopefully rare) of when a file hasn't been used for a while, implying that it is unlikely that any client has it in cache. So in summary: I think it is impossible to have a change number that changes *only* when content changes (as your 'crash' example suggests) and it is quite achievable to have a change number that changes rarely when the content doesn't change. NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote: > On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > > Hi, > > > > This is an update of the i_version patch. > > The i_version field is a 64bit counter that is set on every inode > > creation and that is incremented every time the inode data is modified > > (similarly to the "ctime" time-stamp). > > My understanding (please correct me if I'm wrong) is that the > requirements are much more rigourous than simply incrementing an in > memory counter on every change. i.e. the this counter has to > survive server crashes intact so clients never see the counter go > backwards. That means version number changes need to be journalled > along with the operation that caused the change of the version > number. > Yeah, the i_version is the in memeory counter. From the patch it looks like the counter is being updated inside ext4_mark_iloc_dirty(), so it is being journalled and being flush to on-disk ext4 inode structure immediately (On-disk ext4 inode structure is being modified/expanded to store the counter in the first patch). >This patch is on top of i_version_update_vfs. > The i_version field of the inode is set on inode creation and incremented when > the inode is being modified. > > Signed-off-by: Jean Noel Cordenner <[EMAIL PROTECTED]> > > Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c > === > --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c 2007-05-25 > 18:05:28.0 +0200 > +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c 2007-05-25 18:05:40.0 > +0200 > @@ -565,6 +565,7 @@ > inode->i_blocks = 0; > inode->i_mtime = inode->i_atime = inode->i_ctime = ei->i_crtime = > ext4_current_time(inode); > + inode->i_version = 1; > > memset(ei->i_data, 0, sizeof(ei->i_data)); > ei->i_dir_start_lookup = 0; > Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c > === > --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c 2007-05-25 > 18:05:28.0 +0200 > +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c 2007-05-25 18:05:40.0 > +0200 > @@ -3082,6 +3082,7 @@ > { > int err = 0; > > + inode->i_version++; > /* the do_update_inode consumes one bh->b_count */ > get_bh(iloc->bh); > > Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c > === > --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c 2007-05-25 > 18:05:28.0 +0200 > +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c 2007-05-25 18:05:40.0 > +0200 > @@ -2839,8 +2839,8 @@ > i_size_write(inode, off+len-towrite); > EXT4_I(inode)->i_disksize = inode->i_size; > } > - inode->i_version++; > inode->i_mtime = inode->i_ctime = CURRENT_TIME; > + inode->i_version = 1; > ext4_mark_inode_dirty(handle, inode); > mutex_unlock(&inode->i_mutex); > return len - towrite; I am not very clear about this part -- what is i_version being used for with quota? And shall we reset the counter to 1 for ext4_quota_write()? Mingming - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 0/2] i_version update
On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote: > Hi, > > This is an update of the i_version patch. > The i_version field is a 64bit counter that is set on every inode > creation and that is incremented every time the inode data is modified > (similarly to the "ctime" time-stamp). My understanding (please correct me if I'm wrong) is that the requirements are much more rigourous than simply incrementing an in memory counter on every change. i.e. the this counter has to survive server crashes intact so clients never see the counter go backwards. That means version number changes need to be journalled along with the operation that caused the change of the version number. > The aim is to fulfill a NFSv4 requirement for rfc3530: > "5.5. Mandatory Attributes - Definitions > Name # DataType Access Description > ___ > change3 uint64 READ A value created by the > server that the client can use to determine if file > data, directory contents or attributes of the object File data writes are included in this list of things that need to increment the version field. Hence to fulfill the crash requirement, that implies server data writes either need to be synchronous or journalled... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html