Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Fri, 2017-05-12 at 12:21 -0400, J. Bruce Fields wrote:
> On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> > On Thu, May 11 2017, J. Bruce Fields wrote:
> > > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > > +{
> > > +	u64 chattr;
> > > +
> > > +	chattr = inode->i_ctime.tv_sec << 30;
> > > +	chattr += inode->i_ctime.tv_nsec;
> > > +	chattr += inode->i_version;
> > > +	return chattr;
> >
> > So if I chmod a file, all clients will need to flush the content from their
> > cache?
> > Maybe they already do? Maybe it is a boring corner case?
>
> Yeah, that's the assumption, maybe it's wrong.  I can't recall
> complaints about anyone bitten by that case.

I'm pretty sure that's required by the RFC. The change attribute changes
with both data and metadata changes, and there is no way to tell what
sort of change it was. You have to dump everything out of the cache when
it changes.

> > >  /*
> > >   * Fill in the pre_op attr for the wcc data
> > >   */
> > > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > >  		fhp->fh_pre_mtime = inode->i_mtime;
> > >  		fhp->fh_pre_ctime = inode->i_ctime;
> > >  		fhp->fh_pre_size  = inode->i_size;
> > > -		fhp->fh_pre_change = inode->i_version;
> > > +		fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > >  		fhp->fh_pre_saved = true;
> > >  	}
> > >  }
> > > --- a/fs/nfsd/nfs3xdr.c
> > > +++ b/fs/nfsd/nfs3xdr.c
> > > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > >  		printk("nfsd: inode locked twice during operation.\n");
> > >
> > >  	err = fh_getattr(fhp, &fhp->fh_post_attr);
> > > -	fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > > +	fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > >  	if (err) {
> > >  		fhp->fh_post_saved = false;
> > >  		/* Grab the ctime anyway - set_change_info might use it */
> > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > index 26780d53a6f9..a09532d4a383 100644
> > > --- a/fs/nfsd/nfs4xdr.c
> > > +++ b/fs/nfsd/nfs4xdr.c
> > > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> > >  		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > >  		*p++ = 0;
> > >  	} else if (IS_I_VERSION(inode)) {
> > > -		p = xdr_encode_hyper(p, inode->i_version);
> > > +		p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > >  	} else {
> > >  		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> > >  		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
> >
> > It is *really* confusing to find that fh_post_change is only set in nfs3
> > code, and only used in nfs4 code.
>
> Yup.
>
> > It is probably time to get a 'version' field in 'struct kstat'.
>
> The pre/post_wcc code doesn't seem to be doing an explicit stat, I
> wonder if that matters?

Probably not for now. We only use this for namespace altering operations
anyway (create, link, unlink, and rename). The post code does do a
fh_getattr. It's only the pre-op i_version that comes out of the cache.

Only btrfs, xfs, and ext4 have a real i_version counter today, and they
just scrape that info out of the in-core inode. So while not completely
atomic, you should see a difference in the change_info4 during any of
those operations.

FWIW, userland cephfs now supports a cluster-coherent change attribute,
though the kernel client still needs some work before we can implement
it there. Eventually we'll add that, and at that point we might need to
have nfsd do a getattr in the pre part as well.

-- 
Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> On Thu, May 11 2017, J. Bruce Fields wrote:
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > +	u64 chattr;
> > +
> > +	chattr = inode->i_ctime.tv_sec << 30;
> > +	chattr += inode->i_ctime.tv_nsec;
> > +	chattr += inode->i_version;
> > +	return chattr;
>
> So if I chmod a file, all clients will need to flush the content from their
> cache?
> Maybe they already do? Maybe it is a boring corner case?

Yeah, that's the assumption, maybe it's wrong.  I can't recall
complaints about anyone bitten by that case.

> >  /*
> >   * Fill in the pre_op attr for the wcc data
> >   */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> >  		fhp->fh_pre_mtime = inode->i_mtime;
> >  		fhp->fh_pre_ctime = inode->i_ctime;
> >  		fhp->fh_pre_size  = inode->i_size;
> > -		fhp->fh_pre_change = inode->i_version;
> > +		fhp->fh_pre_change = nfsd4_change_attribute(inode);
> >  		fhp->fh_pre_saved = true;
> >  	}
> >  }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> >  		printk("nfsd: inode locked twice during operation.\n");
> >
> >  	err = fh_getattr(fhp, &fhp->fh_post_attr);
> > -	fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > +	fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> >  	if (err) {
> >  		fhp->fh_post_saved = false;
> >  		/* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> >  		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> >  		*p++ = 0;
> >  	} else if (IS_I_VERSION(inode)) {
> > -		p = xdr_encode_hyper(p, inode->i_version);
> > +		p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> >  	} else {
> >  		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> >  		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
>
> It is *really* confusing to find that fh_post_change is only set in nfs3
> code, and only used in nfs4 code.

Yup.

> It is probably time to get a 'version' field in 'struct kstat'.

The pre/post_wcc code doesn't seem to be doing an explicit stat, I
wonder if that matters?

--b.

> That would allow this code to get a little cleaner.
>
> (to me, this exercise is just a reminder that the NFSv4 change attribute
> is poorly designed ... so it just makes me grumpy).
>
> NeilBrown
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Fri, May 12, 2017 at 07:01:25AM -0400, Jeff Layton wrote:
> This looks reasonable to me (modulo Jan's comment about casting tv_sec
> to u64).
>
> To be clear, I think this is mostly orthogonal to the changes that I was
> originally proposing, right? I think we can still benefit from only
> bumping and storing i_version values after they've been queried.

Definitely, yes.

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Fri, May 12, 2017 at 10:27:54AM +0200, Jan Kara wrote: > On Thu 11-05-17 14:59:43, J. Bruce Fields wrote: > > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote: > > > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote: > > > > 1) Keep i_version as is, make clients also check for i_ctime. > > > > > > That would be a protocol revision, which we'd definitely rather avoid. > > > > > > But can't we accomplish the same by using something like > > > > > > ctime * (some constant) + i_version > > > > > > ? > > > > > > >Pro: No on-disk format changes. > > > >Cons: After a crash, i_version can go backwards (but when file > > > > changes > > > >i_version, i_ctime pair should be still different) or not, data can > > > > be > > > >old or not. > > > > > > This is probably good enough for NFS purposes: typically on an NFS > > > filesystem, results of a read in the face of a concurrent write open are > > > undefined. And writers sync before close. > > > > > > So after a crash with a dirty inode, we're in a situation where an NFS > > > client still needs to resend some writes, sync, and close. I'm OK with > > > things being inconsistent during this window. > > > > > > I do expect things to return to normal once that client's has resent its > > > writes--hence the worry about actually resuing old values after boot > > > (such as if i_version regresses on boot and then increments back to the > > > same value after further writes). Factoring in ctime fixes that. > > > > So for now I'm thinking of just doing something like the following. > > > > Only nfsd needs it for now, but it could be moved to a vfs helper for > > statx, or for individual filesystems that want to do something > > different. (The NFSv4 client will want to use the server's change > > attribute instead, I think. And other filesystems might want to try > > something more ambitious like Neil's proposal.) > > > > --b. 
> > > > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c > > index 12feac6ee2fd..9636c9a60aba 100644 > > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h > > index f84fe6bf9aee..14f09f1ef605 100644 > > --- a/fs/nfsd/nfsfh.h > > +++ b/fs/nfsd/nfsfh.h > > @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp) > > fhp->fh_pre_saved = false; > > } > > > > +static inline u64 nfsd4_change_attribute(struct inode *inode) > > +{ > > + u64 chattr; > > + > > + chattr = inode->i_ctime.tv_sec << 30; > > Won't this overflow on 32-bit archs? tv_sec seems to be defined as long? > Probably you need explicit (u64) cast... Otherwise I'm fine with this. Whoops, yes. Or just assign to chattr as a separate step. I'll fix that. --b. > > + chattr += inode->i_ctime.tv_nsec; > > + chattr += inode->i_version; > > + return chattr; > > +} > > + > > /* > > * Fill in the pre_op attr for the wcc data > > */ > > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp) > > fhp->fh_pre_mtime = inode->i_mtime; > > fhp->fh_pre_ctime = inode->i_ctime; > > fhp->fh_pre_size = inode->i_size; > > - fhp->fh_pre_change = inode->i_version; > > + fhp->fh_pre_change = nfsd4_change_attribute(inode); > > fhp->fh_pre_saved = true; > > } > > } > > --- a/fs/nfsd/nfs3xdr.c > > +++ b/fs/nfsd/nfs3xdr.c > > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp) > > printk("nfsd: inode locked twice during operation.\n"); > > > > err = fh_getattr(fhp, >fh_post_attr); > > - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version; > > + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry)); > > if (err) { > > fhp->fh_post_saved = false; > > /* Grab the ctime anyway - set_change_info might use it */ > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > > index 26780d53a6f9..a09532d4a383 100644 > > --- a/fs/nfsd/nfs4xdr.c > > +++ b/fs/nfsd/nfs4xdr.c > > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat > > *stat, struct inode *inode, > > *p++ = 
cpu_to_be32(convert_to_wallclock(exp->cd->flush_time)); > > *p++ = 0; > > } else if (IS_I_VERSION(inode)) { > > - p = xdr_encode_hyper(p, inode->i_version); > > + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode)); > > } else { > > *p++ = cpu_to_be32(stat->ctime.tv_sec); > > *p++ = cpu_to_be32(stat->ctime.tv_nsec); > -- > Jan Kara> SUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, 2017-05-11 at 14:59 -0400, J. Bruce Fields wrote: > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote: > > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote: > > > 1) Keep i_version as is, make clients also check for i_ctime. > > > > That would be a protocol revision, which we'd definitely rather avoid. > > > > But can't we accomplish the same by using something like > > > > ctime * (some constant) + i_version > > > > ? > > > > >Pro: No on-disk format changes. > > >Cons: After a crash, i_version can go backwards (but when file changes > > >i_version, i_ctime pair should be still different) or not, data can be > > >old or not. > > > > This is probably good enough for NFS purposes: typically on an NFS > > filesystem, results of a read in the face of a concurrent write open are > > undefined. And writers sync before close. > > > > So after a crash with a dirty inode, we're in a situation where an NFS > > client still needs to resend some writes, sync, and close. I'm OK with > > things being inconsistent during this window. > > > > I do expect things to return to normal once that client's has resent its > > writes--hence the worry about actually resuing old values after boot > > (such as if i_version regresses on boot and then increments back to the > > same value after further writes). Factoring in ctime fixes that. > > So for now I'm thinking of just doing something like the following. > > Only nfsd needs it for now, but it could be moved to a vfs helper for > statx, or for individual filesystems that want to do something > different. (The NFSv4 client will want to use the server's change > attribute instead, I think. And other filesystems might want to try > something more ambitious like Neil's proposal.) > > --b. 
> > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c > index 12feac6ee2fd..9636c9a60aba 100644 > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h > index f84fe6bf9aee..14f09f1ef605 100644 > --- a/fs/nfsd/nfsfh.h > +++ b/fs/nfsd/nfsfh.h > @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp) > fhp->fh_pre_saved = false; > } > > +static inline u64 nfsd4_change_attribute(struct inode *inode) > +{ > + u64 chattr; > + > + chattr = inode->i_ctime.tv_sec << 30; > + chattr += inode->i_ctime.tv_nsec; > + chattr += inode->i_version; > + return chattr; > +} > + > /* > * Fill in the pre_op attr for the wcc data > */ > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp) > fhp->fh_pre_mtime = inode->i_mtime; > fhp->fh_pre_ctime = inode->i_ctime; > fhp->fh_pre_size = inode->i_size; > - fhp->fh_pre_change = inode->i_version; > + fhp->fh_pre_change = nfsd4_change_attribute(inode); > fhp->fh_pre_saved = true; > } > } > --- a/fs/nfsd/nfs3xdr.c > +++ b/fs/nfsd/nfs3xdr.c > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp) > printk("nfsd: inode locked twice during operation.\n"); > > err = fh_getattr(fhp, >fh_post_attr); > - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version; > + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry)); > if (err) { > fhp->fh_post_saved = false; > /* Grab the ctime anyway - set_change_info might use it */ > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index 26780d53a6f9..a09532d4a383 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat > *stat, struct inode *inode, > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time)); > *p++ = 0; > } else if (IS_I_VERSION(inode)) { > - p = xdr_encode_hyper(p, inode->i_version); > + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode)); > } else { > *p++ = cpu_to_be32(stat->ctime.tv_sec); > *p++ = cpu_to_be32(stat->ctime.tv_nsec); Sorry I've been MIA on this discussion. 
I've had a very busy spring... This looks reasonable to me (modulo Jan's comment about casting tv_sec to u64). To be clear, I think this is mostly orthogonal to the changes that I was originally proposing, right? I think we can still benefit from only bumping and storing i_version values after they've been queried. -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu 11-05-17 14:59:43, J. Bruce Fields wrote: > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote: > > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote: > > > 1) Keep i_version as is, make clients also check for i_ctime. > > > > That would be a protocol revision, which we'd definitely rather avoid. > > > > But can't we accomplish the same by using something like > > > > ctime * (some constant) + i_version > > > > ? > > > > >Pro: No on-disk format changes. > > >Cons: After a crash, i_version can go backwards (but when file changes > > >i_version, i_ctime pair should be still different) or not, data can be > > >old or not. > > > > This is probably good enough for NFS purposes: typically on an NFS > > filesystem, results of a read in the face of a concurrent write open are > > undefined. And writers sync before close. > > > > So after a crash with a dirty inode, we're in a situation where an NFS > > client still needs to resend some writes, sync, and close. I'm OK with > > things being inconsistent during this window. > > > > I do expect things to return to normal once that client's has resent its > > writes--hence the worry about actually resuing old values after boot > > (such as if i_version regresses on boot and then increments back to the > > same value after further writes). Factoring in ctime fixes that. > > So for now I'm thinking of just doing something like the following. > > Only nfsd needs it for now, but it could be moved to a vfs helper for > statx, or for individual filesystems that want to do something > different. (The NFSv4 client will want to use the server's change > attribute instead, I think. And other filesystems might want to try > something more ambitious like Neil's proposal.) > > --b. 
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> 	fhp->fh_pre_saved = false;
> }
> 
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	chattr = inode->i_ctime.tv_sec << 30;

Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
Probably you need explicit (u64) cast... Otherwise I'm fine with this.

								Honza

> +	chattr += inode->i_ctime.tv_nsec;
> +	chattr += inode->i_version;
> +	return chattr;
> +}
> +
> /*
>  * Fill in the pre_op attr for the wcc data
>  */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> 		fhp->fh_pre_mtime = inode->i_mtime;
> 		fhp->fh_pre_ctime = inode->i_ctime;
> 		fhp->fh_pre_size = inode->i_size;
> -		fhp->fh_pre_change = inode->i_version;
> +		fhp->fh_pre_change = nfsd4_change_attribute(inode);
> 		fhp->fh_pre_saved = true;
> 	}
> }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> 		printk("nfsd: inode locked twice during operation.\n");
> 
> 	err = fh_getattr(fhp, &fhp->fh_post_attr);
> -	fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> +	fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> 	if (err) {
> 		fhp->fh_post_saved = false;
> 		/* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> 		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> 		*p++ = 0;
> 	} else if (IS_I_VERSION(inode)) {
> -		p = xdr_encode_hyper(p, inode->i_version);
> +		p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> 	} else {
> 		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> 		*p++ = cpu_to_be32(stat->ctime.tv_nsec);

-- 
Jan Kara
SUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, May 11 2017, J. Bruce Fields wrote: > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote: >> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote: >> > 1) Keep i_version as is, make clients also check for i_ctime. >> >> That would be a protocol revision, which we'd definitely rather avoid. >> >> But can't we accomplish the same by using something like >> >> ctime * (some constant) + i_version >> >> ? >> >> >Pro: No on-disk format changes. >> >Cons: After a crash, i_version can go backwards (but when file changes >> >i_version, i_ctime pair should be still different) or not, data can be >> >old or not. >> >> This is probably good enough for NFS purposes: typically on an NFS >> filesystem, results of a read in the face of a concurrent write open are >> undefined. And writers sync before close. >> >> So after a crash with a dirty inode, we're in a situation where an NFS >> client still needs to resend some writes, sync, and close. I'm OK with >> things being inconsistent during this window. >> >> I do expect things to return to normal once that client's has resent its >> writes--hence the worry about actually resuing old values after boot >> (such as if i_version regresses on boot and then increments back to the >> same value after further writes). Factoring in ctime fixes that. > > So for now I'm thinking of just doing something like the following. > > Only nfsd needs it for now, but it could be moved to a vfs helper for > statx, or for individual filesystems that want to do something > different. (The NFSv4 client will want to use the server's change > attribute instead, I think. And other filesystems might want to try > something more ambitious like Neil's proposal.) > > --b. 
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> 	fhp->fh_pre_saved = false;
> }
> 
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> +	u64 chattr;
> +
> +	chattr = inode->i_ctime.tv_sec << 30;
> +	chattr += inode->i_ctime.tv_nsec;
> +	chattr += inode->i_version;
> +	return chattr;

So if I chmod a file, all clients will need to flush the content from their
cache? Maybe they already do? Maybe it is a boring corner case?

> +}
> +
> /*
>  * Fill in the pre_op attr for the wcc data
>  */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> 		fhp->fh_pre_mtime = inode->i_mtime;
> 		fhp->fh_pre_ctime = inode->i_ctime;
> 		fhp->fh_pre_size = inode->i_size;
> -		fhp->fh_pre_change = inode->i_version;
> +		fhp->fh_pre_change = nfsd4_change_attribute(inode);
> 		fhp->fh_pre_saved = true;
> 	}
> }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> 		printk("nfsd: inode locked twice during operation.\n");
> 
> 	err = fh_getattr(fhp, &fhp->fh_post_attr);
> -	fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> +	fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> 	if (err) {
> 		fhp->fh_post_saved = false;
> 		/* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
> 		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> 		*p++ = 0;
> 	} else if (IS_I_VERSION(inode)) {
> -		p = xdr_encode_hyper(p, inode->i_version);
> +		p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> 	} else {
> 		*p++ = cpu_to_be32(stat->ctime.tv_sec);
> 		*p++ = cpu_to_be32(stat->ctime.tv_nsec);

It is *really* confusing to find that fh_post_change is only set in nfs3 code,
and only used in nfs4 code.

It is probably time to get a 'version' field in 'struct kstat'. That would
allow this code to get a little cleaner.

(to me, this exercise is just a reminder that the NFSv4 change attribute is
poorly designed ... so it just makes me grumpy).

NeilBrown

> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote: > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote: > > 1) Keep i_version as is, make clients also check for i_ctime. > > That would be a protocol revision, which we'd definitely rather avoid. > > But can't we accomplish the same by using something like > > ctime * (some constant) + i_version > > ? > > >Pro: No on-disk format changes. > >Cons: After a crash, i_version can go backwards (but when file changes > >i_version, i_ctime pair should be still different) or not, data can be > >old or not. > > This is probably good enough for NFS purposes: typically on an NFS > filesystem, results of a read in the face of a concurrent write open are > undefined. And writers sync before close. > > So after a crash with a dirty inode, we're in a situation where an NFS > client still needs to resend some writes, sync, and close. I'm OK with > things being inconsistent during this window. > > I do expect things to return to normal once that client has resent its > writes--hence the worry about actually reusing old values after boot > (such as if i_version regresses on boot and then increments back to the > same value after further writes). Factoring in ctime fixes that. So for now I'm thinking of just doing something like the following. Only nfsd needs it for now, but it could be moved to a vfs helper for statx, or for individual filesystems that want to do something different. (The NFSv4 client will want to use the server's change attribute instead, I think. And other filesystems might want to try something more ambitious like Neil's proposal.) --b. 
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 12feac6ee2fd..9636c9a60aba 100644
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f84fe6bf9aee..14f09f1ef605 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
 	fhp->fh_pre_saved = false;
 }
 
+static inline u64 nfsd4_change_attribute(struct inode *inode)
+{
+	u64 chattr;
+
+	chattr = inode->i_ctime.tv_sec << 30;
+	chattr += inode->i_ctime.tv_nsec;
+	chattr += inode->i_version;
+	return chattr;
+}
+
 /*
  * Fill in the pre_op attr for the wcc data
  */
@@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
 		fhp->fh_pre_mtime = inode->i_mtime;
 		fhp->fh_pre_ctime = inode->i_ctime;
 		fhp->fh_pre_size = inode->i_size;
-		fhp->fh_pre_change = inode->i_version;
+		fhp->fh_pre_change = nfsd4_change_attribute(inode);
 		fhp->fh_pre_saved = true;
 	}
 }
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
 		printk("nfsd: inode locked twice during operation.\n");
 
 	err = fh_getattr(fhp, &fhp->fh_post_attr);
-	fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
+	fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
 	if (err) {
 		fhp->fh_post_saved = false;
 		/* Grab the ctime anyway - set_change_info might use it */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 26780d53a6f9..a09532d4a383 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat *stat, struct inode *inode,
 		*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
 		*p++ = 0;
 	} else if (IS_I_VERSION(inode)) {
-		p = xdr_encode_hyper(p, inode->i_version);
+		p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
 	} else {
 		*p++ = cpu_to_be32(stat->ctime.tv_sec);
 		*p++ = cpu_to_be32(stat->ctime.tv_nsec);
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu 06-04-17 11:12:02, NeilBrown wrote: > On Wed, Apr 05 2017, Jan Kara wrote: > >> If you want to ensure read-only files can remain cached over a crash, > >> then you would have to mark a file in some way on stable storage > >> *before* allowing any change. > >> e.g. you could use the lsb. Odd i_versions might have been changed > >> recently and crash-count*large-number needs to be added. > >> Even i_versions have not been changed recently and nothing need be > >> added. > >> > >> If you want to change a file with an even i_version, you subtract > >> crash-count*large-number > >> to the i_version, then set lsb. This is written to stable storage before > >> the change. > >> > >> If a file has not been changed for a while, you can add > >> crash-count*large-number > >> and clear lsb. > >> > >> The lsb of the i_version would be for internal use only. It would not > >> be visible outside the filesystem. > >> > >> It feels a bit clunky, but I think it would work and is the best > >> combination of Jan's idea and your requirement. > >> The biggest cost would be switching to 'odd' before an changes, and the > >> unknown is when does it make sense to switch to 'even'. > > > > Well, there is also a problem that you would need to somehow remember with > > which 'crash count' the i_version has been previously reported as that is > > not stored on disk with my scheme. So I don't think we can easily use your > > scheme. > > I don't think there is a problem here maybe I didn't explain > properly or something. > > I'm assuming there is a crash-count that is stored once per filesystem. > This might be a disk-format change, or maybe the "Last checked" time > could be used with ext4 (that is a bit horrible though). > > Every on-disk i_version has a flag to choose between: > - use this number as it is, but update it on-disk before any change > - add multiple of current crash-count to this number before use. 
> If you crash during an update, the i_version is thus automatically > increased. > > To change from the first option to the second option you subtract the > multiple of the current crash-count (which might make the stored > i_version negative), and flip the bit. > To change from the second option to the first, you add the multiple > of the current crash-count, and flip the bit. > In each case, the externally visible i_version does not change. > Nothing needs to be stored except the per-inode i_version and the per-fs > crash_count. Right, I didn't realize you would subtract crash counter when flipping the bit and then add it back when flipping again. That would work. > > So the options we have are: > > > > 1) Keep i_version as is, make clients also check for i_ctime. > >Pro: No on-disk format changes. > >Cons: After a crash, i_version can go backwards (but when file changes > >i_version, i_ctime pair should be still different) or not, data can be > >old or not. > > I like to think of this approach as using the i_version as an extension > to the i_ctime. > i_ctime doesn't necessarily change on every file modification, either > because it is not a modification that is meant to change i_ctime, or > because i_ctime doesn't have the resolution to show a very small change > in time, or because the clock that is used to update i_ctime doesn't > have much resolution. > So when a change happens, if the stored c_time changes, set i_version to > zero, otherwise increment i_version. > Then the externally visible i-version is a combination of the stored > c_time and the stored i_version. > If you only used 1-second ctime resolution for versioning purposes, you > could provide a 64bit i_version as 34 bits of ctime and 30 bits of > changes-in-one-second. > It is important that the resolution of ctime used is less that the > fastest possible restart after a crash. 
> > I don't think that i_version going backwards should be a problem, as > long as an old version means exactly the same old data. Presumably > journalling would ensure that the data and ctime/version are updated > atomically. So as Dave and I wrote earlier in this thread, journalling does not ensure data vs ctime/version consistency (well, except for ext4 in data=journal mode but people rarely run that due to performance implications). So you can get old data and new version as well as new data and old version after a crash. The only thing filesystems guarantee is that you will not see uninitialized blocks and that fsync makes both data & ctime/version persistent. But as Bruce wrote for NFS open-to-close semantics this may be actually good enough. Honza -- Jan KaraSUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Apr 05 2017, Jan Kara wrote: > On Wed 05-04-17 11:43:32, NeilBrown wrote: >> On Tue, Apr 04 2017, J. Bruce Fields wrote: >> >> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote: >> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote: >> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: >> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: >> >> > > > Because if above is acceptable we could make reported i_version to >> >> > > > be a sum >> >> > > > of "superblock crash counter" and "inode i_version". We increment >> >> > > > "superblock crash counter" whenever we detect unclean filesystem >> >> > > > shutdown. >> >> > > > That way after a crash we are guaranteed each inode will report new >> >> > > > i_version (the sum would probably have to look like "superblock >> >> > > > crash >> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing >> >> > > > possible >> >> > > > i_version numbers we gave away but did not write to disk but >> >> > > > still...). >> >> > > > Thoughts? >> >> > >> >> > How hard is this for filesystems to support? Do they need an on-disk >> >> > format change to keep track of the crash counter? Maybe not, maybe the >> >> > high bits of the i_version counters are all they need. >> >> > >> >> >> >> Yeah, I imagine we'd need a on-disk change for this unless there's >> >> something already present that we could use in place of a crash counter. >> > >> > We could consider using the current time instead. So, put the current >> > time (or time of last boot, or this inode's ctime, or something) in the >> > high bits of the change attribute, and keep the low bits as a counter. >> >> This is a very different proposal. >> I don't think Jan was suggesting that the i_version be split into two >> bit fields, one the change-counter and one the crash-counter. 
>> Rather, the crash-counter was multiplied by a large-number and added to >> the change-counter with the expectation that while not ever >> change-counter landed on disk, at least 1 in every large-number would. >> So after each crash we effectively add large-number to the >> change-counter, and can be sure that number hasn't been used already. > > Yes, that was my thinking. > >> To store the crash-counter in each inode (which does appeal) you would >> need to be able to remove it before adding the new crash counter, and >> that requires bit-fields. Maybe there are enough bits. > > Furthermore you'd have a potential problem that you need to change > i_version on disk just because you are reading after a crash and such > changes tend to be problematic (think of read-only mounts and stuff like > that). > >> If you want to ensure read-only files can remain cached over a crash, >> then you would have to mark a file in some way on stable storage >> *before* allowing any change. >> e.g. you could use the lsb. Odd i_versions might have been changed >> recently and crash-count*large-number needs to be added. >> Even i_versions have not been changed recently and nothing need be >> added. >> >> If you want to change a file with an even i_version, you subtract >> crash-count*large-number >> to the i_version, then set lsb. This is written to stable storage before >> the change. >> >> If a file has not been changed for a while, you can add >> crash-count*large-number >> and clear lsb. >> >> The lsb of the i_version would be for internal use only. It would not >> be visible outside the filesystem. >> >> It feels a bit clunky, but I think it would work and is the best >> combination of Jan's idea and your requirement. >> The biggest cost would be switching to 'odd' before an changes, and the >> unknown is when does it make sense to switch to 'even'. 
> > Well, there is also a problem that you would need to somehow remember with > which 'crash count' the i_version has been previously reported as that is > not stored on disk with my scheme. So I don't think we can easily use your > scheme. I don't think there is a problem here; maybe I didn't explain properly or something. I'm assuming there is a crash-count that is stored once per filesystem. This might be a disk-format change, or maybe the "Last checked" time could be used with ext4 (that is a bit horrible though). Every on-disk i_version has a flag to choose between: - use this number as it is, but update it on-disk before any change - add multiple of current crash-count to this number before use. If you crash during an update, the i_version is thus automatically increased. To change from the first option to the second option you subtract the multiple of the current crash-count (which might make the stored i_version negative), and flip the bit. To change from the second option to the first, you add the multiple of the current crash-count, and flip the bit. In each case, the externally visible i_version does not change. Nothing needs to be stored except the per-inode i_version and the per-fs crash-count.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote: > 1) Keep i_version as is, make clients also check for i_ctime. That would be a protocol revision, which we'd definitely rather avoid. But can't we accomplish the same by using something like ctime * (some constant) + i_version ? >Pro: No on-disk format changes. >Cons: After a crash, i_version can go backwards (but when file changes >i_version, i_ctime pair should be still different) or not, data can be >old or not. This is probably good enough for NFS purposes: typically on an NFS filesystem, results of a read in the face of a concurrent write open are undefined. And writers sync before close. So after a crash with a dirty inode, we're in a situation where an NFS client still needs to resend some writes, sync, and close. I'm OK with things being inconsistent during this window. I do expect things to return to normal once that client has resent its writes--hence the worry about actually reusing old values after boot (such as if i_version regresses on boot and then increments back to the same value after further writes). Factoring in ctime fixes that. > 2) Fsync when reporting i_version. >Pro: No on-disk format changes, strong consistency of i_version and > data. >Cons: Difficult to implement for filesystems due to locking constraints. > High performance overhead of i_version reporting. Sounds painful. > 3) Some variant of crash counter. >Pro: i_version cannot go backwards. >Cons: Requires on-disk format changes. After a crash data can be old > (however i_version increased). Also, some unnecessary invalidations. Which maybe can be mostly avoided by some variation of Neil's scheme. It looks to me like option (1) is doable now and introduces no regressions compared to the current situation. (2) and (3) are more complicated and involve some tradeoffs. Also, we can implement (1) and switch to (2) or (3) later. 
We'd want to do it without reported i_versions decreasing on kernel upgrade, but there are multiple ways of handling that. (Worst case, just restrict the change to newly created filesystems.) --b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Apr 05, 2017 at 11:43:32AM +1000, NeilBrown wrote: > On Tue, Apr 04 2017, J. Bruce Fields wrote: > > > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote: > >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote: > >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > >> > > > Because if above is acceptable we could make reported i_version to > >> > > > be a sum > >> > > > of "superblock crash counter" and "inode i_version". We increment > >> > > > "superblock crash counter" whenever we detect unclean filesystem > >> > > > shutdown. > >> > > > That way after a crash we are guaranteed each inode will report new > >> > > > i_version (the sum would probably have to look like "superblock crash > >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing > >> > > > possible > >> > > > i_version numbers we gave away but did not write to disk but > >> > > > still...). > >> > > > Thoughts? > >> > > >> > How hard is this for filesystems to support? Do they need an on-disk > >> > format change to keep track of the crash counter? Maybe not, maybe the > >> > high bits of the i_version counters are all they need. > >> > > >> > >> Yeah, I imagine we'd need a on-disk change for this unless there's > >> something already present that we could use in place of a crash counter. > > > > We could consider using the current time instead. So, put the current > > time (or time of last boot, or this inode's ctime, or something) in the > > high bits of the change attribute, and keep the low bits as a counter. > > This is a very different proposal. > I don't think Jan was suggesting that the i_version be split into two > bit fields, one the change-counter and one the crash-counter. 
> Rather, the crash-counter was multiplied by a large-number and added to > the change-counter with the expectation that while not ever > change-counter landed on disk, at least 1 in every large-number would. > So after each crash we effectively add large-number to the > change-counter, and can be sure that number hasn't been used already. I was sort of ignoring the distinction between concatenate(A,B) and A*m+B, but, sure, multiplying's probably better. > To store the crash-counter in each inode (which does appeal) you would > need to be able to remove it before adding the new crash counter, and > that requires bit-fields. Maybe there are enough bits. i_version and the NFSv4 change attribute are 64 bits which gives us a fair amount of flexibility. > If you want to ensure read-only files can remain cached over a crash, > then you would have to mark a file in some way on stable storage > *before* allowing any change. > e.g. you could use the lsb. Odd i_versions might have been changed > recently and crash-count*large-number needs to be added. > Even i_versions have not been changed recently and nothing need be > added. > > If you want to change a file with an even i_version, you subtract > crash-count*large-number > to the i_version, then set lsb. This is written to stable storage before > the change. > > If a file has not been changed for a while, you can add > crash-count*large-number > and clear lsb. > > The lsb of the i_version would be for internal use only. It would not > be visible outside the filesystem. > > It feels a bit clunky, but I think it would work and is the best > combination of Jan's idea and your requirement. > The biggest cost would be switching to 'odd' before an changes, and the > unknown is when does it make sense to switch to 'even'. I'm not sure how to model the costs. Something like that might work. --b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed 05-04-17 11:43:32, NeilBrown wrote: > On Tue, Apr 04 2017, J. Bruce Fields wrote: > > > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote: > >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote: > >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > >> > > > Because if above is acceptable we could make reported i_version to > >> > > > be a sum > >> > > > of "superblock crash counter" and "inode i_version". We increment > >> > > > "superblock crash counter" whenever we detect unclean filesystem > >> > > > shutdown. > >> > > > That way after a crash we are guaranteed each inode will report new > >> > > > i_version (the sum would probably have to look like "superblock crash > >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing > >> > > > possible > >> > > > i_version numbers we gave away but did not write to disk but > >> > > > still...). > >> > > > Thoughts? > >> > > >> > How hard is this for filesystems to support? Do they need an on-disk > >> > format change to keep track of the crash counter? Maybe not, maybe the > >> > high bits of the i_version counters are all they need. > >> > > >> > >> Yeah, I imagine we'd need a on-disk change for this unless there's > >> something already present that we could use in place of a crash counter. > > > > We could consider using the current time instead. So, put the current > > time (or time of last boot, or this inode's ctime, or something) in the > > high bits of the change attribute, and keep the low bits as a counter. > > This is a very different proposal. > I don't think Jan was suggesting that the i_version be split into two > bit fields, one the change-counter and one the crash-counter. > Rather, the crash-counter was multiplied by a large-number and added to > the change-counter with the expectation that while not ever > change-counter landed on disk, at least 1 in every large-number would. 
> So after each crash we effectively add large-number to the > change-counter, and can be sure that number hasn't been used already. Yes, that was my thinking. > To store the crash-counter in each inode (which does appeal) you would > need to be able to remove it before adding the new crash counter, and > that requires bit-fields. Maybe there are enough bits. Furthermore you'd have a potential problem that you need to change i_version on disk just because you are reading after a crash and such changes tend to be problematic (think of read-only mounts and stuff like that). > If you want to ensure read-only files can remain cached over a crash, > then you would have to mark a file in some way on stable storage > *before* allowing any change. > e.g. you could use the lsb. Odd i_versions might have been changed > recently and crash-count*large-number needs to be added. > Even i_versions have not been changed recently and nothing need be > added. > > If you want to change a file with an even i_version, you subtract > crash-count*large-number > to the i_version, then set lsb. This is written to stable storage before > the change. > > If a file has not been changed for a while, you can add > crash-count*large-number > and clear lsb. > > The lsb of the i_version would be for internal use only. It would not > be visible outside the filesystem. > > It feels a bit clunky, but I think it would work and is the best > combination of Jan's idea and your requirement. > The biggest cost would be switching to 'odd' before an changes, and the > unknown is when does it make sense to switch to 'even'. Well, there is also a problem that you would need to somehow remember with which 'crash count' the i_version has been previously reported as that is not stored on disk with my scheme. So I don't think we can easily use your scheme. So the options we have are: 1) Keep i_version as is, make clients also check for i_ctime. Pro: No on-disk format changes. 
Cons: After a crash, i_version can go backwards (but when file changes i_version, i_ctime pair should be still different) or not, data can be old or not. 2) Fsync when reporting i_version. Pro: No on-disk format changes, strong consistency of i_version and data. Cons: Difficult to implement for filesystems due to locking constraints. High performance overhead of i_version reporting. 3) Some variant of crash counter. Pro: i_version cannot go backwards. Cons: Requires on-disk format changes. After a crash data can be old (however i_version increased). Honza -- Jan Kara SUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Apr 04 2017, J. Bruce Fields wrote: > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote: >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote: >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: >> > > > Because if above is acceptable we could make reported i_version to be >> > > > a sum >> > > > of "superblock crash counter" and "inode i_version". We increment >> > > > "superblock crash counter" whenever we detect unclean filesystem >> > > > shutdown. >> > > > That way after a crash we are guaranteed each inode will report new >> > > > i_version (the sum would probably have to look like "superblock crash >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible >> > > > i_version numbers we gave away but did not write to disk but still...). >> > > > Thoughts? >> > >> > How hard is this for filesystems to support? Do they need an on-disk >> > format change to keep track of the crash counter? Maybe not, maybe the >> > high bits of the i_version counters are all they need. >> > >> >> Yeah, I imagine we'd need an on-disk change for this unless there's >> something already present that we could use in place of a crash counter. > > We could consider using the current time instead. So, put the current > time (or time of last boot, or this inode's ctime, or something) in the > high bits of the change attribute, and keep the low bits as a counter. This is a very different proposal. I don't think Jan was suggesting that the i_version be split into two bit fields, one the change-counter and one the crash-counter. Rather, the crash-counter was multiplied by a large-number and added to the change-counter with the expectation that while not every change-counter landed on disk, at least 1 in every large-number would. So after each crash we effectively add large-number to the change-counter, and can be sure that number hasn't been used already. 
To store the crash-counter in each inode (which does appeal) you would need to be able to remove it before adding the new crash counter, and that requires bit-fields. Maybe there are enough bits. If you want to ensure read-only files can remain cached over a crash, then you would have to mark a file in some way on stable storage *before* allowing any change. e.g. you could use the lsb. Odd i_versions might have been changed recently and crash-count*large-number needs to be added. Even i_versions have not been changed recently and nothing need be added. If you want to change a file with an even i_version, you subtract crash-count*large-number from the i_version, then set lsb. This is written to stable storage before the change. If a file has not been changed for a while, you can add crash-count*large-number and clear lsb. The lsb of the i_version would be for internal use only. It would not be visible outside the filesystem. It feels a bit clunky, but I think it would work and is the best combination of Jan's idea and your requirement. The biggest cost would be switching to 'odd' before any changes, and the unknown is when does it make sense to switch to 'even'. NeilBrown
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Apr 04 2017, Dave Chinner wrote: > On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote: >> On Sun 02-04-17 09:05:26, Dave Chinner wrote: >> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote: >> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: >> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: >> > > > > Because if above is acceptable we could make reported i_version to >> > > > > be a sum >> > > > > of "superblock crash counter" and "inode i_version". We increment >> > > > > "superblock crash counter" whenever we detect unclean filesystem >> > > > > shutdown. >> > > > > That way after a crash we are guaranteed each inode will report new >> > > > > i_version (the sum would probably have to look like "superblock crash >> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing >> > > > > possible >> > > > > i_version numbers we gave away but did not write to disk but >> > > > > still...). >> > > > > Thoughts? >> > > >> > > How hard is this for filesystems to support? Do they need an on-disk >> > > format change to keep track of the crash counter? >> > >> > Yes. We'll need version counter in the superblock, and we'll need to >> > know what the increment semantics are. >> > >> > The big question is how do we know there was a crash? The only thing >> > a journalling filesystem knows at mount time is whether it is clean >> > or requires recovery. Filesystems can require recovery for many >> > reasons that don't involve a crash (e.g. root fs is never unmounted >> > cleanly, so always requires recovery). Further, some filesystems may >> > not even know there was a crash at mount time because their >> > architecture always leaves a consistent filesystem on disk (e.g. COW >> > filesystems) >> >> What filesystems can or cannot easily do obviously differs. Ext4 has a >> recovery flag set in superblock on RW mount/remount and cleared on >> umount/RO remount. > > Even this doesn't help. 
> A recent bug that was reported to the XFS list - turns out that systemd can't remount-ro the root filesystem successfully on shutdown because there are open write fds on the root filesystem when it attempts the remount. So it just reboots without a remount-ro. This uncovered a bug in grub ...

Filesystems could use register_reboot_notifier() to get a notification that even systemd cannot stuff-up. It could check for dirty data and, if there is none (which there shouldn't be if a sync happened), do a single write to disk to update the superblock (or a single write to each disk... or something). md does this, because getting the root device to be marked read-only is even harder than getting the root filesystem to be remounted read-only. NeilBrown
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, Mar 30, 2017 at 10:41:37AM +1100, Dave Chinner wrote: > On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote: > > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > > - It's durable; the above comparison still works if there were > > > > > > > reboots > > > > > > > between the two i_version checks. > > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > > atomically with the changes that caused them? What if a > > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll > > > > > > mark > > > > > > the inode dirty and I think that will end up with the new i_version > > > > > > at > > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > > > So you think the filesystem can provide the atomicity? In more > > > > > detail: > > > > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > > > "Yeah, there could be atomicity issues there." > > > > > > > > I think providing that level of atomicity may be difficult, though > > > > maybe there's some way to make the querying of i_version block until > > > > the inode update has been journalled? > > > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. 
> > > Until that you can see arbitrary combination of data & i_version after the > > > crash. We do take care to keep data and metadata in sync only when there > > > are security implications to that (like exposing uninitialized disk > > > blocks) > > > and if not, we are as lazy as we can to improve performance... > > > > > > > > > > Yeah, I think what we'll have to do here is ensure that those > > filesystems do an fsync prior to reporting the i_version getattr > > codepath. It's not pretty, but I don't see a real alternative. > > I think that's even more problematic. ->getattr currently runs > completely unlocked for performance reasons - it's racy w.r.t. to > ongoing modifications to begin with, so /nothing/ that is returned > to userspace via stat/statx can be guaranteed to be "coherent". > Linus will be very unhappy if you make his git workload (which is > /very/ stat heavy) run slower by adding any sort of locking in this > hot path. > > Even if we did put an fsync() into ->getattr() (and dealt with all > the locking issues that entails), by the time the statx syscall > returns to userspace the i_version value may not match the > data/metadata in the inode(*). IOWs, by the time i_version gets > to userspace, it is out of date and any use of it for data > versioning from userspace is going to be prone to race conditions. A slightly out-of-date i_version is fine, I think. It's just the reverse we want to avoid. E.g., assuming an i_version-supporting statx, if somebody called statx and then read, and got the new i_version followed by the old data, that would cause problems. --b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote: > On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote: > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > > Because if above is acceptable we could make reported i_version to be a > > > > sum > > > > of "superblock crash counter" and "inode i_version". We increment > > > > "superblock crash counter" whenever we detect unclean filesystem > > > > shutdown. > > > > That way after a crash we are guaranteed each inode will report new > > > > i_version (the sum would probably have to look like "superblock crash > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > > i_version numbers we gave away but did not write to disk but still...). > > > > Thoughts? > > > > How hard is this for filesystems to support? Do they need an on-disk > > format change to keep track of the crash counter? Maybe not, maybe the > > high bits of the i_version counters are all they need. > > > > Yeah, I imagine we'd need a on-disk change for this unless there's > something already present that we could use in place of a crash counter. We could consider using the current time instead. So, put the current time (or time of last boot, or this inode's ctime, or something) in the high bits of the change attribute, and keep the low bits as a counter. Then as long as we trust our clock, we're no longer at risk of reusing an i_version value. > > I guess we just want to have some back-of-the-envelope estimates of > > maximum number of i_version increments possible between crashes and > > maximum number of crashes possible over lifetime of a filesystem, to > > decide how to split up the bits. > > > > I wonder if we could get away with using the new crash counter only for > > *new* values of the i_version? 
After a crash, use the on disk i_version > > as is, and put off using the new crash counter until the next time the > > file's modified. > > > > That sounds difficult to get right. Suppose I have an inode that has not > been updated in a long time. Someone writes to it and then queries the > i_version. How do I know whether there were crashes since the last time > I updated it? Or am I misunderstanding what you're proposing here? I believe Jan was suggesting that we keep the i_version as-is on disk but combine it with the crash counter when it's queried. I was suggesting instead that on write, when we bump the i_version, we replace the simple increment by something that increments *and* sticks the current crash counter (or maybe just a time) in the high bits. And that combination will be what goes in i_version and is written to disk. That guarantees that we never reuse an old value when we increment. It also avoids having to invalidate absolutely every cache, even of completely static files. Clients could still see the change attribute go backwards, though, if they query a file with dirty data and the server crashes before it writes out the new change attribute. Also, they could fail to detect data that reverted after boot if they cache data while the file's dirty on the server side, and the server crash preserves the i_version update but not the new data. Presumably both are possible already for ctime (depending on the filesystem), and I'm not convinced they're a big problem. In both cases the reading client is caching while somebody's still writing, and as long as the writer comes back and finishes its job, readers will thereafter see the right thing. So I think it's adequate for close-to-open. --b.
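A minimal sketch of this bump-time scheme follows. The 40-bit low-counter split and the function name are hypothetical (the thread deliberately leaves the split open), but the mechanics match the description: the high bits are overwritten with the current crash counter (or a boot time) on every bump, while untouched files keep their exact old value.

```c
#include <assert.h>
#include <stdint.h>

#define COUNTER_BITS 40   /* illustrative split; sizing is discussed but not fixed */
#define COUNTER_MASK ((1ULL << COUNTER_BITS) - 1)

/* On each modification, increment the low-bits counter and stamp the
 * current crash counter into the high bits. The combined value is what
 * goes into i_version and onto disk. */
static uint64_t bump_i_version(uint64_t old, uint64_t crash_counter)
{
    uint64_t counter = ((old & COUNTER_MASK) + 1) & COUNTER_MASK;
    return (crash_counter << COUNTER_BITS) | counter;
}
```

The point of the scheme: a value handed out but lost before a crash can never be re-issued, because the first post-crash bump carries the new crash counter in the high bits, yet files that are never written after the crash keep their exact on-disk value, so their caches stay valid.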
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Apr 04, 2017 at 10:34:14PM +1000, Dave Chinner wrote: > On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote: > > What filesystems can or cannot easily do obviously differs. Ext4 has a > > recovery flag set in superblock on RW mount/remount and cleared on > > umount/RO remount. > > Even this doesn't help. A recent bug that was reported to the XFS > list - turns out that systemd can't remount-ro the root > filesystem sucessfully on shutdown because there are open write fds > on the root filesystem when it attempts the remount. So it just > reboots without a remount-ro. I'd certainly rather not invalidate caches on *every* boot. On the other hand, if the only cases involve the root filesystem, then from the point of view of NFS, we probably don't care much. > > This flag being set on mount would imply incrementing the crash > > counter. It should be pretty easy for each filesystem to implement > > such flag and the counter but I agree it requires an on-disk > > format change. > > Yup, anything we want that is persistent and consistent across > filesystems will need on-disk format changes. Hence we need a solid > specification first, not to mention tests to validate correct > behaviour across all filesystems in xfstests... For xfstests we'll need a way to get i_version (patch it into statx, I guess?). Ideally we'd have a way to test behavior across crashes too, any advice there? --b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote: > On Sun 02-04-17 09:05:26, Dave Chinner wrote: > > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote: > > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > > > Because if above is acceptable we could make reported i_version to be > > > > > a sum > > > > > of "superblock crash counter" and "inode i_version". We increment > > > > > "superblock crash counter" whenever we detect unclean filesystem > > > > > shutdown. > > > > > That way after a crash we are guaranteed each inode will report new > > > > > i_version (the sum would probably have to look like "superblock crash > > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > > > i_version numbers we gave away but did not write to disk but > > > > > still...). > > > > > Thoughts? > > > > > > How hard is this for filesystems to support? Do they need an on-disk > > > format change to keep track of the crash counter? > > > > Yes. We'll need version counter in the superblock, and we'll need to > > know what the increment semantics are. > > > > The big question is how do we know there was a crash? The only thing > > a journalling filesystem knows at mount time is whether it is clean > > or requires recovery. Filesystems can require recovery for many > > reasons that don't involve a crash (e.g. root fs is never unmounted > > cleanly, so always requires recovery). Further, some filesystems may > > not even know there was a crash at mount time because their > > architecture always leaves a consistent filesystem on disk (e.g. COW > > filesystems) > > What filesystems can or cannot easily do obviously differs. Ext4 has a > recovery flag set in superblock on RW mount/remount and cleared on > umount/RO remount. Even this doesn't help. 
A recent bug that was reported to the XFS list - turns out that systemd can't remount-ro the root filesystem successfully on shutdown because there are open write fds on the root filesystem when it attempts the remount. So it just reboots without a remount-ro. This uncovered a bug in grub in that it (still!) thinks sync(1) is sufficient to get all the metadata that points to a kernel image onto disk in places it can read. XFS, like ext4, leaves it in the journal, and so the system then fails to boot: because systemd didn't remount-ro the root fs, the journal was never flushed before reboot, so grub can't find the kernel and everything fails. > This flag being set on mount would imply incrementing the crash > counter. It should be pretty easy for each filesystem to implement > such flag and the counter but I agree it requires an on-disk > format change. Yup, anything we want that is persistent and consistent across filesystems will need on-disk format changes. Hence we need a solid specification first, not to mention tests to validate correct behaviour across all filesystems in xfstests... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Sun 02-04-17 09:05:26, Dave Chinner wrote: > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote: > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > > Because if above is acceptable we could make reported i_version to be a > > > > sum > > > > of "superblock crash counter" and "inode i_version". We increment > > > > "superblock crash counter" whenever we detect unclean filesystem > > > > shutdown. > > > > That way after a crash we are guaranteed each inode will report new > > > > i_version (the sum would probably have to look like "superblock crash > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > > i_version numbers we gave away but did not write to disk but still...). > > > > Thoughts? > > > > How hard is this for filesystems to support? Do they need an on-disk > > format change to keep track of the crash counter? > > Yes. We'll need version counter in the superblock, and we'll need to > know what the increment semantics are. > > The big question is how do we know there was a crash? The only thing > a journalling filesystem knows at mount time is whether it is clean > or requires recovery. Filesystems can require recovery for many > reasons that don't involve a crash (e.g. root fs is never unmounted > cleanly, so always requires recovery). Further, some filesystems may > not even know there was a crash at mount time because their > architecture always leaves a consistent filesystem on disk (e.g. COW > filesystems) What filesystems can or cannot easily do obviously differs. Ext4 has a recovery flag set in superblock on RW mount/remount and cleared on umount/RO remount. This flag being set on mount would imply incrementing the crash counter. It should be pretty easy for each filesystem to implement such flag and the counter but I agree it requires an on-disk format change. Honza -- Jan Kara SUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote: > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > Because if above is acceptable we could make reported i_version to be a > > > sum > > > of "superblock crash counter" and "inode i_version". We increment > > > "superblock crash counter" whenever we detect unclean filesystem shutdown. > > > That way after a crash we are guaranteed each inode will report new > > > i_version (the sum would probably have to look like "superblock crash > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > i_version numbers we gave away but did not write to disk but still...). > > > Thoughts? > > How hard is this for filesystems to support? Do they need an on-disk > format change to keep track of the crash counter? Yes. We'll need version counter in the superblock, and we'll need to know what the increment semantics are. The big question is how do we know there was a crash? The only thing a journalling filesystem knows at mount time is whether it is clean or requires recovery. Filesystems can require recovery for many reasons that don't involve a crash (e.g. root fs is never unmounted cleanly, so always requires recovery). Further, some filesystems may not even know there was a crash at mount time because their architecture always leaves a consistent filesystem on disk (e.g. COW filesystems) > I wonder if repeated crashes can lead to any odd corner cases. Without defined, locked-down behaviour of the superblock counter, corner cases almost certainly will exist... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On 03/30/2017 09:35 PM, Jeff Layton wrote: <> > Yeah, I imagine we'd need an on-disk change for this unless there's > something already present that we could use in place of a crash counter. > Perhaps we can use s_mtime and/or s_wtime in some way, I'm not sure what the parallel for that is in xfs. s_mtime - time-of-last mount s_wtime - time-of-last mount, umount, freeze, unfreeze, remount, ... Of course you'll need a per-FS vector to access that. But this will need some math foo to get the bits compacted correctly. Just a thought. Thanks Boaz
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote: > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > > Hum, so are we fine if i_version just changes (increases) for all inodes > > > after a server crash? If I understand its use right, it would mean > > > invalidation of all client's caches but that is not such a big deal given > > > how frequent server crashes should be, right? > > Even if it's rare, it may be really painful when all your clients are > forced to throw out and repopulate their caches after a crash. But, > yes, maybe we can live with it. > Yeah, assuming that normal reboots wouldn't cause this, then I don't see it as being too bad. > > > Because if above is acceptable we could make reported i_version to be a > > > sum > > > of "superblock crash counter" and "inode i_version". We increment > > > "superblock crash counter" whenever we detect unclean filesystem shutdown. > > > That way after a crash we are guaranteed each inode will report new > > > i_version (the sum would probably have to look like "superblock crash > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > > i_version numbers we gave away but did not write to disk but still...). > > > Thoughts? > > How hard is this for filesystems to support? Do they need an on-disk > format change to keep track of the crash counter? Maybe not, maybe the > high bits of the i_version counters are all they need. > Yeah, I imagine we'd need an on-disk change for this unless there's something already present that we could use in place of a crash counter. > > That does sound like a good idea. This is a 64 bit value, so we should > > be able to carve out some upper bits for a crash counter without risking > > wrapping. > > > > The other constraint here is that we'd like any later version of the > > counter to be larger than any earlier value that was handed out. > > I think this idea would still satisfy that.
> > I guess we just want to have some back-of-the-envelope estimates of > maximum number of i_version increments possible between crashes and > maximum number of crashes possible over lifetime of a filesystem, to > decide how to split up the bits. > > I wonder if we could get away with using the new crash counter only for > *new* values of the i_version? After a crash, use the on disk i_version > as is, and put off using the new crash counter until the next time the > file's modified. > That sounds difficult to get right. Suppose I have an inode that has not been updated in a long time. Someone writes to it and then queries the i_version. How do I know whether there were crashes since the last time I updated it? Or am I misunderstanding what you're proposing here? > That would still eliminate the risk of accidental reuse of an old > i_version value. It still leaves some cases where the client could fail > to notice an update indefinitely. All these cases I think have to > assume that a writer made some changes that it failed to ever sync, so > as long as we care only about close-to-open semantics perhaps those > cases don't matter. > > I wonder if repeated crashes can lead to any odd corner cases. > -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote: > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > > Hum, so are we fine if i_version just changes (increases) for all inodes > > after a server crash? If I understand its use right, it would mean > > invalidation of all client's caches but that is not such a big deal given > > how frequent server crashes should be, right? Even if it's rare, it may be really painful when all your clients are forced to throw out and repopulate their caches after a crash. But, yes, maybe we can live with it. > > Because if above is acceptable we could make reported i_version to be a sum > > of "superblock crash counter" and "inode i_version". We increment > > "superblock crash counter" whenever we detect unclean filesystem shutdown. > > That way after a crash we are guaranteed each inode will report new > > i_version (the sum would probably have to look like "superblock crash > > counter" * 65536 + "inode i_version" so that we avoid reusing possible > > i_version numbers we gave away but did not write to disk but still...). > > Thoughts? How hard is this for filesystems to support? Do they need an on-disk format change to keep track of the crash counter? Maybe not, maybe the high bits of the i_version counters are all they need. > That does sound like a good idea. This is a 64 bit value, so we should > be able to carve out some upper bits for a crash counter without risking > wrapping. > > The other constraint here is that we'd like any later version of the > counter to be larger than any earlier value that was handed out. I think > this idea would still satisfy that. I guess we just want to have some back-of-the-envelope estimates of maximum number of i_version increments possible between crashes and maximum number of crashes possible over lifetime of a filesystem, to decide how to split up the bits. I wonder if we could get away with using the new crash counter only for *new* values of the i_version? 
After a crash, use the on disk i_version as is, and put off using the new crash counter until the next time the file's modified. That would still eliminate the risk of accidental reuse of an old i_version value. It still leaves some cases where the client could fail to notice an update indefinitely. All these cases I think have to assume that a writer made some changes that it failed to ever sync, so as long as we care only about close-to-open semantics perhaps those cases don't matter. I wonder if repeated crashes can lead to any odd corner cases. --b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, 2017-03-30 at 10:41 +1100, Dave Chinner wrote: > On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote: > > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > > - It's durable; the above comparison still works if there were > > > > > > > reboots > > > > > > > between the two i_version checks. > > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > > atomically with the changes that caused them? What if a > > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll > > > > > > mark > > > > > > the inode dirty and I think that will end up with the new i_version > > > > > > at > > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > > > So you think the filesystem can provide the atomicity? In more > > > > > detail: > > > > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > > > "Yeah, there could be atomicity issues there." > > > > > > > > I think providing that level of atomicity may be difficult, though > > > > maybe there's some way to make the querying of i_version block until > > > > the inode update has been journalled? > > > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. 
> > > Until that you can see arbitrary combination of data & i_version after the > > > crash. We do take care to keep data and metadata in sync only when there > > > are security implications to that (like exposing uninitialized disk > > > blocks) > > > and if not, we are as lazy as we can to improve performance... > > > > > > > > > > Yeah, I think what we'll have to do here is ensure that those > > filesystems do an fsync prior to reporting the i_version getattr > > codepath. It's not pretty, but I don't see a real alternative. > > I think that's even more problematic. ->getattr currently runs > completely unlocked for performance reasons - it's racy w.r.t. to > ongoing modifications to begin with, so /nothing/ that is returned > to userspace via stat/statx can be guaranteed to be "coherent". > Linus will be very unhappy if you make his git workload (which is > /very/ stat heavy) run slower by adding any sort of locking in this > hot path. > > Even if we did put an fsync() into ->getattr() (and dealt with all > the locking issues that entails), by the time the statx syscall > returns to userspace the i_version value may not match the > data/metadata in the inode(*). IOWs, by the time i_version gets > to userspace, it is out of date and any use of it for data > versioning from userspace is going to be prone to race conditions. > > Cheers, > > Dave. > > (*) fiemap has exactly the same "stale the moment internal fs > locks are released" race conditions, which is why it cannot safely > be used for mapping holes when copying file data > FWIW, I'm not terribly worried about atomicity for all of the reasons you describe. My main concern is reusing an i_version value that has already been handed out when the inode is now in an entirely different state. To that end, I was only considering doing an fsync iff STATX_VERSION was requested. If it wasn't we wouldn't need to do one. 
A lot of ->getattr implementations already call filemap_write_and_wait, but you're correct that flushing out the metadata is a different matter. It'd be nice to avoid that if we can. -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote: > On Wed 29-03-17 13:54:31, Jeff Layton wrote: > > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > > - It's durable; the above comparison still works if there were > > > > > > > reboots > > > > > > > between the two i_version checks. > > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > > atomically with the changes that caused them? What if a > > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll > > > > > > mark > > > > > > the inode dirty and I think that will end up with the new i_version > > > > > > at > > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > > > So you think the filesystem can provide the atomicity? In more > > > > > detail: > > > > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > > > "Yeah, there could be atomicity issues there." > > > > > > > > I think providing that level of atomicity may be difficult, though > > > > maybe there's some way to make the querying of i_version block until > > > > the inode update has been journalled? > > > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. > > > Until that you can see arbitrary combination of data & i_version after the > > > crash. 
We do take care to keep data and metadata in sync only when there > > > are security implications to that (like exposing uninitialized disk > > > blocks) > > > and if not, we are as lazy as we can to improve performance... > > > > > > > > > > Yeah, I think what we'll have to do here is ensure that those > > filesystems do an fsync prior to reporting the i_version getattr > > codepath. It's not pretty, but I don't see a real alternative. > > Hum, so are we fine if i_version just changes (increases) for all inodes > after a server crash? If I understand its use right, it would mean > invalidation of all client's caches but that is not such a big deal given > how frequent server crashes should be, right? > > Because if above is acceptable we could make reported i_version to be a sum > of "superblock crash counter" and "inode i_version". We increment > "superblock crash counter" whenever we detect unclean filesystem shutdown. > That way after a crash we are guaranteed each inode will report new > i_version (the sum would probably have to look like "superblock crash > counter" * 65536 + "inode i_version" so that we avoid reusing possible > i_version numbers we gave away but did not write to disk but still...). > Thoughts? > That does sound like a good idea. This is a 64 bit value, so we should be able to carve out some upper bits for a crash counter without risking wrapping. The other constraint here is that we'd like any later version of the counter to be larger than any earlier value that was handed out. I think this idea would still satisfy that. -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed 29-03-17 13:54:31, Jeff Layton wrote: > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > - It's durable; the above comparison still works if there were > > > > > > reboots > > > > > > between the two i_version checks. > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > atomically with the changes that caused them? What if a > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > > the inode dirty and I think that will end up with the new i_version at > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > "Yeah, there could be atomicity issues there." > > > > > > I think providing that level of atomicity may be difficult, though > > > maybe there's some way to make the querying of i_version block until > > > the inode update has been journalled? > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. > > Until that you can see arbitrary combination of data & i_version after the > > crash. 
We do take care to keep data and metadata in sync only when there > > are security implications to that (like exposing uninitialized disk blocks) > > and if not, we are as lazy as we can to improve performance... > > > > > > Yeah, I think what we'll have to do here is ensure that those > filesystems do an fsync prior to reporting the i_version getattr > codepath. It's not pretty, but I don't see a real alternative. Hum, so are we fine if i_version just changes (increases) for all inodes after a server crash? If I understand its use right, it would mean invalidation of all client's caches but that is not such a big deal given how frequent server crashes should be, right? Because if above is acceptable we could make reported i_version to be a sum of "superblock crash counter" and "inode i_version". We increment "superblock crash counter" whenever we detect unclean filesystem shutdown. That way after a crash we are guaranteed each inode will report new i_version (the sum would probably have to look like "superblock crash counter" * 65536 + "inode i_version" so that we avoid reusing possible i_version numbers we gave away but did not write to disk but still...). Thoughts? Honza -- Jan KaraSUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote: > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > > - It's durable; the above comparison still works if there were > > > > > > reboots > > > > > > between the two i_version checks. > > > > > > - I don't know how realistic this is--we may need to figure out > > > > > > if there's a weaker guarantee that's still useful. Do > > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > > atomically with the changes that caused them? What if a > > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > > the inode dirty and I think that will end up with the new i_version at > > > > > least being journalled before __mark_inode_dirty returns. > > > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > > > "Yeah, there could be atomicity issues there." > > > > > > I think providing that level of atomicity may be difficult, though > > > maybe there's some way to make the querying of i_version block until > > > the inode update has been journalled? > > > > Just to complement what Dave said from ext4 side - similarly as with XFS > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. > > Until that you can see arbitrary combination of data & i_version after the > > crash. 
We do take care to keep data and metadata in sync only when there > > are security implications to that (like exposing uninitialized disk blocks) > > and if not, we are as lazy as we can to improve performance... > > > > > > Yeah, I think what we'll have to do here is ensure that those > filesystems do an fsync prior to reporting the i_version getattr > codepath. It's not pretty, but I don't see a real alternative. I think that's even more problematic. ->getattr currently runs completely unlocked for performance reasons - it's racy w.r.t. to ongoing modifications to begin with, so /nothing/ that is returned to userspace via stat/statx can be guaranteed to be "coherent". Linus will be very unhappy if you make his git workload (which is /very/ stat heavy) run slower by adding any sort of locking in this hot path. Even if we did put an fsync() into ->getattr() (and dealt with all the locking issues that entails), by the time the statx syscall returns to userspace the i_version value may not match the data/metadata in the inode(*). IOWs, by the time i_version gets to userspace, it is out of date and any use of it for data versioning from userspace is going to be prone to race conditions. Cheers, Dave. (*) fiemap has exactly the same "stale the moment internal fs locks are released" race conditions, which is why it cannot safely be used for mapping holes when copying file data -- Dave Chinner da...@fromorbit.com
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote: > On Tue 21-03-17 14:46:53, Jeff Layton wrote: > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > - It's durable; the above comparison still works if there were reboots > > > > > between the two i_version checks. > > > > > - I don't know how realistic this is--we may need to figure out > > > > > if there's a weaker guarantee that's still useful. Do > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > atomically with the changes that caused them? What if a > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > the inode dirty and I think that will end up with the new i_version at > > > > least being journalled before __mark_inode_dirty returns. > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > "Yeah, there could be atomicity issues there." > > > > I think providing that level of atomicity may be difficult, though > > maybe there's some way to make the querying of i_version block until > > the inode update has been journalled? > > Just to complement what Dave said from ext4 side - similarly as with XFS > ext4 doesn't guarantee atomicity unless fsync() has completed on the file. > Until that you can see arbitrary combination of data & i_version after the > crash. We do take care to keep data and metadata in sync only when there > are security implications to that (like exposing uninitialized disk blocks) > and if not, we are as lazy as we can to improve performance... 
Yeah, I think what we'll have to do here is ensure that those filesystems do an fsync prior to reporting the i_version in the getattr codepath. It's not pretty, but I don't see a real alternative. -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue 21-03-17 14:46:53, Jeff Layton wrote: > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > - It's durable; the above comparison still works if there were reboots > > > > between the two i_version checks. > > > > - I don't know how realistic this is--we may need to figure out > > > > if there's a weaker guarantee that's still useful. Do > > > > filesystems actually make ctime/mtime/i_version changes > > > > atomically with the changes that caused them? What if a > > > > change attribute is exposed to an NFS client but doesn't make > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > the inode dirty and I think that will end up with the new i_version at > > > least being journalled before __mark_inode_dirty returns. > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > Sorry, I hit send too quickly. That should have read: > > "Yeah, there could be atomicity issues there." > > I think providing that level of atomicity may be difficult, though > maybe there's some way to make the querying of i_version block until > the inode update has been journalled? Just to complement what Dave said from ext4 side - similarly as with XFS ext4 doesn't guarantee atomicity unless fsync() has completed on the file. Until that you can see arbitrary combination of data & i_version after the crash. We do take care to keep data and metadata in sync only when there are security implications to that (like exposing uninitialized disk blocks) and if not, we are as lazy as we can to improve performance... Honza -- Jan KaraSUSE Labs, CR
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, 2017-03-22 at 08:45 +1100, Dave Chinner wrote: > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > - It's durable; the above comparison still works if there were reboots > > > between the two i_version checks. > > > - I don't know how realistic this is--we may need to figure out > > > if there's a weaker guarantee that's still useful. Do > > > filesystems actually make ctime/mtime/i_version changes > > > atomically with the changes that caused them? What if a > > > change attribute is exposed to an NFS client but doesn't make > > > it to disk, and then that value is reused after reboot? > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > the inode dirty and I think that will end up with the new i_version at > > least being journalled before __mark_inode_dirty returns. > > The change may be journalled, but it isn't guaranteed stable until > fsync is run on the inode. > > NFS server operations commit the metadata changed by a modification > through ->commit_metadata or sync_inode_metadata() before the > response is sent back to the client, hence guaranteeing that > i_version changes through the NFS server are stable and durable. > > This is not the case for normal operations done through the POSIX > API - the journalling is asynchronous and the only durability > guarantees are provided by fsync() > Ahh ok, I missed that...thanks. I think we'll have a hard time making this fully atomic. We may end up having to settle for something less (and doing our best to warn users of that possibility). One idea might be to tie the behavior to AT_FORCE/DONT_SYNC. In the don't sync case, allow the kernel to hand out the i_version without syncing it to disk. In the FORCE_SYNC case, do an fsync internally before returning. 
> > That said, I suppose it is possible for us to bump the counter, hand > > that new counter value out to a NFS client and then the box crashes > > before it makes it to the journal. > > Yup, this has aways been a problem when you mix posix applications > running on the NFS server modifying the same files as the NFS > clients are accessing and requiring synchronisation. > > > Not sure how big a problem that really is. > > This coherency problem has always existed on the server side... > Yes. I don't think this patchset makes anything worse in this regard. We will need well-defined semantics here before i_version can be exposed to userland via statx however. -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, 2017-03-21 at 15:13 -0400, J. Bruce Fields wrote: > On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote: > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote: > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote: > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote: > > > > > - It's durable; the above comparison still works if there were reboots > > > > > between the two i_version checks. > > > > > - I don't know how realistic this is--we may need to figure out > > > > > if there's a weaker guarantee that's still useful. Do > > > > > filesystems actually make ctime/mtime/i_version changes > > > > > atomically with the changes that caused them? What if a > > > > > change attribute is exposed to an NFS client but doesn't make > > > > > it to disk, and then that value is reused after reboot? > > > > > > > > > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark > > > > the inode dirty and I think that will end up with the new i_version at > > > > least being journalled before __mark_inode_dirty returns. > > > > > > So you think the filesystem can provide the atomicity? In more detail: > > > > > > > Sorry, I hit send too quickly. That should have read: > > > > "Yeah, there could be atomicity issues there." > > > > I think providing that level of atomicity may be difficult, though > > maybe there's some way to make the querying of i_version block until > > the inode update has been journalled? > > No idea. Anyway, I'd like to figure out some reasonable requirement > that we can document. > > > > > > - if I write to a file, a simultaneous reader should see either > > > (old data, old i_version) or (new data, new i_version), not a > > > combination of the two. > > > - ditto for metadata modifications. > > > - the same should be true if there's a crash. 
> > > > > > (If that's not possible, then I think we could live with a brief window > > > of (new data, old i_version) as long as it doesn't persist beyond sync?) > > > > > > > That said, I suppose it is possible for us to bump the counter, hand > > > > that new counter value out to a NFS client and then the box crashes > > > > before it makes it to the journal. > > > > > > > > Not sure how big a problem that really is. > > > > > > The other case I was wondering about may have been unclear. Represent > > > the state of a file by a (data, i_version) pair. Say: > > > > > > - file is modified from (F, V) to (F', V+1). > > > - client reads and caches (F', V+1). > > > - server crashes before writeback, so disk still has (F, V). > > > - after restart, someone else modifies file to (F'', V+1). > > > - original client revalidates its cache, sees V+1, concludes > > > file data is still F'. > > > > > > This may not cause a real problem for clients depending only on > > > traditional NFS close-to-open semantics. > > > > > > > > > > No, I think that is a legitimate problem. > > > > That said, after F'', the mtime would almost certainly be different > > from the time after F', and that would likely be enough to prevent > > confusion in NFS clients. > > Oh, good point. So, may be worth saying that anyone wanting to make > sense of these across reboot should compare times as well (maybe that > should be in nfs rfc's too). I think that should be ctime not mtime, > though? > Yes, it might be worth a mention there. IIRC, it does mention that you shouldn't just look at a single attribute for cache validation purposes, but the wording is a bit vague. I can't find the section at the moment though. The more I think about it though, simply ensuring that we don't publish a new change attr until the inode update has hit the journal may be the best we can do. I'd have to think about how to implement that though. -- Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful. Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them? What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
>
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

The change may be journalled, but it isn't guaranteed stable until fsync
is run on the inode.

NFS server operations commit the metadata changed by a modification
through ->commit_metadata or sync_inode_metadata() before the response
is sent back to the client, hence guaranteeing that i_version changes
through the NFS server are stable and durable. This is not the case for
normal operations done through the POSIX API - the journalling is
asynchronous and the only durability guarantees are provided by fsync().

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.

Yup, this has always been a problem when you mix POSIX applications
running on the NFS server modifying the same files as the NFS clients
are accessing and requiring synchronisation.

> Not sure how big a problem that really is.

This coherency problem has always existed on the server side...

Cheers,
Dave.
-- 
Dave Chinner
da...@fromorbit.com
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > >   between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > >   if there's a weaker guarantee that's still useful. Do
> > > >   filesystems actually make ctime/mtime/i_version changes
> > > >   atomically with the changes that caused them? What if a
> > > >   change attribute is exposed to an NFS client but doesn't make
> > > >   it to disk, and then that value is reused after reboot?
> > >
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> >
> > So you think the filesystem can provide the atomicity? In more detail:
>
> Sorry, I hit send too quickly. That should have read:
>
> "Yeah, there could be atomicity issues there."
>
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?

No idea. Anyway, I'd like to figure out some reasonable requirement
that we can document.

> > - if I write to a file, a simultaneous reader should see either
> >   (old data, old i_version) or (new data, new i_version), not a
> >   combination of the two.
> > - ditto for metadata modifications.
> > - the same should be true if there's a crash.
> >
> > (If that's not possible, then I think we could live with a brief window
> > of (new data, old i_version) as long as it doesn't persist beyond sync?)
>
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > >
> > > Not sure how big a problem that really is.
> >
> > The other case I was wondering about may have been unclear. Represent
> > the state of a file by a (data, i_version) pair. Say:
> >
> > - file is modified from (F, V) to (F', V+1).
> > - client reads and caches (F', V+1).
> > - server crashes before writeback, so disk still has (F, V).
> > - after restart, someone else modifies file to (F'', V+1).
> > - original client revalidates its cache, sees V+1, concludes
> >   file data is still F'.
> >
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
>
> No, I think that is a legitimate problem.
>
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.

Oh, good point. So, may be worth saying that anyone wanting to make
sense of these across reboot should compare times as well (maybe that
should be in nfs rfc's too). I think that should be ctime not mtime,
though?

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful. Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them? What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
>
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

So you think the filesystem can provide the atomicity? In more detail:

- if I write to a file, a simultaneous reader should see either
  (old data, old i_version) or (new data, new i_version), not a
  combination of the two.
- ditto for metadata modifications.
- the same should be true if there's a crash.

(If that's not possible, then I think we could live with a brief window
of (new data, old i_version) as long as it doesn't persist beyond sync?)

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
>
> Not sure how big a problem that really is.

The other case I was wondering about may have been unclear. Represent
the state of a file by a (data, i_version) pair. Say:

- file is modified from (F, V) to (F', V+1).
- client reads and caches (F', V+1).
- server crashes before writeback, so disk still has (F, V).
- after restart, someone else modifies file to (F'', V+1).
- original client revalidates its cache, sees V+1, concludes
  file data is still F'.

This may not cause a real problem for clients depending only on
traditional NFS close-to-open semantics.

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - It's durable; the above comparison still works if there were reboots
> > >   between the two i_version checks.
> > > - I don't know how realistic this is--we may need to figure out
> > >   if there's a weaker guarantee that's still useful. Do
> > >   filesystems actually make ctime/mtime/i_version changes
> > >   atomically with the changes that caused them? What if a
> > >   change attribute is exposed to an NFS client but doesn't make
> > >   it to disk, and then that value is reused after reboot?
> >
> > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > the inode dirty and I think that will end up with the new i_version at
> > least being journalled before __mark_inode_dirty returns.
>
> So you think the filesystem can provide the atomicity? In more detail:

Sorry, I hit send too quickly. That should have read:

"Yeah, there could be atomicity issues there."

I think providing that level of atomicity may be difficult, though
maybe there's some way to make the querying of i_version block until
the inode update has been journalled?

> - if I write to a file, a simultaneous reader should see either
>   (old data, old i_version) or (new data, new i_version), not a
>   combination of the two.
> - ditto for metadata modifications.
> - the same should be true if there's a crash.
>
> (If that's not possible, then I think we could live with a brief window
> of (new data, old i_version) as long as it doesn't persist beyond sync?)
>
> > That said, I suppose it is possible for us to bump the counter, hand
> > that new counter value out to a NFS client and then the box crashes
> > before it makes it to the journal.
> >
> > Not sure how big a problem that really is.
>
> The other case I was wondering about may have been unclear. Represent
> the state of a file by a (data, i_version) pair. Say:
>
> - file is modified from (F, V) to (F', V+1).
> - client reads and caches (F', V+1).
> - server crashes before writeback, so disk still has (F, V).
> - after restart, someone else modifies file to (F'', V+1).
> - original client revalidates its cache, sees V+1, concludes
>   file data is still F'.
>
> This may not cause a real problem for clients depending only on
> traditional NFS close-to-open semantics.

No, I think that is a legitimate problem.

That said, after F'', the mtime would almost certainly be different
from the time after F', and that would likely be enough to prevent
confusion in NFS clients. Applications that are just looking at
i_version via statx() could get confused though.
-- 
Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:37:04PM -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - NFS doesn't actually require that it increases, but I think it
> > >   should. I assume 64 bits means we don't need a discussion of
> > >   wraparound.
> >
> > I thought NFS spec required that you be able to recognize old change
> > attributes so that they can be discarded. I could be wrong here though.
> > I'd have to go back and look through the spec to be sure.
>
> https://tools.ietf.org/html/rfc7862#section-10

So, I'm suggesting we implement this one:

   NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR: The change attribute value MUST
   monotonically increase for every atomic change to the file
   attributes, data, or directory contents.

It may be a slight lie--after your patches we wouldn't actually increase
"for every atomic change". I think that's OK.

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - NFS doesn't actually require that it increases, but I think it
> >   should. I assume 64 bits means we don't need a discussion of
> >   wraparound.
>
> I thought NFS spec required that you be able to recognize old change
> attributes so that they can be discarded. I could be wrong here though.
> I'd have to go back and look through the spec to be sure.

https://tools.ietf.org/html/rfc7862#section-10

--b.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > > To me, the interesting question is whether this allows us to turn on
> > > i_version updates by default on xfs and ext4.
> >
> > XFS v5 file systems have it on by default.
>
> Great, thanks.
>
> > Although we'll still need to agree on the exact semantics of i_version
> > before it's going to be useful.
>
> Once it's figured out maybe we should write it up for a manpage that
> could be used if statx starts exposing it to userspace.
>
> A first attempt:
>
> - It's a u64.
>
> - It works for regular files and directories. (What about symlinks or
>   other special types?)
>
> - It changes between two checks if and only if there were intervening
>   data or metadata changes. The change will always be an increase, but
>   the amount of the increase is meaningless.
>   - NFS doesn't actually require that it increases, but I think it
>     should. I assume 64 bits means we don't need a discussion of
>     wraparound.

I thought NFS spec required that you be able to recognize old change
attributes so that they can be discarded. I could be wrong here though.
I'd have to go back and look through the spec to be sure.

>   - AFS wants an actual counter: if you get i_version X, then
>     write twice, then get i_version X+2, you're allowed to assume
>     your writes were the only modifications. Let's ignore this
>     for now. In the future if someone explains how to count
>     operations, then we can extend the interface to tell the
>     caller it can get those extra semantics.
>
> - It's durable; the above comparison still works if there were reboots
>   between the two i_version checks.
>   - I don't know how realistic this is--we may need to figure out
>     if there's a weaker guarantee that's still useful. Do
>     filesystems actually make ctime/mtime/i_version changes
>     atomically with the changes that caused them? What if a
>     change attribute is exposed to an NFS client but doesn't make
>     it to disk, and then that value is reused after reboot?

Yeah, there could be atomicity there. If we bump i_version, we'll mark
the inode dirty and I think that will end up with the new i_version at
least being journalled before __mark_inode_dirty returns.

That said, I suppose it is possible for us to bump the counter, hand
that new counter value out to a NFS client and then the box crashes
before it makes it to the journal.

Not sure how big a problem that really is.

> Am I missing any issues?

No, I think you have it covered, and that's pretty much exactly what I
had in mind as far as semantics go. Thanks for writing it up!
-- 
Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote: > On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote: > > To me, the interesting question is whether this allows us to turn on > > i_version updates by default on xfs and ext4. > > XFS v5 file systems have it on by default. Great, thanks. > Although we'll still need to agree on the exact semantics of i_version > before it's going to be useful. Once it's figured out maybe we should write it up for a manpage that could be used if statx starts exposing it to userspace. A first attempt: - It's a u64. - It works for regular files and directories. (What about symlinks or other special types?) - It changes between two checks if and only if there were intervening data or metadata changes. The change will always be an increase, but the amount of the increase is meaningless. - NFS doesn't actually require that it increases, but I think it should. I assume 64 bits means we don't need a discussion of wraparound. - AFS wants an actual counter: if you get i_version X, then write twice, then get i_version X+2, you're allowed to assume your writes were the only modifications. Let's ignore this for now. In the future if someone explains how to count operations, then we can extend the interface to tell the caller it can get those extra semantics. - It's durable; the above comparison still works if there were reboots between the two i_version checks. - I don't know how realistic this is--we may need to figure out if there's a weaker guarantee that's still useful. Do filesystems actually make ctime/mtime/i_version changes atomically with the changes that caused them? What if a change attribute is exposed to an NFS client but doesn't make it to disk, and then that value is reused after reboot? Am I missing any issues? --b.
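On the wraparound point above, a quick back-of-the-envelope check (plain userspace arithmetic, not kernel code) confirms that 64 bits makes the discussion moot:

```c
#include <stdint.h>

/* Years until a u64 counter wraps at a given bump rate. */
static double years_to_wrap_u64(double bumps_per_sec)
{
	return (double)UINT64_MAX / bumps_per_sec / (365.25 * 24 * 3600.0);
}
```

Even at a billion bumps per second (one per nanosecond), this comes out to roughly 585 years before the counter wraps.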
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote: > To me, the interesting question is whether this allows us to turn on > i_version updates by default on xfs and ext4. XFS v5 file systems have it on by default. Although we'll still need to agree on the exact semantics of i_version before it's going to be useful.
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, Dec 22, 2016 at 09:42:04AM -0500, Jeff Layton wrote: > On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote: > > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote: > > > > > > Only btrfs, ext4, and xfs implement it for data changes. Because of > > > this, these filesystems must log the inode to disk whenever the > > > i_version counter changes. That has a non-zero performance impact, > > > especially on write-heavy workloads, because we end up dirtying the > > > inode metadata on every write, not just when the times change. [1] > > > > Do you have numbers to justify these changes? > > I have numbers. As to whether they justify the changes, I'm not sure. > This helps a lot on a (admittedly nonsensical) 1-byte write workload. On > XFS, with this fio jobfile: To me, the interesting question is whether this allows us to turn on i_version updates by default on xfs and ext4. When Josef looked at doing that previously he withdrew the patch due to performance regressions. I think the most useful thread started here: http://lkml.kernel.org/r/1337092396-3272-1-git-send-email-jo...@redhat.com Skimming quickly I think the regression was also in the small-write case. So apparently that was thought to reveal a real problem? So if you've mostly eliminated that regression, then that's good motivation for your patches. (Though I think in addition to comparing the patched and unpatched i_version case, we need to compare to the unpatched not-i_version case. I'm not clear whether you did that.) --b. > > 8<-- > [global] > direct=0 > size=2g > filesize=512m > bsrange=1-1 > timeout=60 > numjobs=1 > directory=/mnt/scratch > > [f1] > filename=randwrite > rw=randwrite > 8<-- > > Unpatched kernel: > WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, > mint=6msec, maxt=6msec > > Patched kernel: > WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, > mint=6msec, maxt=6msec > > So quite a difference there and it's pretty consistent across runs. 
If I > change the jobfile to have "direct=1" and "bsrange=4k-4k", then any > variation between the two doesn't seem to be significant (numbers vary > as much between runs on the same kernels and are roughly the same). > > Playing with buffered I/O sizes between 1 byte and 4k shows that as the > I/O sizes get larger, this makes less difference (which is what I'd > expect). > > Previous testing with ext4 shows roughly the same results. btrfs shows > some benefit here but significantly less than with ext4 or xfs. Not sure > why that is yet -- maybe CoW effects? > > That said, I don't have a great test rig for this. I'm using VMs with a > dedicated LVM volume that's on a random SSD I had laying around. It > could use testing on a wider set of configurations and workloads. > > I was also hoping that others may have workloads that they think might > be (positively or negatively) affected by these changes. If you can think > of any in particular, then I'm interested to hear about them. > > -- > Jeff Layton
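For scale, the fio results quoted above work out to roughly a 65% throughput improvement on the 1-byte buffered-write workload; a trivial calculation, included only to make the delta concrete:

```c
/* Relative improvement from the quoted fio runs: 12701KB written on the
 * patched kernel vs 7707KB unpatched, over the same 60s timeout. */
static double relative_speedup(double patched_kb, double unpatched_kb)
{
	return patched_kb / unpatched_kb;
}
```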
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Fri, Mar 03, 2017 at 07:53:57PM -0500, Jeff Layton wrote: > On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote: > > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote: > > > tl;dr: I think we can greatly reduce the cost of the inode->i_version > > > counter, by exploiting the fact that we don't need to increment it > > > if no one is looking at it. We can also clean up the code to prepare > > > to eventually expose this value via statx(). > > > > > > The inode->i_version field is supposed to be a value that changes > > > whenever there is any data or metadata change to the inode. Some > > > filesystems use it internally to detect directory changes during > > > readdir. knfsd will use it if the filesystem has MS_I_VERSION > > > set. IMA will also use it (though it's not clear to me that that > > > works 100% -- no check for MS_I_VERSION there). > > > > I'm a little confused about the internal uses for directories. Is it > > necessarily kept on disk? > > No, they aren't necessarily stored on disk, and in fact they aren't on > most (maybe all?) of those filesystems. It's just a convenient place to > store a dir change value that is subsequently checked in readdir > operations. > > That's part of the "fun" of untangling this mess. ;) > > > In cases it's not, then there's not the same > > performance issue, right? > > Right, I don't think it's really a big deal for most of those and I'm > not terribly concerned about the i_version counter on directory change > operations. An extra atomic op on a directory change seems unlikely to > hurt anything. > > The main purpose of this is to improve the situation with small writes. > This should also help pave the way for fs' like NFS and Ceph that > implement a version counter but may not necessarily bump it on every > change. > > I think once we have things more consistent, we'll be able to consider > exposing the i_version counter to userland via statx. 
> > > Is there any risk these patches make > > performance slightly worse in that case? > > > > Maybe, but I think that risk is pretty low. Note that I haven't measured > that specifically here, so I could be completely wrong. > > If it is a problem, we could consider unioning this thing with a non- > atomic field for the directory change cases, but that would add some > complexity and I'm not sure it's worth it. I'd want to measure it first. Makes sense to me. I agree that it's unlikely to be a problem, just wanted to make sure I understood. Anyway, I'm a great fan of the basic idea, I hope we can get this done. --b.
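For readers unfamiliar with the in-memory directory use being discussed: readdir paths typically stash the directory's version in the open file and compare it on the next read, restarting the directory position if it changed. A hedged sketch with illustrative names (not any particular filesystem's code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal models of the two sides of the comparison: the per-open-file
 * cached version and the directory inode's current version. */
struct dir_state { uint64_t f_version; };
struct dir_inode { uint64_t i_version; };

/* Returns true if the directory changed since this open file last read
 * it, in which case the caller must revalidate its readdir position. */
static bool dir_pos_stale(struct dir_state *f, const struct dir_inode *d)
{
	if (f->f_version == d->i_version)
		return false;		/* unchanged since last read */
	f->f_version = d->i_version;
	return true;
}
```

Since this value is only compared within the lifetime of an open file, it never needs to hit disk, which is why the durability and journalling concerns in this thread don't apply to it.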
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote: > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote: > > tl;dr: I think we can greatly reduce the cost of the inode->i_version > > counter, by exploiting the fact that we don't need to increment it > > if no one is looking at it. We can also clean up the code to prepare > > to eventually expose this value via statx(). > > > > The inode->i_version field is supposed to be a value that changes > > whenever there is any data or metadata change to the inode. Some > > filesystems use it internally to detect directory changes during > > readdir. knfsd will use it if the filesystem has MS_I_VERSION > > set. IMA will also use it (though it's not clear to me that that > > works 100% -- no check for MS_I_VERSION there). > > I'm a little confused about the internal uses for directories. Is it > necessarily kept on disk? No, they aren't necessarily stored on disk, and in fact they aren't on most (maybe all?) of those filesystems. It's just a convenient place to store a dir change value that is subsequently checked in readdir operations. That's part of the "fun" of untangling this mess. ;) > In cases it's not, then there's not the same > performance issue, right? Right, I don't think it's really a big deal for most of those and I'm not terribly concerned about the i_version counter on directory change operations. An extra atomic op on a directory change seems unlikely to hurt anything. The main purpose of this is to improve the situation with small writes. This should also help pave the way for fs' like NFS and Ceph that implement a version counter but may not necessarily bump it on every change. I think once we have things more consistent, we'll be able to consider exposing the i_version counter to userland via statx. > Is there any risk these patches make > performance slightly worse in that case? > Maybe, but I think that risk is pretty low. 
Note that I haven't measured that specifically here, so I could be completely wrong. If it is a problem, we could consider unioning this thing with a non- atomic field for the directory change cases, but that would add some complexity and I'm not sure it's worth it. I'd want to measure it first. > > Only btrfs, ext4, and xfs implement it for data changes. Because of > > this, these filesystems must log the inode to disk whenever the > > i_version counter changes. > > On those filesystems that's done for both files and directories, right? > Yes. > > That has a non-zero performance impact, > > especially on write-heavy workloads, because we end up dirtying the > > inode metadata on every write, not just when the times change. [1] > > > > It turns out though that none of these users of i_version require that > > i_version change on every change to the file. The only real requirement > > is that it be different if _something_ changed since the last time we > > queried for it. [2] > > > > So, if we simply keep track of when something queries the value, we > > can avoid bumping the counter and that metadata update. > > > > This patchset implements this: > > > > It starts with some small cleanup patches to just remove any mention of > > the i_version counter in filesystems that don't actually use it. > > > > Then, we add a new set of static inlines for managing the counter. The > > initial version should work identically to what we have now. Then, all > > of the remaining filesystems that use i_version are converted to the new > > inlines. > > > > Once that's in place, we switch to a new implementation that allows us > > to track readers of i_version counter, and only bump it when it's > > necessary or convenient (i.e. we're going to disk anyway). > > > > The final patch switches from a scheme that uses the i_lock to serialize > > the counter updates during write to an atomic64_t. 
That's a wash > > performance-wise in my testing, but I like not having to take the i_lock > > down where it could end up nested inside other locks. > > > > With this, we reduce inode metadata updates across all 3 filesystems > > down to roughly the frequency of the timestamp granularity, particularly > > when it's not being queried (the vastly common case). > > > > The pessimal workload here is 1 byte writes, and it helps that > > significantly. Of course, that's not a real-world workload. > > > > A tiobench-example.fio workload also shows some modest performance > > gains, and I've gotten mails from the kernel test robot that show some > > significant performance gains on some microbenchmarks (case-msync-mt in > > the vm-scalability testsuite to be specific). > > > > I'm happy to run other workloads if anyone can suggest them. > > > > At this point, the patchset works and does what it's expected to do in > > my own testing. It seems like it's at least a modest performance win > > across all 3 major disk-based filesystems. It may also encourage others > > to implement i_version as well since it reduces that
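One way to picture the "only bump when queried" scheme described above: a userspace sketch under the assumption that the low bit of the atomic64 serves as a "queried since last bump" flag (the actual patchset may differ in detail):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: counter in the high 63 bits, low bit set when
 * someone has queried the value since the last bump. */
#define I_VERSION_QUERIED 1ULL

typedef struct { _Atomic uint64_t val; } iversion_t;

/* Called on every change; returns true if the counter actually moved
 * (i.e. the inode must be dirtied and journalled). */
static bool iversion_maybe_inc(iversion_t *iv)
{
	uint64_t cur = atomic_load(&iv->val);
	for (;;) {
		/* Nobody looked since the last bump: skip the update. */
		if (!(cur & I_VERSION_QUERIED))
			return false;
		uint64_t next = (cur + (1ULL << 1)) & ~I_VERSION_QUERIED;
		if (atomic_compare_exchange_weak(&iv->val, &cur, next))
			return true;
	}
}

/* Called when someone (e.g. nfsd) reads the value: mark it queried so
 * the next change is guaranteed to bump the counter. */
static uint64_t iversion_query(iversion_t *iv)
{
	uint64_t cur = atomic_load(&iv->val);
	while (!(cur & I_VERSION_QUERIED)) {
		if (atomic_compare_exchange_weak(&iv->val, &cur,
						 cur | I_VERSION_QUERIED))
			break;
	}
	return (cur | I_VERSION_QUERIED) >> 1;	/* counter sans flag */
}
```

The payoff is in iversion_maybe_inc(): when nothing has queried since the last bump, a write does only an atomic load, and crucially skips the inode dirtying that would otherwise force a journal write on every small write.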
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote: > tl;dr: I think we can greatly reduce the cost of the inode->i_version > counter, by exploiting the fact that we don't need to increment it > if no one is looking at it. We can also clean up the code to prepare > to eventually expose this value via statx(). > > The inode->i_version field is supposed to be a value that changes > whenever there is any data or metadata change to the inode. Some > filesystems use it internally to detect directory changes during > readdir. knfsd will use it if the filesystem has MS_I_VERSION > set. IMA will also use it (though it's not clear to me that that > works 100% -- no check for MS_I_VERSION there). I'm a little confused about the internal uses for directories. Is it necessarily kept on disk? In cases it's not, then there's not the same performance issue, right? Is there any risk these patches make performance slightly worse in that case? > Only btrfs, ext4, and xfs implement it for data changes. Because of > this, these filesystems must log the inode to disk whenever the > i_version counter changes. On those filesystems that's done for both files and directories, right? --b. > That has a non-zero performance impact, > especially on write-heavy workloads, because we end up dirtying the > inode metadata on every write, not just when the times change. [1] > > It turns out though that none of these users of i_version require that > i_version change on every change to the file. The only real requirement > is that it be different if _something_ changed since the last time we > queried for it. [2] > > So, if we simply keep track of when something queries the value, we > can avoid bumping the counter and that metadata update. > > This patchset implements this: > > It starts with some small cleanup patches to just remove any mention of > the i_version counter in filesystems that don't actually use it. > > Then, we add a new set of static inlines for managing the counter. 
The > initial version should work identically to what we have now. Then, all > of the remaining filesystems that use i_version are converted to the new > inlines. > > Once that's in place, we switch to a new implementation that allows us > to track readers of i_version counter, and only bump it when it's > necessary or convenient (i.e. we're going to disk anyway). > > The final patch switches from a scheme that uses the i_lock to serialize > the counter updates during write to an atomic64_t. That's a wash > performance-wise in my testing, but I like not having to take the i_lock > down where it could end up nested inside other locks. > > With this, we reduce inode metadata updates across all 3 filesystems > down to roughly the frequency of the timestamp granularity, particularly > when it's not being queried (the vastly common case). > > The pessimal workload here is 1 byte writes, and it helps that > significantly. Of course, that's not a real-world workload. > > A tiobench-example.fio workload also shows some modest performance > gains, and I've gotten mails from the kernel test robot that show some > significant performance gains on some microbenchmarks (case-msync-mt in > the vm-scalability testsuite to be specific). > > I'm happy to run other workloads if anyone can suggest them. > > At this point, the patchset works and does what it's expected to do in > my own testing. It seems like it's at least a modest performance win > across all 3 major disk-based filesystems. It may also encourage others > to implement i_version as well since it reduces that cost. > > Is this an avenue that's worthwhile to pursue? > > Note that I think we may have other changes coming in the future that > will make this sort of cleanup necessary anyway. I'd like to plug in the > Ceph change attribute here eventually, and that will require something > like this anyway. > > Thoughts, comments and suggestions are welcome... 
> > --- > > [1]: On ext4 it must be turned on with the i_version mount option, > mostly due to fears of incurring this impact, AFAICT. > > [2]: NFS also recommends that it appear to increase in value over time, so > that clients can discard metadata updates that are older than ones > they've already seen. > > Jeff Layton (30): > lustre: don't set f_version in ll_readdir > ecryptfs: remove unnecessary i_version bump > ceph: remove the bump of i_version > f2fs: don't bother setting i_version > hpfs: don't bother with the i_version counter > jfs: remove initialization of i_version counter > nilfs2: remove inode->i_version initialization > orangefs: remove initialization of i_version > reiserfs: remove unneeded i_version bump > ntfs: remove i_version handling > fs: new API for handling i_version > fat: convert to new i_version API > affs: convert to new i_version API > afs: convert to new i_version API > btrfs: convert to new i_version API > exofs: switch to new i_version API >
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> tl;dr: I think we can greatly reduce the cost of the inode->i_version
> counter, by exploiting the fact that we don't need to increment it
> if no one is looking at it. We can also clean up the code to prepare
> to eventually expose this value via statx().
>
> The inode->i_version field is supposed to be a value that changes
> whenever there is any data or metadata change to the inode. Some
> filesystems use it internally to detect directory changes during
> readdir. knfsd will use it if the filesystem has MS_I_VERSION
> set. IMA will also use it (though it's not clear to me that that
> works 100% -- no check for MS_I_VERSION there).

I'm a little confused about the internal uses for directories. Is it necessarily kept on disk? In cases where it's not, there's not the same performance issue, right? Is there any risk these patches make performance slightly worse in that case?

> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes.

On those filesystems that's done for both files and directories, right?

--b.

> That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]
>
> It turns out though that none of these users of i_version require that
> i_version change on every change to the file. The only real requirement
> is that it be different if _something_ changed since the last time we
> queried for it. [2]
>
> So, if we simply keep track of when something queries the value, we
> can avoid bumping the counter and that metadata update.
>
> This patchset implements this:
>
> It starts with some small cleanup patches to just remove any mention of
> the i_version counter in filesystems that don't actually use it.
>
> Then, we add a new set of static inlines for managing the counter. The
> initial version should work identically to what we have now. Then, all
> of the remaining filesystems that use i_version are converted to the new
> inlines.
>
> Once that's in place, we switch to a new implementation that allows us
> to track readers of i_version counter, and only bump it when it's
> necessary or convenient (i.e. we're going to disk anyway).
>
> The final patch switches from a scheme that uses the i_lock to serialize
> the counter updates during write to an atomic64_t. That's a wash
> performance-wise in my testing, but I like not having to take the i_lock
> down where it could end up nested inside other locks.
>
> With this, we reduce inode metadata updates across all 3 filesystems
> down to roughly the frequency of the timestamp granularity, particularly
> when it's not being queried (the vastly common case).
>
> The pessimal workload here is 1 byte writes, and it helps that
> significantly. Of course, that's not a real-world workload.
>
> A tiobench-example.fio workload also shows some modest performance
> gains, and I've gotten mails from the kernel test robot that show some
> significant performance gains on some microbenchmarks (case-msync-mt in
> the vm-scalability testsuite to be specific).
>
> I'm happy to run other workloads if anyone can suggest them.
>
> At this point, the patchset works and does what it's expected to do in
> my own testing. It seems like it's at least a modest performance win
> across all 3 major disk-based filesystems. It may also encourage others
> to implement i_version as well since it reduces that cost.
>
> Is this an avenue that's worthwhile to pursue?
>
> Note that I think we may have other changes coming in the future that
> will make this sort of cleanup necessary anyway. I'd like to plug in the
> Ceph change attribute here eventually, and that will require something
> like this anyway.
>
> Thoughts, comments and suggestions are welcome...
>
> ---
>
> [1]: On ext4 it must be turned on with the i_version mount option,
> mostly due to fears of incurring this impact, AFAICT.
>
> [2]: NFS also recommends that it appear to increase in value over time, so
> that clients can discard metadata updates that are older than ones
> they've already seen.
>
> Jeff Layton (30):
> lustre: don't set f_version in ll_readdir
> ecryptfs: remove unnecessary i_version bump
> ceph: remove the bump of i_version
> f2fs: don't bother setting i_version
> hpfs: don't bother with the i_version counter
> jfs: remove initialization of i_version counter
> nilfs2: remove inode->i_version initialization
> orangefs: remove initialization of i_version
> reiserfs: remove unneeded i_version bump
> ntfs: remove i_version handling
> fs: new API for handling i_version
> fat: convert to new i_version API
> affs: convert to new i_version API
> afs: convert to new i_version API
> btrfs: convert to new i_version API
> exofs: switch to new i_version API
>
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes. That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
>
> Do you have numbers to justify these changes?

I have numbers. As to whether they justify the changes, I'm not sure. This helps a lot on a (admittedly nonsensical) 1-byte write workload. On XFS, with this fio jobfile:

8<--
[global]
direct=0
size=2g
filesize=512m
bsrange=1-1
timeout=60
numjobs=1
directory=/mnt/scratch

[f1]
filename=randwrite
rw=randwrite
8<--

Unpatched kernel:
WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, mint=6msec, maxt=6msec

Patched kernel:
WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, mint=6msec, maxt=6msec

So quite a difference there, and it's pretty consistent across runs. If I change the jobfile to have "direct=1" and "bsrange=4k-4k", then any variation between the two doesn't seem to be significant (numbers vary as much between runs on the same kernels and are roughly the same). Playing with buffered I/O sizes between 1 byte and 4k shows that as the I/O sizes get larger, this makes less difference (which is what I'd expect).

Previous testing with ext4 shows roughly the same results. btrfs shows some benefit here, but significantly less than with ext4 or xfs. Not sure why that is yet -- maybe CoW effects?

That said, I don't have a great test rig for this. I'm using VMs with a dedicated LVM volume that's on a random SSD I had lying around. It could use testing on a wider set of configurations and workloads. I was also hoping that others may have workloads that they think might be (positively or negatively) affected by these changes. If you can think of any in particular, then I'm interested to hear about them.

-- 
Jeff Layton
Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes. That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]

Do you have numbers to justify these changes?
[RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
tl;dr: I think we can greatly reduce the cost of the inode->i_version counter, by exploiting the fact that we don't need to increment it if no one is looking at it. We can also clean up the code to prepare to eventually expose this value via statx().

The inode->i_version field is supposed to be a value that changes whenever there is any data or metadata change to the inode. Some filesystems use it internally to detect directory changes during readdir. knfsd will use it if the filesystem has MS_I_VERSION set. IMA will also use it (though it's not clear to me that that works 100% -- no check for MS_I_VERSION there).

Only btrfs, ext4, and xfs implement it for data changes. Because of this, these filesystems must log the inode to disk whenever the i_version counter changes. That has a non-zero performance impact, especially on write-heavy workloads, because we end up dirtying the inode metadata on every write, not just when the times change. [1]

It turns out though that none of these users of i_version require that i_version change on every change to the file. The only real requirement is that it be different if _something_ changed since the last time we queried for it. [2]

So, if we simply keep track of when something queries the value, we can avoid bumping the counter and that metadata update.

This patchset implements this:

It starts with some small cleanup patches to just remove any mention of the i_version counter in filesystems that don't actually use it.

Then, we add a new set of static inlines for managing the counter. The initial version should work identically to what we have now. Then, all of the remaining filesystems that use i_version are converted to the new inlines.

Once that's in place, we switch to a new implementation that allows us to track readers of the i_version counter, and only bump it when it's necessary or convenient (i.e. we're going to disk anyway).

The final patch switches from a scheme that uses the i_lock to serialize the counter updates during write to an atomic64_t. That's a wash performance-wise in my testing, but I like not having to take the i_lock down where it could end up nested inside other locks.

With this, we reduce inode metadata updates across all 3 filesystems down to roughly the frequency of the timestamp granularity, particularly when it's not being queried (the vastly common case).

The pessimal workload here is 1-byte writes, and it helps that significantly. Of course, that's not a real-world workload.

A tiobench-example.fio workload also shows some modest performance gains, and I've gotten mails from the kernel test robot that show some significant performance gains on some microbenchmarks (case-msync-mt in the vm-scalability testsuite to be specific).

I'm happy to run other workloads if anyone can suggest them.

At this point, the patchset works and does what it's expected to do in my own testing. It seems like it's at least a modest performance win across all 3 major disk-based filesystems. It may also encourage others to implement i_version as well since it reduces that cost.

Is this an avenue that's worthwhile to pursue?

Note that I think we may have other changes coming in the future that will make this sort of cleanup necessary anyway. I'd like to plug in the Ceph change attribute here eventually, and that will require something like this anyway.

Thoughts, comments and suggestions are welcome...

---

[1]: On ext4 it must be turned on with the i_version mount option, mostly due to fears of incurring this impact, AFAICT.

[2]: NFS also recommends that it appear to increase in value over time, so that clients can discard metadata updates that are older than ones they've already seen.
Jeff Layton (30):
lustre: don't set f_version in ll_readdir
ecryptfs: remove unnecessary i_version bump
ceph: remove the bump of i_version
f2fs: don't bother setting i_version
hpfs: don't bother with the i_version counter
jfs: remove initialization of i_version counter
nilfs2: remove inode->i_version initialization
orangefs: remove initialization of i_version
reiserfs: remove unneeded i_version bump
ntfs: remove i_version handling
fs: new API for handling i_version
fat: convert to new i_version API
affs: convert to new i_version API
afs: convert to new i_version API
btrfs: convert to new i_version API
exofs: switch to new i_version API
ext2: convert to new i_version API
ext4: convert to new i_version API
nfs: convert to new i_version API
nfsd: convert to new i_version API
ocfs2: convert to new i_version API
ufs: use new i_version API
xfs: convert to new i_version API
IMA: switch IMA over to new i_version API
fs: add a "force" parameter to inode_inc_iversion
fs: only set S_VERSION when updating times if it has been queried
xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing
btrfs: only dirty the inode in btrfs_update_time if something was changed
fs: track whether the