Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-10-30 Thread Jeff Layton
On Fri, 2017-05-12 at 12:21 -0400, J. Bruce Fields wrote:
> On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> > On Thu, May 11 2017, J. Bruce Fields wrote:
> > > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > > +{
> > > + u64 chattr;
> > > +
> > > + chattr = inode->i_ctime.tv_sec << 30;
> > > + chattr += inode->i_ctime.tv_nsec;
> > > + chattr += inode->i_version;
> > > + return chattr;
> > 
> > So if I chmod a file, all clients will need to flush the content from their 
> > cache?
> > Maybe they already do?  Maybe it is a boring corner case?
> 
> Yeah, that's the assumption, maybe it's wrong.  I can't recall
> complaints about anyone bitten by that case.
> 

I'm pretty sure that's required by the RFC. The change attribute changes
with both data and metadata changes, and there is no way to tell what
sort of change it was. You have to dump everything out of the cache when
it changes.
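
For illustration, the client-side consequence is roughly the following
(a sketch only; the struct and helper names are invented, this is not
the actual Linux client code):

        struct cached_inode {
                u64 change_attr;        /* last change attribute seen */
                /* ... cached pages, attributes, ... */
        };

        static void revalidate(struct cached_inode *ci, u64 server_chattr)
        {
                /*
                 * The change attribute is opaque: a mismatch may be a
                 * data or a metadata change, so cached data has to be
                 * dropped either way.
                 */
                if (ci->change_attr != server_chattr) {
                        invalidate_cached_data(ci);     /* hypothetical */
                        ci->change_attr = server_chattr;
                }
        }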

> > >  /*
> > >   * Fill in the pre_op attr for the wcc data
> > >   */
> > > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > >   fhp->fh_pre_mtime = inode->i_mtime;
> > >   fhp->fh_pre_ctime = inode->i_ctime;
> > >   fhp->fh_pre_size  = inode->i_size;
> > > - fhp->fh_pre_change = inode->i_version;
> > > + fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > >   fhp->fh_pre_saved = true;
> > >   }
> > >  }
> > > --- a/fs/nfsd/nfs3xdr.c
> > > +++ b/fs/nfsd/nfs3xdr.c
> > > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > >   printk("nfsd: inode locked twice during operation.\n");
> > >  
> > >   err = fh_getattr(fhp, &fhp->fh_post_attr);
> > > - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > > + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > >   if (err) {
> > >   fhp->fh_post_saved = false;
> > >   /* Grab the ctime anyway - set_change_info might use it */
> > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > > index 26780d53a6f9..a09532d4a383 100644
> > > --- a/fs/nfsd/nfs4xdr.c
> > > +++ b/fs/nfsd/nfs4xdr.c
> > > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct 
> > > kstat *stat, struct inode *inode,
> > >   *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > >   *p++ = 0;
> > >   } else if (IS_I_VERSION(inode)) {
> > > - p = xdr_encode_hyper(p, inode->i_version);
> > > + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > >   } else {
> > >   *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > >   *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> > 
> > It is *really* confusing to find that fh_post_change is only set in nfs3
> > code, and only used in nfs4 code.
> 
> Yup.
> 
> > It is probably time to get a 'version' field in 'struct kstat'.
> 
> The pre/post_wcc code doesn't seem to be doing an explicit stat, I
> wonder if that matters?
> 

Probably not for now. We only use this for namespace altering operations
anyway (create, link, unlink, and rename).

The post code does do a fh_getattr. It's only the pre-op i_version that
comes out of the cache. Only btrfs, xfs, and ext4 have a real i_version
counter today, and they just scrape that info out of the in-core inode.
So while not completely atomic, you should see a difference in the
change_info4 during any of those operations.
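
To make that concrete: the v4 reply carries a change_info4 triple, and
a client can use it roughly like this (a sketch; the struct mirrors the
NFSv4 XDR in RFC 7530, the helper name is invented):

        struct change_info4 {
                bool    atomic;         /* before/after sampled atomically? */
                u64     before;         /* dir change attr before the op */
                u64     after;          /* dir change attr after the op */
        };

        /*
         * If the pair was sampled atomically and "before" matches the
         * cached value, only this operation changed the directory, so
         * the cache can be updated in place; otherwise re-read it.
         */
        static bool dir_cache_still_valid(u64 cached,
                                          const struct change_info4 *ci)
        {
                return ci->atomic && ci->before == cached;
        }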

FWIW, userland cephfs now supports a cluster-coherent change attribute,
though the kernel client still needs some work before we can implement
it there. Eventually we'll add that, and at that point we might need to
have nfsd do a getattr in the pre part as well.

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread J. Bruce Fields
On Fri, May 12, 2017 at 08:22:23AM +1000, NeilBrown wrote:
> On Thu, May 11 2017, J. Bruce Fields wrote:
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > +   u64 chattr;
> > +
> > +   chattr = inode->i_ctime.tv_sec << 30;
> > +   chattr += inode->i_ctime.tv_nsec;
> > +   chattr += inode->i_version;
> > +   return chattr;
> 
> So if I chmod a file, all clients will need to flush the content from their 
> cache?
> Maybe they already do?  Maybe it is a boring corner case?

Yeah, that's the assumption, maybe it's wrong.  I can't recall
complaints about anyone bitten by that case.

> >  /*
> >   * Fill in the pre_op attr for the wcc data
> >   */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_mtime = inode->i_mtime;
> > fhp->fh_pre_ctime = inode->i_ctime;
> > fhp->fh_pre_size  = inode->i_size;
> > -   fhp->fh_pre_change = inode->i_version;
> > +   fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > fhp->fh_pre_saved = true;
> > }
> >  }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > printk("nfsd: inode locked twice during operation.\n");
> >  
> >   err = fh_getattr(fhp, &fhp->fh_post_attr);
> > -   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > +   fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > if (err) {
> > fhp->fh_post_saved = false;
> > /* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> > *stat, struct inode *inode,
> > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > *p++ = 0;
> > } else if (IS_I_VERSION(inode)) {
> > -   p = xdr_encode_hyper(p, inode->i_version);
> > +   p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > } else {
> > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> 
> It is *really* confusing to find that fh_post_change is only set in nfs3
> code, and only used in nfs4 code.

Yup.

> It is probably time to get a 'version' field in 'struct kstat'.

The pre/post_wcc code doesn't seem to be doing an explicit stat, I
wonder if that matters?

--b.

> That would allow this code to get a little cleaner.
> 
> (to me, this exercise is just a reminder that the NFSv4 change attribute
> is poorly designed ... so it just makes me grumpy).
> 
> NeilBrown


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread J. Bruce Fields
On Fri, May 12, 2017 at 07:01:25AM -0400, Jeff Layton wrote:
> This looks reasonable to me (modulo Jan's comment about casting tv_sec
> to u64).
> 
> To be clear, I think this is mostly orthogonal to the changes that I was
> originally proposing, right? I think we can still benefit from only
> bumping and storing i_version values after they've been queried.

Definitely, yes.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread J. Bruce Fields
On Fri, May 12, 2017 at 10:27:54AM +0200, Jan Kara wrote:
> On Thu 11-05-17 14:59:43, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > > 1) Keep i_version as is, make clients also check for i_ctime.
> > > 
> > > That would be a protocol revision, which we'd definitely rather avoid.
> > > 
> > > But can't we accomplish the same by using something like
> > > 
> > >   ctime * (some constant) + i_version
> > > 
> > > ?
> > > 
> > > >Pro: No on-disk format changes.
> > > >Cons: After a crash, i_version can go backwards (but when file 
> > > > changes
> > > >i_version, i_ctime pair should be still different) or not, data can 
> > > > be
> > > >old or not.
> > > 
> > > This is probably good enough for NFS purposes: typically on an NFS
> > > filesystem, results of a read in the face of a concurrent write open are
> > > undefined.  And writers sync before close.
> > > 
> > > So after a crash with a dirty inode, we're in a situation where an NFS
> > > client still needs to resend some writes, sync, and close.  I'm OK with
> > > things being inconsistent during this window.
> > > 
> > > I do expect things to return to normal once that client has resent its
> > > writes--hence the worry about actually reusing old values after boot
> > > (such as if i_version regresses on boot and then increments back to the
> > > same value after further writes).  Factoring in ctime fixes that.
> > 
> > So for now I'm thinking of just doing something like the following.
> > 
> > Only nfsd needs it for now, but it could be moved to a vfs helper for
> > statx, or for individual filesystems that want to do something
> > different.  (The NFSv4 client will want to use the server's change
> > attribute instead, I think.  And other filesystems might want to try
> > something more ambitious like Neil's proposal.)
> > 
> > --b.
> > 
> > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> > index 12feac6ee2fd..9636c9a60aba 100644
> > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> > index f84fe6bf9aee..14f09f1ef605 100644
> > --- a/fs/nfsd/nfsfh.h
> > +++ b/fs/nfsd/nfsfh.h
> > @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_saved = false;
> >  }
> >  
> > +static inline u64 nfsd4_change_attribute(struct inode *inode)
> > +{
> > +   u64 chattr;
> > +
> > +   chattr = inode->i_ctime.tv_sec << 30;
> 
> Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
> Probably you need explicit (u64) cast... Otherwise I'm fine with this.

Whoops, yes.  Or just assign to chattr as a separate step.  I'll fix
that.

--b.
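
For reference, the fixed helper would look something like this (a
sketch of the fix being discussed; the cast is the only change):

        static inline u64 nfsd4_change_attribute(struct inode *inode)
        {
                u64 chattr;

                /* widen first: a 32-bit time_t shifted left by 30 overflows */
                chattr = (u64)inode->i_ctime.tv_sec << 30;
                chattr += inode->i_ctime.tv_nsec;       /* < 10^9, fits in 30 bits */
                chattr += inode->i_version;
                return chattr;
        }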

> > +   chattr += inode->i_ctime.tv_nsec;
> > +   chattr += inode->i_version;
> > +   return chattr;
> > +}
> > +
> >  /*
> >   * Fill in the pre_op attr for the wcc data
> >   */
> > @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
> > fhp->fh_pre_mtime = inode->i_mtime;
> > fhp->fh_pre_ctime = inode->i_ctime;
> > fhp->fh_pre_size  = inode->i_size;
> > -   fhp->fh_pre_change = inode->i_version;
> > +   fhp->fh_pre_change = nfsd4_change_attribute(inode);
> > fhp->fh_pre_saved = true;
> > }
> >  }
> > --- a/fs/nfsd/nfs3xdr.c
> > +++ b/fs/nfsd/nfs3xdr.c
> > @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
> > printk("nfsd: inode locked twice during operation.\n");
> >  
> >     err = fh_getattr(fhp, &fhp->fh_post_attr);
> > -   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> > +   fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
> > if (err) {
> > fhp->fh_post_saved = false;
> > /* Grab the ctime anyway - set_change_info might use it */
> > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> > index 26780d53a6f9..a09532d4a383 100644
> > --- a/fs/nfsd/nfs4xdr.c
> > +++ b/fs/nfsd/nfs4xdr.c
> > @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> > *stat, struct inode *inode,
> > *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
> > *p++ = 0;
> > } else if (IS_I_VERSION(inode)) {
> > -   p = xdr_encode_hyper(p, inode->i_version);
> > +   p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
> > } else {
> > *p++ = cpu_to_be32(stat->ctime.tv_sec);
> > *p++ = cpu_to_be32(stat->ctime.tv_nsec);
> -- 
> Jan Kara 
> SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread Jeff Layton
On Thu, 2017-05-11 at 14:59 -0400, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > 1) Keep i_version as is, make clients also check for i_ctime.
> > 
> > That would be a protocol revision, which we'd definitely rather avoid.
> > 
> > But can't we accomplish the same by using something like
> > 
> > ctime * (some constant) + i_version
> > 
> > ?
> > 
> > >Pro: No on-disk format changes.
> > >Cons: After a crash, i_version can go backwards (but when file changes
> > >i_version, i_ctime pair should be still different) or not, data can be
> > >old or not.
> > 
> > This is probably good enough for NFS purposes: typically on an NFS
> > filesystem, results of a read in the face of a concurrent write open are
> > undefined.  And writers sync before close.
> > 
> > So after a crash with a dirty inode, we're in a situation where an NFS
> > client still needs to resend some writes, sync, and close.  I'm OK with
> > things being inconsistent during this window.
> > 
> > I do expect things to return to normal once that client has resent its
> > writes--hence the worry about actually reusing old values after boot
> > (such as if i_version regresses on boot and then increments back to the
> > same value after further writes).  Factoring in ctime fixes that.
> 
> So for now I'm thinking of just doing something like the following.
> 
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different.  (The NFSv4 client will want to use the server's change
> attribute instead, I think.  And other filesystems might want to try
> something more ambitious like Neil's proposal.)
> 
> --b.
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_saved = false;
>  }
>  
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;
> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;
> +}
> +
>  /*
>   * Fill in the pre_op attr for the wcc data
>   */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_mtime = inode->i_mtime;
>   fhp->fh_pre_ctime = inode->i_ctime;
>   fhp->fh_pre_size  = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
>   fhp->fh_pre_saved = true;
>   }
>  }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>   printk("nfsd: inode locked twice during operation.\n");
>  
>   err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
>   if (err) {
>   fhp->fh_post_saved = false;
>   /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> *stat, struct inode *inode,
>   *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
>   *p++ = 0;
>   } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
>   } else {
>   *p++ = cpu_to_be32(stat->ctime.tv_sec);
>   *p++ = cpu_to_be32(stat->ctime.tv_nsec);


Sorry I've been MIA on this discussion. I've had a very busy spring...

This looks reasonable to me (modulo Jan's comment about casting tv_sec
to u64).

To be clear, I think this is mostly orthogonal to the changes that I was
originally proposing, right? I think we can still benefit from only
bumping and storing i_version values after they've been queried.
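
The scheme being alluded to might look something like the following
sketch (not the actual patch series; the flag bit and helper names are
invented here, and locking/atomicity is omitted). The low bit of the
counter records whether the current value has been handed out, so a
burst of writes between queries costs only one bump:

        #define I_VERSION_QUERIED       1ULL    /* low bit of i_version */

        /* On each change: bump (and dirty the inode) only if queried. */
        static inline void inode_maybe_inc_iversion(struct inode *inode)
        {
                if (inode->i_version & I_VERSION_QUERIED)
                        inode->i_version++;     /* odd + 1 bumps the counter
                                                 * and clears the flag */
        }

        /* On handing the value out (e.g. to nfsd): mark it queried. */
        static inline u64 inode_query_iversion(struct inode *inode)
        {
                inode->i_version |= I_VERSION_QUERIED;
                return inode->i_version >> 1;   /* counter sits above the flag */
        }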

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-12 Thread Jan Kara
On Thu 11-05-17 14:59:43, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > > 1) Keep i_version as is, make clients also check for i_ctime.
> > 
> > That would be a protocol revision, which we'd definitely rather avoid.
> > 
> > But can't we accomplish the same by using something like
> > 
> > ctime * (some constant) + i_version
> > 
> > ?
> > 
> > >Pro: No on-disk format changes.
> > >Cons: After a crash, i_version can go backwards (but when file changes
> > >i_version, i_ctime pair should be still different) or not, data can be
> > >old or not.
> > 
> > This is probably good enough for NFS purposes: typically on an NFS
> > filesystem, results of a read in the face of a concurrent write open are
> > undefined.  And writers sync before close.
> > 
> > So after a crash with a dirty inode, we're in a situation where an NFS
> > client still needs to resend some writes, sync, and close.  I'm OK with
> > things being inconsistent during this window.
> > 
> > I do expect things to return to normal once that client has resent its
> > writes--hence the worry about actually reusing old values after boot
> > (such as if i_version regresses on boot and then increments back to the
> > same value after further writes).  Factoring in ctime fixes that.
> 
> So for now I'm thinking of just doing something like the following.
> 
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different.  (The NFSv4 client will want to use the server's change
> attribute instead, I think.  And other filesystems might want to try
> something more ambitious like Neil's proposal.)
> 
> --b.
> 
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_saved = false;
>  }
>  
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;

Won't this overflow on 32-bit archs? tv_sec seems to be defined as long?
Probably you need explicit (u64) cast... Otherwise I'm fine with this.

Honza

> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;
> +}
> +
>  /*
>   * Fill in the pre_op attr for the wcc data
>   */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_mtime = inode->i_mtime;
>   fhp->fh_pre_ctime = inode->i_ctime;
>   fhp->fh_pre_size  = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
>   fhp->fh_pre_saved = true;
>   }
>  }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>   printk("nfsd: inode locked twice during operation.\n");
>  
>   err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
>   if (err) {
>   fhp->fh_post_saved = false;
>   /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> *stat, struct inode *inode,
>   *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
>   *p++ = 0;
>   } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
>   } else {
>   *p++ = cpu_to_be32(stat->ctime.tv_sec);
>   *p++ = cpu_to_be32(stat->ctime.tv_nsec);
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-11 Thread NeilBrown
On Thu, May 11 2017, J. Bruce Fields wrote:

> On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
>> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
>> > 1) Keep i_version as is, make clients also check for i_ctime.
>> 
>> That would be a protocol revision, which we'd definitely rather avoid.
>> 
>> But can't we accomplish the same by using something like
>> 
>>  ctime * (some constant) + i_version
>> 
>> ?
>> 
>> >Pro: No on-disk format changes.
>> >Cons: After a crash, i_version can go backwards (but when file changes
>> >i_version, i_ctime pair should be still different) or not, data can be
>> >old or not.
>> 
>> This is probably good enough for NFS purposes: typically on an NFS
>> filesystem, results of a read in the face of a concurrent write open are
>> undefined.  And writers sync before close.
>> 
>> So after a crash with a dirty inode, we're in a situation where an NFS
>> client still needs to resend some writes, sync, and close.  I'm OK with
>> things being inconsistent during this window.
>> 
>> I do expect things to return to normal once that client has resent its
>> writes--hence the worry about actually reusing old values after boot
>> (such as if i_version regresses on boot and then increments back to the
>> same value after further writes).  Factoring in ctime fixes that.
>
> So for now I'm thinking of just doing something like the following.
>
> Only nfsd needs it for now, but it could be moved to a vfs helper for
> statx, or for individual filesystems that want to do something
> different.  (The NFSv4 client will want to use the server's change
> attribute instead, I think.  And other filesystems might want to try
> something more ambitious like Neil's proposal.)
>
> --b.
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 12feac6ee2fd..9636c9a60aba 100644
> diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
> index f84fe6bf9aee..14f09f1ef605 100644
> --- a/fs/nfsd/nfsfh.h
> +++ b/fs/nfsd/nfsfh.h
> @@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_saved = false;
>  }
>  
> +static inline u64 nfsd4_change_attribute(struct inode *inode)
> +{
> + u64 chattr;
> +
> + chattr = inode->i_ctime.tv_sec << 30;
> + chattr += inode->i_ctime.tv_nsec;
> + chattr += inode->i_version;
> + return chattr;

So if I chmod a file, all clients will need to flush the content from their 
cache?
Maybe they already do?  Maybe it is a boring corner case?

> +}
> +
>  /*
>   * Fill in the pre_op attr for the wcc data
>   */
> @@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
>   fhp->fh_pre_mtime = inode->i_mtime;
>   fhp->fh_pre_ctime = inode->i_ctime;
>   fhp->fh_pre_size  = inode->i_size;
> - fhp->fh_pre_change = inode->i_version;
> + fhp->fh_pre_change = nfsd4_change_attribute(inode);
>   fhp->fh_pre_saved = true;
>   }
>  }
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
>   printk("nfsd: inode locked twice during operation.\n");
>  
>   err = fh_getattr(fhp, &fhp->fh_post_attr);
> - fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
> + fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
>   if (err) {
>   fhp->fh_post_saved = false;
>   /* Grab the ctime anyway - set_change_info might use it */
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 26780d53a6f9..a09532d4a383 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
> *stat, struct inode *inode,
>   *p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
>   *p++ = 0;
>   } else if (IS_I_VERSION(inode)) {
> - p = xdr_encode_hyper(p, inode->i_version);
> + p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
>   } else {
>   *p++ = cpu_to_be32(stat->ctime.tv_sec);
>   *p++ = cpu_to_be32(stat->ctime.tv_nsec);

It is *really* confusing to find that fh_post_change is only set in nfs3
code, and only used in nfs4 code.
It is probably time to get a 'version' field in 'struct kstat'.
That would allow this code to get a little cleaner.
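
Something along these lines, purely hypothetical at this point, would
let both the v3 wcc code and the v4 encoder take the value from the
getattr result instead of reaching into the inode:

        /* hypothetical extension; struct kstat has no such field today */
        struct kstat {
                /* ... existing fields: mode, nlink, size, mtime, ctime ... */
                u64     version;        /* filesystem's change attribute */
        };

        /* fill_post_wcc() could then use the fh_getattr() result directly: */
        fhp->fh_post_change = fhp->fh_post_attr.version;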

(to me, this exercise is just a reminder that the NFSv4 change attribute
is poorly designed ... so it just makes me grumpy).

NeilBrown


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-05-11 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 02:14:09PM -0400, J. Bruce Fields wrote:
> On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> > 1) Keep i_version as is, make clients also check for i_ctime.
> 
> That would be a protocol revision, which we'd definitely rather avoid.
> 
> But can't we accomplish the same by using something like
> 
>   ctime * (some constant) + i_version
> 
> ?
> 
> >Pro: No on-disk format changes.
> >Cons: After a crash, i_version can go backwards (but when file changes
> >i_version, i_ctime pair should be still different) or not, data can be
> >old or not.
> 
> This is probably good enough for NFS purposes: typically on an NFS
> filesystem, results of a read in the face of a concurrent write open are
> undefined.  And writers sync before close.
> 
> So after a crash with a dirty inode, we're in a situation where an NFS
> client still needs to resend some writes, sync, and close.  I'm OK with
> things being inconsistent during this window.
> 
> I do expect things to return to normal once that client has resent its
> writes--hence the worry about actually reusing old values after boot
> (such as if i_version regresses on boot and then increments back to the
> same value after further writes).  Factoring in ctime fixes that.

So for now I'm thinking of just doing something like the following.

Only nfsd needs it for now, but it could be moved to a vfs helper for
statx, or for individual filesystems that want to do something
different.  (The NFSv4 client will want to use the server's change
attribute instead, I think.  And other filesystems might want to try
something more ambitious like Neil's proposal.)

--b.

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 12feac6ee2fd..9636c9a60aba 100644
diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h
index f84fe6bf9aee..14f09f1ef605 100644
--- a/fs/nfsd/nfsfh.h
+++ b/fs/nfsd/nfsfh.h
@@ -240,6 +240,16 @@ fh_clear_wcc(struct svc_fh *fhp)
fhp->fh_pre_saved = false;
 }
 
+static inline u64 nfsd4_change_attribute(struct inode *inode)
+{
+   u64 chattr;
+
+   chattr = inode->i_ctime.tv_sec << 30;
+   chattr += inode->i_ctime.tv_nsec;
+   chattr += inode->i_version;
+   return chattr;
+}
+
 /*
  * Fill in the pre_op attr for the wcc data
  */
@@ -253,7 +263,7 @@ fill_pre_wcc(struct svc_fh *fhp)
fhp->fh_pre_mtime = inode->i_mtime;
fhp->fh_pre_ctime = inode->i_ctime;
fhp->fh_pre_size  = inode->i_size;
-   fhp->fh_pre_change = inode->i_version;
+   fhp->fh_pre_change = nfsd4_change_attribute(inode);
fhp->fh_pre_saved = true;
}
 }
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -260,7 +260,7 @@ void fill_post_wcc(struct svc_fh *fhp)
printk("nfsd: inode locked twice during operation.\n");
 
err = fh_getattr(fhp, &fhp->fh_post_attr);
-   fhp->fh_post_change = d_inode(fhp->fh_dentry)->i_version;
+   fhp->fh_post_change = nfsd4_change_attribute(d_inode(fhp->fh_dentry));
if (err) {
fhp->fh_post_saved = false;
/* Grab the ctime anyway - set_change_info might use it */
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 26780d53a6f9..a09532d4a383 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1973,7 +1973,7 @@ static __be32 *encode_change(__be32 *p, struct kstat 
*stat, struct inode *inode,
*p++ = cpu_to_be32(convert_to_wallclock(exp->cd->flush_time));
*p++ = 0;
} else if (IS_I_VERSION(inode)) {
-   p = xdr_encode_hyper(p, inode->i_version);
+   p = xdr_encode_hyper(p, nfsd4_change_attribute(inode));
} else {
*p++ = cpu_to_be32(stat->ctime.tv_sec);
*p++ = cpu_to_be32(stat->ctime.tv_nsec);


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-06 Thread Jan Kara
On Thu 06-04-17 11:12:02, NeilBrown wrote:
> On Wed, Apr 05 2017, Jan Kara wrote:
> >> If you want to ensure read-only files can remain cached over a crash,
> >> then you would have to mark a file in some way on stable storage
> >> *before* allowing any change.
> >> e.g. you could use the lsb.  Odd i_versions might have been changed
> >> recently and crash-count*large-number needs to be added.
> >> Even i_versions have not been changed recently and nothing need be
> >> added.
> >> 
> >> If you want to change a file with an even i_version, you subtract
> >>   crash-count*large-number
> >> from the i_version, then set lsb.  This is written to stable storage before
> >> the change.
> >> 
> >> If a file has not been changed for a while, you can add
> >>   crash-count*large-number
> >> and clear lsb.
> >> 
> >> The lsb of the i_version would be for internal use only.  It would not
> >> be visible outside the filesystem.
> >> 
> >> It feels a bit clunky, but I think it would work and is the best
> >> combination of Jan's idea and your requirement.
> >> The biggest cost would be switching to 'odd' before any changes, and the
> >> unknown is when it makes sense to switch to 'even'.
> >
> > Well, there is also a problem that you would need to somehow remember with
> > which 'crash count' the i_version has been previously reported as that is
> > not stored on disk with my scheme. So I don't think we can easily use your
> > scheme.
> 
> I don't think there is a problem here; maybe I didn't explain it
> properly or something.
> 
> I'm assuming there is a crash-count that is stored once per filesystem.
> This might be a disk-format change, or maybe the "Last checked" time
> could be used with ext4 (that is a bit horrible though).
> 
> Every on-disk i_version has a flag to choose between:
>   - use this number as it is, but update it on-disk before any change
>   - add multiple of current crash-count to this number before use.
>   If you crash during an update, the i_version is thus automatically
>   increased.
> 
> To change from the first option to the second option you subtract the
> multiple of the current crash-count (which might make the stored
> i_version negative), and flip the bit.
> To change from the second option to the first, you add the multiple
> of the current crash-count, and flip the bit.
> In each case, the externally visible i_version does not change.
> Nothing needs to be stored except the per-inode i_version and the per-fs
> crash_count. 

Right, I didn't realize you would subtract crash counter when flipping the
bit and then add it back when flipping again. That would work.
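
A rough userspace sketch of this flag scheme (LARGE, the helper names, and
the bit layout are illustrative only, not a proposed on-disk format):

    #include <stdint.h>

    #define LARGE (1ULL << 16)  /* added once per crash; example value */

    /* Stored (on-disk) value: bit 0 is the flag, bits 63:1 the counter. */
    static uint64_t visible_version(uint64_t disk, uint64_t crash_count)
    {
        uint64_t v = disk >> 1;

        if (disk & 1)                   /* odd: recently changed */
            v += crash_count * LARGE;
        return v;                       /* even: use as-is */
    }

    /* Written to stable storage *before* the first change to the file. */
    static uint64_t make_odd(uint64_t disk, uint64_t crash_count)
    {
        /* may wrap "negative"; unsigned wraparound undoes itself later */
        uint64_t v = (disk >> 1) - crash_count * LARGE;

        return (v << 1) | 1;
    }

    /* After a quiet period, fold the current crash count back in. */
    static uint64_t make_even(uint64_t disk, uint64_t crash_count)
    {
        return ((disk >> 1) + crash_count * LARGE) << 1;
    }

Flipping the flag in either direction leaves visible_version() unchanged,
while a crash (which increments crash_count) bumps every "odd" inode's
visible version by LARGE.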

> > So the options we have are:
> >
> > 1) Keep i_version as is, make clients also check for i_ctime.
> >Pro: No on-disk format changes.
> >Cons: After a crash, i_version can go backwards (but when the file changes
> >the i_version, i_ctime pair should still be different) or not, data can be
> >old or not.
> 
> I like to think of this approach as using the i_version as an extension
> to the i_ctime.
> i_ctime doesn't necessarily change on every file modification, either
> because it is not a modification that is meant to change i_ctime, or
> because i_ctime doesn't have the resolution to show a very small change
> in time, or because the clock that is used to update i_ctime doesn't
> have much resolution.
> So when a change happens, if the stored c_time changes, set i_version to
> zero, otherwise increment i_version.
> Then the externally visible i_version is a combination of the stored
> c_time and the stored i_version.
> If you only used 1-second ctime resolution for versioning purposes, you
> could provide a 64bit i_version as 34 bits of ctime and 30 bits of
> changes-in-one-second.
> It is important that the resolution of ctime used is less than the
> fastest possible restart after a crash.
> 
> I don't think that i_version going backwards should be a problem, as
> long as an old version means exactly the same old data.  Presumably
> journalling would ensure that the data and ctime/version are updated
> atomically.
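
A minimal sketch of the 34+30-bit packing Neil describes above (purely
illustrative; the field widths are as given in the quoted text):

    #include <stdint.h>

    /* 64-bit visible version: 34 bits of ctime seconds, 30-bit counter */
    static uint64_t pack_version(uint64_t ctime_sec, uint32_t changes)
    {
        return ((ctime_sec & ((1ULL << 34) - 1)) << 30) |
               (changes & ((1U << 30) - 1));
    }

    /* On each change: reset the counter whenever the stored ctime moves. */
    static uint32_t next_counter(uint64_t old_sec, uint64_t new_sec,
                                 uint32_t changes)
    {
        return new_sec != old_sec ? 0 : changes + 1;
    }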

So as Dave and I wrote earlier in this thread, journalling does not ensure
data vs ctime/version consistency (well, except for ext4 in data=journal
mode but people rarely run that due to performance implications). So you
can get old data and new version as well as new data and old version after
a crash. The only thing filesystems guarantee is that you will not see
uninitialized blocks and that fsync makes both data & ctime/version
persistent. But as Bruce wrote, for NFS open-to-close semantics this may
actually be good enough.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread NeilBrown
On Wed, Apr 05 2017, Jan Kara wrote:

> On Wed 05-04-17 11:43:32, NeilBrown wrote:
>> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>> 
>> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> >> > > > Because if above is acceptable we could make reported i_version to 
>> >> > > > be a sum
>> >> > > > of "superblock crash counter" and "inode i_version". We increment
>> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
>> >> > > > shutdown.
>> >> > > > That way after a crash we are guaranteed each inode will report new
>> >> > > > i_version (the sum would probably have to look like "superblock 
>> >> > > > crash
>> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
>> >> > > > possible
>> >> > > > i_version numbers we gave away but did not write to disk but 
>> >> > > > still...).
>> >> > > > Thoughts?
>> >> > 
>> >> > How hard is this for filesystems to support?  Do they need an on-disk
>> >> > format change to keep track of the crash counter?  Maybe not, maybe the
>> >> > high bits of the i_version counters are all they need.
>> >> > 
>> >> 
>> >> Yeah, I imagine we'd need an on-disk change for this unless there's
>> >> something already present that we could use in place of a crash counter.
>> >
>> > We could consider using the current time instead.  So, put the current
>> > time (or time of last boot, or this inode's ctime, or something) in the
>> > high bits of the change attribute, and keep the low bits as a counter.
>> 
>> This is a very different proposal.
>> I don't think Jan was suggesting that the i_version be split into two
>> bit fields, one the change-counter and one the crash-counter.
>> Rather, the crash-counter was multiplied by a large-number and added to
>> the change-counter with the expectation that while not ever
>> change-counter landed on disk, at least 1 in every large-number would.
>> So after each crash we effectively add large-number to the
>> change-counter, and can be sure that number hasn't been used already.
>
> Yes, that was my thinking.
>
>> To store the crash-counter in each inode (which does appeal) you would
>> need to be able to remove it before adding the new crash counter, and
>> that requires bit-fields.  Maybe there are enough bits.
>
> Furthermore you'd have a potential problem that you need to change
> i_version on disk just because you are reading after a crash and such
> changes tend to be problematic (think of read-only mounts and stuff like
> that).
>  
>> If you want to ensure read-only files can remain cached over a crash,
>> then you would have to mark a file in some way on stable storage
>> *before* allowing any change.
>> e.g. you could use the lsb.  Odd i_versions might have been changed
>> recently and crash-count*large-number needs to be added.
>> Even i_versions have not been changed recently and nothing need be
>> added.
>> 
>> If you want to change a file with an even i_version, you subtract
>>   crash-count*large-number
>> from the i_version, then set lsb.  This is written to stable storage before
>> the change.
>> 
>> If a file has not been changed for a while, you can add
>>   crash-count*large-number
>> and clear lsb.
>> 
>> The lsb of the i_version would be for internal use only.  It would not
>> be visible outside the filesystem.
>> 
>> It feels a bit clunky, but I think it would work and is the best
>> combination of Jan's idea and your requirement.
>> The biggest cost would be switching to 'odd' before any changes, and the
>> unknown is when it makes sense to switch to 'even'.
>
> Well, there is also a problem that you would need to somehow remember with
> which 'crash count' the i_version has been previously reported as that is
> not stored on disk with my scheme. So I don't think we can easily use your
> scheme.

I don't think there is a problem here; maybe I didn't explain it
properly or something.

I'm assuming there is a crash-count that is stored once per filesystem.
This might be a disk-format change, or maybe the "Last checked" time
could be used with ext4 (that is a bit horrible though).

Every on-disk i_version has a flag to choose between:
  - use this number as it is, but update it on-disk before any change
  - add multiple of current crash-count to this number before use.
  If you crash during an update, the i_version is thus automatically
  increased.

To change from the first option to the second option you subtract the
multiple of the current crash-count (which might make the stored
i_version negative), and flip the bit.
To change from the second option to the first, you add the multiple
of the current crash-count, and flip the bit.
In each case, the externally visible i_version does not change.
Nothing needs to be stored except the per-inode i_version and the per-fs
crash_count.

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 10:05:51AM +0200, Jan Kara wrote:
> 1) Keep i_version as is, make clients also check for i_ctime.

That would be a protocol revision, which we'd definitely rather avoid.

But can't we accomplish the same by using something like

ctime * (some constant) + i_version

?

>Pro: No on-disk format changes.
>Cons: After a crash, i_version can go backwards (but when the file changes
>the i_version, i_ctime pair should still be different) or not, data can be
>old or not.

This is probably good enough for NFS purposes: typically on an NFS
filesystem, results of a read in the face of a concurrent write open are
undefined.  And writers sync before close.

So after a crash with a dirty inode, we're in a situation where an NFS
client still needs to resend some writes, sync, and close.  I'm OK with
things being inconsistent during this window.

I do expect things to return to normal once that client has resent its
writes--hence the worry about actually reusing old values after boot
(such as if i_version regresses on boot and then increments back to the
same value after further writes).  Factoring in ctime fixes that.

> 2) Fsync when reporting i_version.
>Pro: No on-disk format changes, strong consistency of i_version and
> data.
>Cons: Difficult to implement for filesystems due to locking constraints.
>  High performance overhead of i_version reporting.

Sounds painful.

> 3) Some variant of crash counter.
>Pro: i_version cannot go backwards.
>Cons: Requires on-disk format changes. After a crash data can be old
>  (however i_version increased).

Also, some unnecessary invalidations, which maybe can be mostly avoided
by some variation of Neil's scheme.

It looks to me like option (1) is doable now and introduces no
regressions compared to the current situation.  (2) and (3) are more
complicated and involve some tradeoffs.

Also, we can implement (1) and switch to (2) or (3) later.  We'd want to
do it without reported i_versions decreasing on kernel upgrade, but
there are multiple ways of handling that.  (Worst case, just restrict
the change to newly created filesystems.)

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread J. Bruce Fields
On Wed, Apr 05, 2017 at 11:43:32AM +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
> 
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to 
> >> > > > be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
> >> > > > shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
> >> > > > possible
> >> > > > i_version numbers we gave away but did not write to disk but 
> >> > > > still...).
> >> > > > Thoughts?
> >> > 
> >> > How hard is this for filesystems to support?  Do they need an on-disk
> >> > format change to keep track of the crash counter?  Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> > 
> >> 
> >> Yeah, I imagine we'd need an on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead.  So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
> 
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not every
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

I was sort of ignoring the distinction between concatenate(A,B) and
A*m+B, but, sure, multiplying's probably better.
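
For illustration, the two encodings being compared (the 16-bit split and
LARGE are example values only):

    #include <stdint.h>

    #define LARGE (1ULL << 16)

    /* concatenate(A,B): the low field silently wraps at 2^16 changes */
    static uint64_t concat_version(uint64_t crash, uint64_t counter)
    {
        return (crash << 16) | (counter & 0xffff);
    }

    /* A*m+B: counter overflow carries into the high part instead */
    static uint64_t muladd_version(uint64_t crash, uint64_t counter)
    {
        return crash * LARGE + counter;
    }

With the multiply-add form, a counter that outgrows LARGE borrows from the
crash-counter space, which is harmless as long as it only moves forward.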

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields.  Maybe there are enough bits.

i_version and the NFSv4 change attribute are 64 bits which gives us a
fair amount of flexibility.

> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb.  Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
> 
> If you want to change a file with an even i_version, you subtract
>   crash-count*large-number
> from the i_version, then set lsb.  This is written to stable storage before
> the change.
> 
> If a file has not been changed for a while, you can add
>   crash-count*large-number
> and clear lsb.
> 
> The lsb of the i_version would be for internal use only.  It would not
> be visible outside the filesystem.
> 
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before any changes, and the
> unknown is when it makes sense to switch to 'even'.

I'm not sure how to model the costs.  Something like that might work.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-05 Thread Jan Kara
On Wed 05-04-17 11:43:32, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
> 
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to 
> >> > > > be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem 
> >> > > > shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
> >> > > > possible
> >> > > > i_version numbers we gave away but did not write to disk but 
> >> > > > still...).
> >> > > > Thoughts?
> >> > 
> >> > How hard is this for filesystems to support?  Do they need an on-disk
> >> > format change to keep track of the crash counter?  Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> > 
> >> 
> >> Yeah, I imagine we'd need an on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead.  So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
> 
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not every
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

Yes, that was my thinking.

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields.  Maybe there are enough bits.

Furthermore you'd have a potential problem that you need to change
i_version on disk just because you are reading after a crash and such
changes tend to be problematic (think of read-only mounts and stuff like
that).
 
> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb.  Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
> 
> If you want to change a file with an even i_version, you subtract
>   crash-count*large-number
> from the i_version, then set lsb.  This is written to stable storage before
> the change.
> 
> If a file has not been changed for a while, you can add
>   crash-count*large-number
> and clear lsb.
> 
> The lsb of the i_version would be for internal use only.  It would not
> be visible outside the filesystem.
> 
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before any changes, and the
> unknown is when it makes sense to switch to 'even'.

Well, there is also a problem that you would need to somehow remember with
which 'crash count' the i_version has been previously reported as that is
not stored on disk with my scheme. So I don't think we can easily use your
scheme.

So the options we have are:

1) Keep i_version as is, make clients also check for i_ctime.
   Pro: No on-disk format changes.
   Cons: After a crash, i_version can go backwards (but when the file changes
   the i_version, i_ctime pair should still be different) or not, data can be
   old or not.

2) Fsync when reporting i_version.
   Pro: No on-disk format changes, strong consistency of i_version and
   data.
   Cons: Difficult to implement for filesystems due to locking constraints.
   High performance overhead of i_version reporting.

3) Some variant of crash counter.
   Pro: i_version cannot go backwards.
   Cons: Requires on-disk format changes. After a crash data can be old
   (however i_version increased).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread NeilBrown
On Tue, Apr 04 2017, J. Bruce Fields wrote:

> On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > Because if above is acceptable we could make reported i_version to be 
>> > > > a sum
>> > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > "superblock crash counter" whenever we detect unclean filesystem 
>> > > > shutdown.
>> > > > That way after a crash we are guaranteed each inode will report new
>> > > > i_version (the sum would probably have to look like "superblock crash
>> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> > > > i_version numbers we gave away but did not write to disk but still...).
>> > > > Thoughts?
>> > 
>> > How hard is this for filesystems to support?  Do they need an on-disk
>> > format change to keep track of the crash counter?  Maybe not, maybe the
>> > high bits of the i_version counters are all they need.
>> > 
>> 
>> Yeah, I imagine we'd need an on-disk change for this unless there's
>> something already present that we could use in place of a crash counter.
>
> We could consider using the current time instead.  So, put the current
> time (or time of last boot, or this inode's ctime, or something) in the
> high bits of the change attribute, and keep the low bits as a counter.

This is a very different proposal.
I don't think Jan was suggesting that the i_version be split into two
bit fields, one the change-counter and one the crash-counter.
Rather, the crash-counter was multiplied by a large-number and added to
the change-counter with the expectation that while not every
change-counter landed on disk, at least 1 in every large-number would.
So after each crash we effectively add large-number to the
change-counter, and can be sure that number hasn't been used already.

To store the crash-counter in each inode (which does appeal) you would
need to be able to remove it before adding the new crash counter, and
that requires bit-fields.  Maybe there are enough bits.

If you want to ensure read-only files can remain cached over a crash,
then you would have to mark a file in some way on stable storage
*before* allowing any change.
e.g. you could use the lsb.  Odd i_versions might have been changed
recently and crash-count*large-number needs to be added.
Even i_versions have not been changed recently and nothing need be
added.

If you want to change a file with an even i_version, you subtract
  crash-count*large-number
from the i_version, then set lsb.  This is written to stable storage before
the change.

If a file has not been changed for a while, you can add
  crash-count*large-number
and clear lsb.

The lsb of the i_version would be for internal use only.  It would not
be visible outside the filesystem.

It feels a bit clunky, but I think it would work and is the best
combination of Jan's idea and your requirement.
The biggest cost would be switching to 'odd' before any changes, and the
unknown is when it makes sense to switch to 'even'.

NeilBrown


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread NeilBrown
On Tue, Apr 04 2017, Dave Chinner wrote:

> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
>> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
>> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
>> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> > > > > Because if above is acceptable we could make reported i_version to 
>> > > > > be a sum
>> > > > > of "superblock crash counter" and "inode i_version". We increment
>> > > > > "superblock crash counter" whenever we detect unclean filesystem 
>> > > > > shutdown.
>> > > > > That way after a crash we are guaranteed each inode will report new
>> > > > > i_version (the sum would probably have to look like "superblock crash
>> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing 
>> > > > > possible
>> > > > > i_version numbers we gave away but did not write to disk but 
>> > > > > still...).
>> > > > > Thoughts?
>> > > 
>> > > How hard is this for filesystems to support?  Do they need an on-disk
>> > > format change to keep track of the crash counter?
>> > 
>> > Yes. We'll need version counter in the superblock, and we'll need to
>> > know what the increment semantics are. 
>> > 
>> > The big question is how do we know there was a crash? The only thing
>> > a journalling filesystem knows at mount time is whether it is clean
>> > or requires recovery. Filesystems can require recovery for many
>> > reasons that don't involve a crash (e.g. root fs is never unmounted
>> > cleanly, so always requires recovery). Further, some filesystems may
>> > not even know there was a crash at mount time because their
>> > architecture always leaves a consistent filesystem on disk (e.g. COW
>> > filesystems)
>> 
>> What filesystems can or cannot easily do obviously differs. Ext4 has a
>> recovery flag set in superblock on RW mount/remount and cleared on
>> umount/RO remount.
>
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem sucessfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro. This uncovered a bug in grub in

Filesystems could use register_reboot_notifier() to get a notification
that even systemd cannot stuff-up.  It could check for dirty data and, if
there is none (which there shouldn't be if a sync happened), do a
single write to disk to update the superblock (or a single write to each
disk... or something).
md does this, because getting the root device to be marked read-only is
even harder than getting the root filesystem to be remounted read-only.
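
A sketch of what that hook could look like (register_reboot_notifier() and
unregister_reboot_notifier() are the real kernel API; the myfs_* helpers
are hypothetical stand-ins for filesystem-specific code):

    #include <linux/notifier.h>
    #include <linux/reboot.h>

    static int myfs_reboot_notify(struct notifier_block *nb,
                                  unsigned long action, void *data)
    {
        if (!myfs_has_dirty_data())       /* hypothetical dirty check */
            myfs_write_clean_flag();      /* one superblock write */
        return NOTIFY_DONE;
    }

    static struct notifier_block myfs_reboot_nb = {
        .notifier_call = myfs_reboot_notify,
    };

    static int myfs_mount_setup(void)
    {
        return register_reboot_notifier(&myfs_reboot_nb);
    }

    static void myfs_umount_teardown(void)
    {
        unregister_reboot_notifier(&myfs_reboot_nb);
    }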

NeilBrown


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 10:41:37AM +1100, Dave Chinner wrote:
> On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were 
> > > > > > > reboots
> > > > > > >   between the two i_version checks.
> > > > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them?  What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll 
> > > > > > mark
> > > > > > the inode dirty and I think that will end up with the new i_version 
> > > > > > at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > > 
> > > > > So you think the filesystem can provide the atomicity?  In more 
> > > > > detail:
> > > > > 
> > > > 
> > > > Sorry, I hit send too quickly. That should have read:
> > > > 
> > > > "Yeah, there could be atomicity issues there."
> > > > 
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > > 
> > > Just to complement what Dave said from ext4 side - similarly as with XFS
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until that you can see arbitrary combination of data & i_version after the
> > > crash. We do take care to keep data and metadata in sync only when there
> > > are security implications to that (like exposing uninitialized disk 
> > > blocks)
> > > and if not, we are as lazy as we can to improve performance...
> > > 
> > > 
> > 
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version in the getattr
> > codepath. It's not pretty, but I don't see a real alternative.
> 
> I think that's even more problematic. ->getattr currently runs
> completely unlocked for performance reasons - it's racy w.r.t. to
> ongoing modifications to begin with, so /nothing/ that is returned
> to userspace via stat/statx can be guaranteed to be "coherent".
> Linus will be very unhappy if you make his git workload (which is
> /very/ stat heavy) run slower by adding any sort of locking in this
> hot path.
> 
> Even if we did put an fsync() into ->getattr() (and dealt with all
> the locking issues that entails), by the time the statx syscall
> returns to userspace the i_version value may not match the
> data/metadata in the inode(*).  IOWs, by the time i_version gets
> to userspace, it is out of date and any use of it for data
> versioning from userspace is going to be prone to race conditions.

A slightly out-of-date i_version is fine, I think.  It's just the
reverse we want to avoid.  E.g., assuming an i_version-supporting
statx, if somebody called statx then read, and got the new
i_version followed by the old data, that would cause problems.
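
For illustration, one way a userspace consumer could guard against that
ordering, assuming a hypothetical get_version() wrapper around an
i_version-supporting statx (not an existing interface):

    #include <stdint.h>
    #include <unistd.h>

    extern uint64_t get_version(int fd);  /* hypothetical statx wrapper */

    /* Only trust data whose version matches before and after the read. */
    static ssize_t read_consistent(int fd, void *buf, size_t len)
    {
        uint64_t v1, v2;
        ssize_t n;

        do {
            v1 = get_version(fd);
            n = pread(fd, buf, len, 0);
            v2 = get_version(fd);
        } while (n >= 0 && v1 != v2);   /* retry if it changed mid-read */

        return n;
    }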

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > Because if above is acceptable we could make reported i_version to be a sum
> > > > of "superblock crash counter" and "inode i_version". We increment
> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > > That way after a crash we are guaranteed each inode will report new
> > > > i_version (the sum would probably have to look like "superblock crash
> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > i_version numbers we gave away but did not write to disk but still...).
> > > > Thoughts?
> > 
> > How hard is this for filesystems to support?  Do they need an on-disk
> > format change to keep track of the crash counter?  Maybe not, maybe the
> > high bits of the i_version counters are all they need.
> > 
> 
Yeah, I imagine we'd need an on-disk change for this unless there's
something already present that we could use in place of a crash counter.

We could consider using the current time instead.  So, put the current
time (or time of last boot, or this inode's ctime, or something) in the
high bits of the change attribute, and keep the low bits as a counter.

Then as long as we trust our clock, we're no longer at risk of reusing
an i_version value.
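
A minimal sketch of that layout, assuming the timestamp in the high 32
bits and the counter in the low 32; the split and the helper name are
illustrative only, nothing the thread settled on:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define LO_BITS 32
#define LO_MASK ((UINT64_C(1) << LO_BITS) - 1)

/* high bits: a timestamp (or crash counter); low bits: the counter */
static uint64_t encode_change_attr(uint64_t hi, uint64_t counter)
{
	return (hi << LO_BITS) | (counter & LO_MASK);
}

int main(void)
{
	uint64_t stamp = (uint64_t)time(NULL) & LO_MASK;
	uint64_t v1 = encode_change_attr(stamp, 1);
	uint64_t v2 = encode_change_attr(stamp, 2);

	/* monotonic within a boot, and a later boot's stamp dominates
	 * any counter value handed out (but never persisted) before it */
	printf("v1=%llu v2=%llu\n",
	       (unsigned long long)v1, (unsigned long long)v2);
	return 0;
}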

> > I guess we just want to have some back-of-the-envelope estimates of
> > maximum number of i_version increments possible between crashes and
> > maximum number of crashes possible over lifetime of a filesystem, to
> > decide how to split up the bits.
> > 
> > I wonder if we could get away with using the new crash counter only for
> > *new* values of the i_version?  After a crash, use the on disk i_version
> > as is, and put off using the new crash counter until the next time the
> > file's modified.
> > 
> 
> That sounds difficult to get right. Suppose I have an inode that has not
> been updated in a long time. Someone writes to it and then queries the
> i_version. How do I know whether there were crashes since the last time
> I updated it? Or am I misunderstanding what you're proposing here?

I believe Jan was suggesting that we keep the i_version as-is on disk
but combine it with the crash counter when it's queried.

I was suggesting instead that on write, when we bump the i_version, we
replace the simple increment by something that increments *and* sticks
the current crash counter (or maybe just a time) in the high bits.  And
that combination will be what goes in i_version and is written to disk.

That guarantees that we never reuse an old value when we increment.
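
Sketched as code, the difference is where the combination happens;
bump_i_version() below is a hypothetical illustration of this
write-time variant, not an existing kernel helper:

#include <stdint.h>
#include <stdio.h>

#define LO_BITS 48
#define LO_MASK ((UINT64_C(1) << LO_BITS) - 1)

/* increment the low counter and stamp the current crash counter (or a
 * time) into the high bits; the result is stored in i_version and goes
 * to disk, so post-crash values always sort above pre-crash ones */
static uint64_t bump_i_version(uint64_t old, uint64_t crash_counter)
{
	return (crash_counter << LO_BITS) | ((old + 1) & LO_MASK);
}

int main(void)
{
	uint64_t v = 100;             /* on-disk value from before a crash */
	v = bump_i_version(v, 7);     /* first modification after reboot */
	printf("i_version after bump: %llu\n", (unsigned long long)v);
	return 0;
}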

It also avoids having to invalidate absolutely every cache, even of
completely static files.

Clients could still see the change attribute go backwards, though, if
they query a file with dirty data and the server crashes before it
writes out the new change attribute.

Also, they could fail to detect data that reverted after boot if they
cache data while the file's dirty on the server side, and the server
crash preserves the i_version update but not the new data.

Presumably both are possible already for ctime (depending on the
filesystem), and I'm not convinced they're a big problem.

In both cases the reading client is caching while somebody's still
writing, and as long as the writer comes back and finishes its job,
readers will thereafter see the right thing.  So I think it's adequate
for close-to-open.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread J. Bruce Fields
On Tue, Apr 04, 2017 at 10:34:14PM +1000, Dave Chinner wrote:
> On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> > What filesystems can or cannot easily do obviously differs. Ext4 has a
> > recovery flag set in superblock on RW mount/remount and cleared on
> > umount/RO remount.
> 
> Even this doesn't help. A recent bug that was reported to the XFS
> list - turns out that systemd can't remount-ro the root
> filesystem successfully on shutdown because there are open write fds
> on the root filesystem when it attempts the remount. So it just
> reboots without a remount-ro.

I'd certainly rather not invalidate caches on *every* boot.

On the other hand, if the only cases involve the root filesystem, then
from the point of view of NFS, we probably don't care much.

> > This flag being set on mount would imply incrementing the crash
> > counter. It should be pretty easy for each filesystem to implement
> > such a flag and the counter, but I agree it requires an on-disk
> > format change.
> 
> Yup, anything we want that is persistent and consistent across
> filesystems will need on-disk format changes. Hence we need a solid
> specification first, not to mention tests to validate correct
> behaviour across all filesystems in xfstests...

For xfstests we'll need a way to get i_version (patch it into statx, I
guess?).  Ideally we'd have a way to test behavior across crashes too;
any advice there?
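
One possible shape for such a test, assuming a hypothetical
STATX_VERSION request bit and an stx_version field eventually land in
statx(2) (neither exists at this point); probe_version() below is a
canned stand-in so the comparison logic is runnable:

#include <stdint.h>
#include <stdio.h>

#define STATX_VERSION 0x00000800U  /* hypothetical request bit, not in the ABI */

/* stand-in for statx(2) until a kernel exposes i_version; a real test
 * would request STATX_VERSION and read a stx_version field */
static uint64_t probe_version(const char *path)
{
	static uint64_t canned = 41;  /* fake, monotonically increasing */
	(void)path;
	return ++canned;
}

int main(void)
{
	uint64_t before = probe_version("testfile");
	/* ... the test would modify the file here (write, truncate, ...) ... */
	uint64_t after = probe_version("testfile");

	printf("i_version %s across the operation\n",
	       after > before ? "increased" : "DID NOT increase");
	return 0;
}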

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-04 Thread Dave Chinner
On Mon, Apr 03, 2017 at 04:00:55PM +0200, Jan Kara wrote:
> On Sun 02-04-17 09:05:26, Dave Chinner wrote:
> > On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> > > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > > Because if above is acceptable we could make reported i_version to be a sum
> > > > > of "superblock crash counter" and "inode i_version". We increment
> > > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > > > That way after a crash we are guaranteed each inode will report new
> > > > > i_version (the sum would probably have to look like "superblock crash
> > > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > > i_version numbers we gave away but did not write to disk but still...).
> > > > > Thoughts?
> > > 
> > > How hard is this for filesystems to support?  Do they need an on-disk
> > > format change to keep track of the crash counter?
> > 
> > Yes. We'll need a version counter in the superblock, and we'll need to
> > know what the increment semantics are. 
> > 
> > The big question is how do we know there was a crash? The only thing
> > a journalling filesystem knows at mount time is whether it is clean
> > or requires recovery. Filesystems can require recovery for many
> > reasons that don't involve a crash (e.g. root fs is never unmounted
> > cleanly, so always requires recovery). Further, some filesystems may
> > not even know there was a crash at mount time because their
> > architecture always leaves a consistent filesystem on disk (e.g. COW
> > filesystems).
> 
> What filesystems can or cannot easily do obviously differs. Ext4 has a
> recovery flag set in superblock on RW mount/remount and cleared on
> umount/RO remount.

Even this doesn't help. A recent bug that was reported to the XFS
list - it turns out that systemd can't remount-ro the root
filesystem successfully on shutdown because there are open write fds
on the root filesystem when it attempts the remount. So it just
reboots without a remount-ro. This uncovered a bug in grub, in
that it (still!) thinks sync(1) is sufficient to get all the
metadata that points to a kernel image onto disk in places it can
read. XFS, like ext4, leaves it in the journal; since systemd didn't
remount-ro the root fs, the journal was never flushed before reboot,
grub can't find the kernel, and everything fails.

> This flag being set on mount would imply incrementing the crash
> counter. It should be pretty easy for each filesystem to implement
> such a flag and the counter, but I agree it requires an on-disk
> format change.

Yup, anything we want that is persistent and consistent across
filesystems will need on-disk format changes. Hence we need a solid
specification first, not to mention tests to validate correct
behaviour across all filesystems in xfstests...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-03 Thread Jan Kara
On Sun 02-04-17 09:05:26, Dave Chinner wrote:
> On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > > Because if above is acceptable we could make reported i_version to be a sum
> > > > of "superblock crash counter" and "inode i_version". We increment
> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > > That way after a crash we are guaranteed each inode will report new
> > > > i_version (the sum would probably have to look like "superblock crash
> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > > i_version numbers we gave away but did not write to disk but still...).
> > > > Thoughts?
> > 
> > How hard is this for filesystems to support?  Do they need an on-disk
> > format change to keep track of the crash counter?
> 
> Yes. We'll need a version counter in the superblock, and we'll need to
> know what the increment semantics are. 
> 
> The big question is how do we know there was a crash? The only thing
> a journalling filesystem knows at mount time is whether it is clean
> or requires recovery. Filesystems can require recovery for many
> reasons that don't involve a crash (e.g. root fs is never unmounted
> cleanly, so always requires recovery). Further, some filesystems may
> not even know there was a crash at mount time because their
> architecture always leaves a consistent filesystem on disk (e.g. COW
> filesystems).

What filesystems can or cannot easily do obviously differs. Ext4 has a
recovery flag set in superblock on RW mount/remount and cleared on
umount/RO remount. This flag being set on mount would imply incrementing
the crash counter. It should be pretty easy for each filesystem to
implement such a flag and the counter, but I agree it requires an on-disk
format change.
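
A rough sketch of that scheme in miniature (SB_FLAG_RECOVERY, sb_lite,
and the helpers are made-up names; a real filesystem would persist these
in its own superblock format):

#include <stdint.h>
#include <stdio.h>

#define SB_FLAG_RECOVERY 0x1u

struct sb_lite {
	uint32_t flags;
	uint32_t crash_counter;   /* persisted alongside the flag */
};

static void mount_rw(struct sb_lite *sb)
{
	/* flag still set => the previous shutdown never cleared it */
	if (sb->flags & SB_FLAG_RECOVERY)
		sb->crash_counter++;
	sb->flags |= SB_FLAG_RECOVERY;
	/* ...write superblock back to disk here... */
}

static void unmount_clean(struct sb_lite *sb)
{
	sb->flags &= ~SB_FLAG_RECOVERY;
	/* ...write superblock back to disk here... */
}

int main(void)
{
	struct sb_lite sb = { 0, 0 };

	mount_rw(&sb); unmount_clean(&sb);   /* clean cycle: no bump */
	mount_rw(&sb);                       /* crash: no unmount_clean() */
	mount_rw(&sb);                       /* unclean: counter bumps */

	printf("crash_counter = %u\n", sb.crash_counter);   /* prints 1 */
	return 0;
}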
 
Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-04-01 Thread Dave Chinner
On Thu, Mar 30, 2017 at 12:12:31PM -0400, J. Bruce Fields wrote:
> On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > Because if above is acceptable we could make reported i_version to be a sum
> > > of "superblock crash counter" and "inode i_version". We increment
> > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > That way after a crash we are guaranteed each inode will report new
> > > i_version (the sum would probably have to look like "superblock crash
> > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > i_version numbers we gave away but did not write to disk but still...).
> > > Thoughts?
> 
> How hard is this for filesystems to support?  Do they need an on-disk
> format change to keep track of the crash counter?

Yes. We'll need a version counter in the superblock, and we'll need to
know what the increment semantics are.

The big question is how do we know there was a crash? The only thing
a journalling filesystem knows at mount time is whether it is clean
or requires recovery. Filesystems can require recovery for many
reasons that don't involve a crash (e.g. root fs is never unmounted
cleanly, so always requires recovery). Further, some filesystems may
not even know there was a crash at mount time because their
architecture always leaves a consistent filesystem on disk (e.g. COW
filesystems).

> I wonder if repeated crashes can lead to any odd corner cases.

Without defined, locked-down behaviour of the superblock counter,
corner cases will almost certainly exist...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Boaz Harrosh
On 03/30/2017 09:35 PM, Jeff Layton wrote:
<>
> Yeah, I imagine we'd need a on-disk change for this unless there's
> something already present that we could use in place of a crash counter.
> 

Perhaps we can use s_mtime and/or s_wtime in some way; I'm not sure
what the parallel for those is in xfs.
s_mtime - time of last mount
s_wtime - time of last mount, umount, freeze, unfreeze, remount, ...
Of course you'll need a per-FS vector to access that.

But this will need some math foo to get the bits compacted correctly.
Just a thought.

Thanks
Boaz



Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Jeff Layton
On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > > Hum, so are we fine if i_version just changes (increases) for all inodes
> > > after a server crash? If I understand its use right, it would mean
> > > invalidation of all clients' caches, but that is not such a big deal given
> > > how infrequent server crashes should be, right?
> 
> Even if it's rare, it may be really painful when all your clients are
> forced to throw out and repopulate their caches after a crash.  But,
> yes, maybe we can live with it.
> 

Yeah, assuming that normal reboots wouldn't cause this, I don't see
it as being too bad.

> > > Because if above is acceptable we could make reported i_version to be a sum
> > > of "superblock crash counter" and "inode i_version". We increment
> > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > > That way after a crash we are guaranteed each inode will report new
> > > i_version (the sum would probably have to look like "superblock crash
> > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > > i_version numbers we gave away but did not write to disk but still...).
> > > Thoughts?
> 
> How hard is this for filesystems to support?  Do they need an on-disk
> format change to keep track of the crash counter?  Maybe not, maybe the
> high bits of the i_version counters are all they need.
> 

Yeah, I imagine we'd need an on-disk change for this unless there's
something already present that we could use in place of a crash counter.

> > That does sound like a good idea. This is a 64 bit value, so we should
> > be able to carve out some upper bits for a crash counter without risking
> > wrapping.
> > 
> > The other constraint here is that we'd like any later version of the
> > counter to be larger than any earlier value that was handed out. I think
> > this idea would still satisfy that.
> 
> I guess we just want to have some back-of-the-envelope estimates of
> maximum number of i_version increments possible between crashes and
> maximum number of crashes possible over lifetime of a filesystem, to
> decide how to split up the bits.
> 
> I wonder if we could get away with using the new crash counter only for
> *new* values of the i_version?  After a crash, use the on disk i_version
> as is, and put off using the new crash counter until the next time the
> file's modified.
> 

That sounds difficult to get right. Suppose I have an inode that has not
been updated in a long time. Someone writes to it and then queries the
i_version. How do I know whether there were crashes since the last time
I updated it? Or am I misunderstanding what you're proposing here?

> That would still eliminate the risk of accidental reuse of an old
> i_version value.  It still leaves some cases where the client could fail
> to notice an update indefinitely.  All these cases I think have to
> assume that a writer made some changes that it failed to ever sync, so
> as long as we care only about close-to-open semantics perhaps those
> cases don't matter.
> 
> I wonder if repeated crashes can lead to any odd corner cases.
> 

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread J. Bruce Fields
On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> > Hum, so are we fine if i_version just changes (increases) for all inodes
> > after a server crash? If I understand its use right, it would mean
> > invalidation of all clients' caches, but that is not such a big deal given
> > how infrequent server crashes should be, right?

Even if it's rare, it may be really painful when all your clients are
forced to throw out and repopulate their caches after a crash.  But,
yes, maybe we can live with it.

> > Because if above is acceptable we could make reported i_version to be a sum
> > of "superblock crash counter" and "inode i_version". We increment
> > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> > That way after a crash we are guaranteed each inode will report new
> > i_version (the sum would probably have to look like "superblock crash
> > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> > i_version numbers we gave away but did not write to disk but still...).
> > Thoughts?

How hard is this for filesystems to support?  Do they need an on-disk
format change to keep track of the crash counter?  Maybe not, maybe the
high bits of the i_version counters are all they need.

> That does sound like a good idea. This is a 64 bit value, so we should
> be able to carve out some upper bits for a crash counter without risking
> wrapping.
> 
> The other constraint here is that we'd like any later version of the
> counter to be larger than any earlier value that was handed out. I think
> this idea would still satisfy that.

I guess we just want to have some back-of-the-envelope estimates of
maximum number of i_version increments possible between crashes and
maximum number of crashes possible over lifetime of a filesystem, to
decide how to split up the bits.
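
Running the numbers for one illustrative split (nothing here is from
the thread beyond the 64-bit total; the 16/48 choice and the 1M/s rate
are assumptions):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	unsigned crash_bits = 16;
	uint64_t max_crashes = UINT64_C(1) << crash_bits;
	uint64_t max_incr    = UINT64_C(1) << (64 - crash_bits);
	/* generous sustained rate of one million increments per second */
	uint64_t secs  = max_incr / 1000000;
	uint64_t years = secs / (UINT64_C(365) * 24 * 3600);

	printf("%" PRIu64 " crashes, %" PRIu64 " increments between crashes\n",
	       max_crashes, max_incr);
	printf("~%" PRIu64 " years to wrap at 1M increments/sec\n", years);
	return 0;
}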

I wonder if we could get away with using the new crash counter only for
*new* values of the i_version?  After a crash, use the on disk i_version
as is, and put off using the new crash counter until the next time the
file's modified.

That would still eliminate the risk of accidental reuse of an old
i_version value.  It still leaves some cases where the client could fail
to notice an update indefinitely.  All these cases I think have to
assume that a writer made some changes that it failed to ever sync, so
as long as we care only about close-to-open semantics perhaps those
cases don't matter.

I wonder if repeated crashes can lead to any odd corner cases.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Jeff Layton
On Thu, 2017-03-30 at 10:41 +1100, Dave Chinner wrote:
> On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > >   between the two i_version checks.
> > > > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them?  What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > > 
> > > > > So you think the filesystem can provide the atomicity?  In more detail:
> > > > > 
> > > > 
> > > > Sorry, I hit send too quickly. That should have read:
> > > > 
> > > > "Yeah, there could be atomicity issues there."
> > > > 
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > > 
> > > Just to complement what Dave said from the ext4 side - similarly to XFS,
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until then you can see an arbitrary combination of data & i_version after a
> > > crash. We take care to keep data and metadata in sync only when there
> > > are security implications (like exposing uninitialized disk blocks)
> > > and if not, we are as lazy as we can be to improve performance...
> > > 
> > > 
> > 
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version in the getattr
> > codepath. It's not pretty, but I don't see a real alternative.
> 
> I think that's even more problematic. ->getattr currently runs
> completely unlocked for performance reasons - it's racy w.r.t.
> ongoing modifications to begin with, so /nothing/ that is returned
> to userspace via stat/statx can be guaranteed to be "coherent".
> Linus will be very unhappy if you make his git workload (which is
> /very/ stat heavy) run slower by adding any sort of locking in this
> hot path.
> 
> Even if we did put an fsync() into ->getattr() (and dealt with all
> the locking issues that entails), by the time the statx syscall
> returns to userspace the i_version value may not match the
> data/metadata in the inode(*).  IOWs, by the time i_version gets
> to userspace, it is out of date and any use of it for data
> versioning from userspace is going to be prone to race conditions.
> 
> Cheers,
> 
> Dave.
> 
> (*) fiemap has exactly the same "stale the moment internal fs
> locks are released" race conditions, which is why it cannot safely
> be used for mapping holes when copying file data.
> 

FWIW, I'm not terribly worried about atomicity for all of the reasons
you describe. My main concern is reusing an i_version value that has
already been handed out when the inode is now in an entirely different
state. To that end, I was only considering doing an fsync iff
STATX_VERSION was requested. If it wasn't, we wouldn't need to do one.

A lot of ->getattr implementations already call filemap_write_and_wait,
but you're correct that flushing out the metadata is a different matter.
It'd be nice to avoid that if we can.
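
A rough user-space-flavoured sketch of that shape (STATX_VERSION and
every helper here are hypothetical stand-ins, not kernel API):

#include <stdint.h>
#include <stdio.h>

#define STATX_VERSION 0x00000800U   /* hypothetical request bit */

struct kstat_lite {
	uint64_t version;
};

static void flush_inode(void)         { puts("flush: i_version made durable"); }
static uint64_t current_version(void) { return 42; }

/* the expensive flush is paid only when the caller asked for the
 * change attribute, so plain stat()-heavy workloads are untouched */
static int getattr_sketch(unsigned int request_mask, struct kstat_lite *st)
{
	if (request_mask & STATX_VERSION) {
		flush_inode();
		st->version = current_version();
	}
	return 0;
}

int main(void)
{
	struct kstat_lite st = { 0 };

	getattr_sketch(0, &st);               /* ordinary stat: no flush */
	getattr_sketch(STATX_VERSION, &st);   /* version query: flush first */
	printf("version = %llu\n", (unsigned long long)st.version);
	return 0;
}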

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Jeff Layton
On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> On Wed 29-03-17 13:54:31, Jeff Layton wrote:
> > On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > > >   between the two i_version checks.
> > > > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > > > atomically with the changes that caused them?  What if a
> > > > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > > > it to disk, and then that value is reused after reboot?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > > least being journalled before __mark_inode_dirty returns.
> > > > > 
> > > > > So you think the filesystem can provide the atomicity?  In more detail:
> > > > > 
> > > > 
> > > > Sorry, I hit send too quickly. That should have read:
> > > > 
> > > > "Yeah, there could be atomicity issues there."
> > > > 
> > > > I think providing that level of atomicity may be difficult, though
> > > > maybe there's some way to make the querying of i_version block until
> > > > the inode update has been journalled?
> > > 
> > > Just to complement what Dave said from the ext4 side - similarly to XFS,
> > > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > > Until then you can see an arbitrary combination of data & i_version after a
> > > crash. We take care to keep data and metadata in sync only when there
> > > are security implications (like exposing uninitialized disk blocks)
> > > and if not, we are as lazy as we can be to improve performance...
> > > 
> > > 
> > 
> > Yeah, I think what we'll have to do here is ensure that those
> > filesystems do an fsync prior to reporting the i_version in the getattr
> > codepath. It's not pretty, but I don't see a real alternative.
> 
> Hum, so are we fine if i_version just changes (increases) for all inodes
> after a server crash? If I understand its use right, it would mean
> invalidation of all clients' caches, but that is not such a big deal given
> how infrequent server crashes should be, right?
> 
> Because if above is acceptable we could make reported i_version to be a sum
> of "superblock crash counter" and "inode i_version". We increment
> "superblock crash counter" whenever we detect unclean filesystem shutdown.
> That way after a crash we are guaranteed each inode will report new
> i_version (the sum would probably have to look like "superblock crash
> counter" * 65536 + "inode i_version" so that we avoid reusing possible
> i_version numbers we gave away but did not write to disk but still...).
> Thoughts?
> 

That does sound like a good idea. This is a 64 bit value, so we should
be able to carve out some upper bits for a crash counter without risking
wrapping.

The other constraint here is that we'd like any later version of the
counter to be larger than any earlier value that was handed out. I think
this idea would still satisfy that.
-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-30 Thread Jan Kara
On Wed 29-03-17 13:54:31, Jeff Layton wrote:
> On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > >   between the two i_version checks.
> > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > >   if there's a weaker guarantee that's still useful.  Do
> > > > > >   filesystems actually make ctime/mtime/i_version changes
> > > > > >   atomically with the changes that caused them?  What if a
> > > > > >   change attribute is exposed to an NFS client but doesn't make
> > > > > >   it to disk, and then that value is reused after reboot?
> > > > > > 
> > > > > 
> > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > least being journalled before __mark_inode_dirty returns.
> > > > 
> > > > So you think the filesystem can provide the atomicity?  In more detail:
> > > > 
> > > 
> > > Sorry, I hit send too quickly. That should have read:
> > > 
> > > "Yeah, there could be atomicity issues there."
> > > 
> > > I think providing that level of atomicity may be difficult, though
> > > maybe there's some way to make the querying of i_version block until
> > > the inode update has been journalled?
> > 
> > Just to complement what Dave said from the ext4 side - similarly to XFS,
> > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > Until then you can see an arbitrary combination of data & i_version after a
> > crash. We only take care to keep data and metadata in sync when there
> > are security implications (like exposing uninitialized disk blocks);
> > otherwise we are as lazy as we can be, to improve performance...
> > 
> > 
> 
> Yeah, I think what we'll have to do here is ensure that those
> filesystems do an fsync prior to reporting the i_version in the getattr
> codepath. It's not pretty, but I don't see a real alternative.

Hum, so are we fine if i_version just changes (increases) for all inodes
after a server crash? If I understand its use right, that would mean
invalidating all clients' caches, but that is not such a big deal given
how infrequent server crashes should be, right?

Because if the above is acceptable, we could make the reported i_version be
a sum of a "superblock crash counter" and the "inode i_version". We
increment the "superblock crash counter" whenever we detect an unclean
filesystem shutdown. That way we are guaranteed that after a crash each
inode will report a new i_version (the sum would probably have to look like
"superblock crash counter" * 65536 + "inode i_version" so that we avoid
reusing i_version numbers we gave away but did not write to disk, but
still...). Thoughts?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-29 Thread Dave Chinner
On Wed, Mar 29, 2017 at 01:54:31PM -0400, Jeff Layton wrote:
> On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> > On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > > - It's durable; the above comparison still works if there were reboots
> > > > > >   between the two i_version checks.
> > > > > > - I don't know how realistic this is--we may need to figure out
> > > > > >   if there's a weaker guarantee that's still useful.  Do
> > > > > >   filesystems actually make ctime/mtime/i_version changes
> > > > > >   atomically with the changes that caused them?  What if a
> > > > > >   change attribute is exposed to an NFS client but doesn't make
> > > > > >   it to disk, and then that value is reused after reboot?
> > > > > > 
> > > > > 
> > > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > > the inode dirty and I think that will end up with the new i_version at
> > > > > least being journalled before __mark_inode_dirty returns.
> > > > 
> > > > So you think the filesystem can provide the atomicity?  In more detail:
> > > > 
> > > 
> > > Sorry, I hit send too quickly. That should have read:
> > > 
> > > "Yeah, there could be atomicity issues there."
> > > 
> > > I think providing that level of atomicity may be difficult, though
> > > maybe there's some way to make the querying of i_version block until
> > > the inode update has been journalled?
> > 
> > Just to complement what Dave said from the ext4 side - similarly to XFS,
> > ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> > Until then you can see an arbitrary combination of data & i_version after a
> > crash. We only take care to keep data and metadata in sync when there
> > are security implications (like exposing uninitialized disk blocks);
> > otherwise we are as lazy as we can be, to improve performance...
> > 
> > 
> 
> Yeah, I think what we'll have to do here is ensure that those
> filesystems do an fsync prior to reporting the i_version in the getattr
> codepath. It's not pretty, but I don't see a real alternative.

I think that's even more problematic. ->getattr currently runs
completely unlocked for performance reasons - it's racy w.r.t.
ongoing modifications to begin with, so /nothing/ that is returned
to userspace via stat/statx can be guaranteed to be "coherent".
Linus will be very unhappy if you make his git workload (which is
/very/ stat heavy) run slower by adding any sort of locking in this
hot path.

Even if we did put an fsync() into ->getattr() (and dealt with all
the locking issues that entails), by the time the statx syscall
returns to userspace the i_version value may not match the
data/metadata in the inode(*).  IOWs, by the time i_version gets
to userspace, it is out of date and any use of it for data
versioning from userspace is going to be prone to race conditions.

Cheers,

Dave.

(*) fiemap has exactly the same "stale the moment internal fs
locks are released" race conditions, which is why it cannot safely
be used for mapping holes when copying file data.

-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-29 Thread Jeff Layton
On Wed, 2017-03-29 at 13:15 +0200, Jan Kara wrote:
> On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > - It's durable; the above comparison still works if there were reboots
> > > > >   between the two i_version checks.
> > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > atomically with the changes that caused them?  What if a
> > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > it to disk, and then that value is reused after reboot?
> > > > > 
> > > > 
> > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > the inode dirty and I think that will end up with the new i_version at
> > > > least being journalled before __mark_inode_dirty returns.
> > > 
> > > So you think the filesystem can provide the atomicity?  In more detail:
> > > 
> > 
> > Sorry, I hit send too quickly. That should have read:
> > 
> > "Yeah, there could be atomicity issues there."
> > 
> > I think providing that level of atomicity may be difficult, though
> > maybe there's some way to make the querying of i_version block until
> > the inode update has been journalled?
> 
> Just to complement what Dave said from the ext4 side - similarly to XFS,
> ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
> Until then you can see an arbitrary combination of data & i_version after a
> crash. We only take care to keep data and metadata in sync when there
> are security implications (like exposing uninitialized disk blocks);
> otherwise we are as lazy as we can be, to improve performance...
> 
> 

Yeah, I think what we'll have to do here is ensure that those
filesystems do an fsync prior to reporting the i_version in the getattr
codepath. It's not pretty, but I don't see a real alternative.

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-29 Thread Jan Kara
On Tue 21-03-17 14:46:53, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > >   between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > >   if there's a weaker guarantee that's still useful.  Do
> > > >   filesystems actually make ctime/mtime/i_version changes
> > > >   atomically with the changes that caused them?  What if a
> > > >   change attribute is exposed to an NFS client but doesn't make
> > > >   it to disk, and then that value is reused after reboot?
> > > > 
> > > 
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> > 
> > So you think the filesystem can provide the atomicity?  In more detail:
> > 
> 
> Sorry, I hit send too quickly. That should have read:
> 
> "Yeah, there could be atomicity issues there."
> 
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?

Just to complement what Dave said from the ext4 side - similarly to XFS,
ext4 doesn't guarantee atomicity unless fsync() has completed on the file.
Until then you can see an arbitrary combination of data & i_version after a
crash. We only take care to keep data and metadata in sync when there
are security implications (like exposing uninitialized disk blocks);
otherwise we are as lazy as we can be, to improve performance...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-22 Thread Jeff Layton
On Wed, 2017-03-22 at 08:45 +1100, Dave Chinner wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - It's durable; the above comparison still works if there were reboots
> > >   between the two i_version checks.
> > >   - I don't know how realistic this is--we may need to figure out
> > > if there's a weaker guarantee that's still useful.  Do
> > > filesystems actually make ctime/mtime/i_version changes
> > > atomically with the changes that caused them?  What if a
> > > change attribute is exposed to an NFS client but doesn't make
> > > it to disk, and then that value is reused after reboot?
> > > 
> > 
> > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > the inode dirty and I think that will end up with the new i_version at
> > least being journalled before __mark_inode_dirty returns.
> 
> The change may be journalled, but it isn't guaranteed stable until
> fsync is run on the inode.
> 
> NFS server operations commit the metadata changed by a modification
> through ->commit_metadata or sync_inode_metadata() before the
> response is sent back to the client, hence guaranteeing that
> i_version changes through the NFS server are stable and durable.
> 
> This is not the case for normal operations done through the POSIX
> API - the journalling is asynchronous and the only durability
> guarantees are provided by fsync().
> 

Ahh ok, I missed that...thanks.

I think we'll have a hard time making this fully atomic. We may end up
having to settle for something less (and doing our best to warn users
of that possibility).

One idea might be to tie the behavior to the statx AT_STATX_FORCE_SYNC and
AT_STATX_DONT_SYNC flags. In the AT_STATX_DONT_SYNC case, allow the kernel
to hand out the i_version without syncing it to disk. In the
AT_STATX_FORCE_SYNC case, do an fsync internally before returning.
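
A hand-wavy sketch of that idea (not from any actual patch; the
prototype matches the statx-era ->getattr, but whether and how
i_version itself gets reported is left out):

static int example_getattr(const struct path *path, struct kstat *stat,
                           u32 request_mask, unsigned int query_flags)
{
        struct inode *inode = d_inode(path->dentry);

        if ((query_flags & AT_STATX_SYNC_TYPE) == AT_STATX_FORCE_SYNC) {
                /* Caller asked for durable attributes: flush the inode's
                 * metadata, and with it any pending i_version bump,
                 * before reporting anything. */
                int err = sync_inode_metadata(inode, 1);

                if (err)
                        return err;
        }

        generic_fillattr(inode, stat);
        return 0;
}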

> > That said, I suppose it is possible for us to bump the counter, hand
> > that new counter value out to a NFS client and then the box crashes
> > before it makes it to the journal.
> 
> Yup, this has always been a problem when you mix POSIX applications
> running on the NFS server modifying the same files as the NFS
> clients are accessing and requiring synchronisation.
> 
> > Not sure how big a problem that really is.
> 
> This coherency problem has always existed on the server side...
> 

Yes. I don't think this patchset makes anything worse in this regard.
We will need well-defined semantics here before i_version can be
exposed to userland via statx, however.
-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Jeff Layton
On Tue, 2017-03-21 at 15:13 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > > - It's durable; the above comparison still works if there were reboots
> > > > >   between the two i_version checks.
> > > > >   - I don't know how realistic this is--we may need to figure out
> > > > > if there's a weaker guarantee that's still useful.  Do
> > > > > filesystems actually make ctime/mtime/i_version changes
> > > > > atomically with the changes that caused them?  What if a
> > > > > change attribute is exposed to an NFS client but doesn't make
> > > > > it to disk, and then that value is reused after reboot?
> > > > > 
> > > > 
> > > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > > the inode dirty and I think that will end up with the new i_version at
> > > > least being journalled before __mark_inode_dirty returns.
> > > 
> > > So you think the filesystem can provide the atomicity?  In more detail:
> > > 
> > 
> > Sorry, I hit send too quickly. That should have read:
> > 
> > "Yeah, there could be atomicity issues there."
> > 
> > I think providing that level of atomicity may be difficult, though
> > maybe there's some way to make the querying of i_version block until
> > the inode update has been journalled?
> 
> No idea.  Anyway, I'd like to figure out some reasonable requirement
> that we can document.
> 
> > 
> > >   - if I write to a file, a simultaneous reader should see either
> > > (old data, old i_version) or (new data, new i_version), not a
> > > combination of the two.
> > >   - ditto for metadata modifications.
> > >   - the same should be true if there's a crash.
> > > 
> > > (If that's not possible, then I think we could live with a brief window
> > > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > > 
> > > > That said, I suppose it is possible for us to bump the counter, hand
> > > > that new counter value out to a NFS client and then the box crashes
> > > > before it makes it to the journal.
> > > > 
> > > > Not sure how big a problem that really is.
> > > 
> > > The other case I was wondering about may have been unclear.  Represent
> > > the state of a file by a (data, i_version) pair.  Say:
> > > 
> > >   - file is modified from (F, V) to (F', V+1).
> > >   - client reads and caches (F', V+1).
> > >   - server crashes before writeback, so disk still has (F, V).
> > >   - after restart, someone else modifies file to (F'', V+1).
> > >   - original client revalidates its cache, sees V+1, concludes
> > > file data is still F'.
> > > 
> > > This may not cause a real problem for clients depending only on
> > > traditional NFS close-to-open semantics.
> > > 
> > > 
> > 
> > No, I think that is a legitimate problem.
> > 
> > That said, after F'', the mtime would almost certainly be different
> > from the time after F', and that would likely be enough to prevent
> > confusion in NFS clients.
> 
> Oh, good point.  So, it may be worth saying that anyone wanting to make
> sense of these across a reboot should compare times as well (maybe that
> should be in the NFS RFCs too).  I think that should be ctime, not mtime,
> though?
> 

Yes, it might be worth a mention there. IIRC, the spec does mention that you
shouldn't just look at a single attribute for cache validation
purposes, but the wording is a bit vague. I can't find the section at
the moment though.

The more I think about it though, simply ensuring that we don't publish
a new change attr until the inode update has hit the journal may be
the best we can do. I'd have to think about how to implement that
though.
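
One conceivable shape for that, purely illustrative (no such fields
exist in any tree, and this ignores how pre/post operation attributes
would stay coherent):

struct example_iversion {
        u64 raw;        /* bumped with every change, journalled with it */
        u64 published;  /* last value known to have hit the journal     */
};

/* Called from the filesystem's transaction-commit callback. */
static void example_iversion_commit(struct example_iversion *v, u64 committed)
{
        WRITE_ONCE(v->published, committed);
}

/* What ->getattr and nfsd would hand out. */
static u64 example_iversion_query(const struct example_iversion *v)
{
        return READ_ONCE(v->published);
}
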
-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Dave Chinner
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful.  Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them?  What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
> > 
> 
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

The change may be journalled, but it isn't guaranteed stable until
fsync is run on the inode.

NFS server operations commit the metadata changed by a modification
through ->commit_metadata or sync_inode_metadata() before the
response is sent back to the client, hence guaranteeing that
i_version changes through the NFS server are stable and durable.
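
For reference, that commit path looks roughly like this (paraphrased
from nfsd's commit_metadata(); treat the details as approximate):

int commit_metadata(struct svc_fh *fhp)
{
        struct inode *inode = d_inode(fhp->fh_dentry);
        const struct export_operations *export_ops = inode->i_sb->s_export_op;

        if (!EX_ISSYNC(fhp->fh_export))
                return 0;

        /* Filesystem-specific journal flush if available, otherwise a
         * generic synchronous writeback of the inode's metadata. */
        if (export_ops->commit_metadata)
                return export_ops->commit_metadata(inode);
        return sync_inode_metadata(inode, 1);
}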

This is not the case for normal operations done through the POSIX
API - the journalling is asynchronous and the only durability
guarantees are provided by fsync().

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.

Yup, this has always been a problem when you mix POSIX applications
running on the NFS server modifying the same files as the NFS
clients are accessing and requiring synchronisation.

> Not sure how big a problem that really is.

This coherency problem has always existed on the server side...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 02:46:53PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> > On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > > - It's durable; the above comparison still works if there were reboots
> > > >   between the two i_version checks.
> > > > - I don't know how realistic this is--we may need to figure out
> > > >   if there's a weaker guarantee that's still useful.  Do
> > > >   filesystems actually make ctime/mtime/i_version changes
> > > >   atomically with the changes that caused them?  What if a
> > > >   change attribute is exposed to an NFS client but doesn't make
> > > >   it to disk, and then that value is reused after reboot?
> > > > 
> > > 
> > > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > > the inode dirty and I think that will end up with the new i_version at
> > > least being journalled before __mark_inode_dirty returns.
> > 
> > So you think the filesystem can provide the atomicity?  In more detail:
> > 
> 
> Sorry, I hit send too quickly. That should have read:
> 
> "Yeah, there could be atomicity issues there."
> 
> I think providing that level of atomicity may be difficult, though
> maybe there's some way to make the querying of i_version block until
> the inode update has been journalled?

No idea.  Anyway, I'd like to figure out some reasonable requirement
that we can document.

> 
> > - if I write to a file, a simultaneous reader should see either
> >   (old data, old i_version) or (new data, new i_version), not a
> >   combination of the two.
> > - ditto for metadata modifications.
> > - the same should be true if there's a crash.
> > 
> > (If that's not possible, then I think we could live with a brief window
> > of (new data, old i_version) as long as it doesn't persist beyond sync?)
> > 
> > > That said, I suppose it is possible for us to bump the counter, hand
> > > that new counter value out to a NFS client and then the box crashes
> > > before it makes it to the journal.
> > > 
> > > Not sure how big a problem that really is.
> > 
> > The other case I was wondering about may have been unclear.  Represent
> > the state of a file by a (data, i_version) pair.  Say:
> > 
> > - file is modified from (F, V) to (F', V+1).
> > - client reads and caches (F', V+1).
> > - server crashes before writeback, so disk still has (F, V).
> > - after restart, someone else modifies file to (F'', V+1).
> > - original client revalidates its cache, sees V+1, concludes
> >   file data is still F'.
> > 
> > This may not cause a real problem for clients depending only on
> > traditional NFS close-to-open semantics.
> > 
> > 
> 
> No, I think that is a legitimate problem.
> 
> That said, after F'', the mtime would almost certainly be different
> from the time after F', and that would likely be enough to prevent
> confusion in NFS clients.

Oh, good point.  So, it may be worth saying that anyone wanting to make
sense of these across a reboot should compare times as well (maybe that
should be in the NFS RFCs too).  I think that should be ctime, not mtime,
though?
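
In client terms, something like this, with the types and names
invented for illustration:

#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Treat cached data as valid across a server reboot only if both the
 * change attribute and the ctime still match. */
struct cached_meta {
        uint64_t        change_attr;
        struct timespec ctime;
};

static bool cache_still_valid(const struct cached_meta *cached,
                              const struct cached_meta *server)
{
        return server->change_attr == cached->change_attr &&
               server->ctime.tv_sec  == cached->ctime.tv_sec &&
               server->ctime.tv_nsec == cached->ctime.tv_nsec;
}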

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - It's durable; the above comparison still works if there were reboots
> >   between the two i_version checks.
> > - I don't know how realistic this is--we may need to figure out
> >   if there's a weaker guarantee that's still useful.  Do
> >   filesystems actually make ctime/mtime/i_version changes
> >   atomically with the changes that caused them?  What if a
> >   change attribute is exposed to an NFS client but doesn't make
> >   it to disk, and then that value is reused after reboot?
> > 
> 
> Yeah, there could be atomicity there. If we bump i_version, we'll mark
> the inode dirty and I think that will end up with the new i_version at
> least being journalled before __mark_inode_dirty returns.

So you think the filesystem can provide the atomicity?  In more detail:

- if I write to a file, a simultaneous reader should see either
  (old data, old i_version) or (new data, new i_version), not a
  combination of the two.
- ditto for metadata modifications.
- the same should be true if there's a crash.

(If that's not possible, then I think we could live with a brief window
of (new data, old i_version) as long as it doesn't persist beyond sync?)

> That said, I suppose it is possible for us to bump the counter, hand
> that new counter value out to a NFS client and then the box crashes
> before it makes it to the journal.
> 
> Not sure how big a problem that really is.

The other case I was wondering about may have been unclear.  Represent
the state of a file by a (data, i_version) pair.  Say:

- file is modified from (F, V) to (F', V+1).
- client reads and caches (F', V+1).
- server crashes before writeback, so disk still has (F, V).
- after restart, someone else modifies file to (F'', V+1).
- original client revalidates its cache, sees V+1, concludes
  file data is still F'.

This may not cause a real problem for clients depending only on
traditional NFS close-to-open semantics.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Jeff Layton
On Tue, 2017-03-21 at 14:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > > - It's durable; the above comparison still works if there were reboots
> > >   between the two i_version checks.
> > >   - I don't know how realistic this is--we may need to figure out
> > > if there's a weaker guarantee that's still useful.  Do
> > > filesystems actually make ctime/mtime/i_version changes
> > > atomically with the changes that caused them?  What if a
> > > change attribute is exposed to an NFS client but doesn't make
> > > it to disk, and then that value is reused after reboot?
> > > 
> > 
> > Yeah, there could be atomicity there. If we bump i_version, we'll mark
> > the inode dirty and I think that will end up with the new i_version at
> > least being journalled before __mark_inode_dirty returns.
> 
> So you think the filesystem can provide the atomicity?  In more detail:
> 

Sorry, I hit send too quickly. That should have read:

"Yeah, there could be atomicity issues there."

I think providing that level of atomicity may be difficult, though
maybe there's some way to make the querying of i_version block until
the inode update has been journalled?

>   - if I write to a file, a simultaneous reader should see either
> (old data, old i_version) or (new data, new i_version), not a
> combination of the two.
>   - ditto for metadata modifications.
>   - the same should be true if there's a crash.
> 
> (If that's not possible, then I think we could live with a brief window
> of (new data, old i_version) as long as it doesn't persist beyond sync?)
> 
> > That said, I suppose it is possible for us to bump the counter, hand
> > that new counter value out to a NFS client and then the box crashes
> > before it makes it to the journal.
> > 
> > Not sure how big a problem that really is.
> 
> The other case I was wondering about may have been unclear.  Represent
> the state of a file by a (data, i_version) pair.  Say:
> 
>   - file is modified from (F, V) to (F', V+1).
>   - client reads and caches (F', V+1).
>   - server crashes before writeback, so disk still has (F, V).
>   - after restart, someone else modifies file to (F'', V+1).
>   - original client revalidates its cache, sees V+1, concludes
> file data is still F'.
> 
> This may not cause a real problem for clients depending only on
> traditional NFS close-to-open semantics.
> 
> 

No, I think that is a legitimate problem.

That said, after F'', the mtime would almost certainly be different
from the time after F', and that would likely be enough to prevent
confusion in NFS clients.

Applications that are just looking at i_version via statx() could get
confused though.

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:37:04PM -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> > On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > >   - NFS doesn't actually require that it increases, but I think it
> > > should.  I assume 64 bits means we don't need a discussion of
> > > wraparound.
> > 
> > I thought the NFS spec required that you be able to recognize old change
> > attributes so that they can be discarded. I could be wrong here though.
> > I'd have to go back and look through the spec to be sure.
> 
> https://tools.ietf.org/html/rfc7862#section-10

So, I'm suggesting we implement this one:

NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR:  The change attribute value
  MUST monotonically increase for every atomic change to the file
  attributes, data, or directory contents.

It may be a slight lie--after your patches we wouldn't actually increase
"for every atomic change".  I think that's OK.

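In client terms, that guarantee would buy a check as simple as this
(illustrative):

#include <stdbool.h>
#include <stdint.h>

/* With NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR, two samples of the change
 * attribute are ordered: strictly greater means the file changed,
 * equal means it did not. */
static bool file_changed_since(uint64_t cached, uint64_t current)
{
        return current > cached;
}
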
--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 01:23:24PM -0400, Jeff Layton wrote:
> On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> > - NFS doesn't actually require that it increases, but I think it
> >   should.  I assume 64 bits means we don't need a discussion of
> >   wraparound.
> 
> I thought the NFS spec required that you be able to recognize old change
> attributes so that they can be discarded. I could be wrong here though.
> I'd have to go back and look through the spec to be sure.

https://tools.ietf.org/html/rfc7862#section-10

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Jeff Layton
On Tue, 2017-03-21 at 12:30 -0400, J. Bruce Fields wrote:
> On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> > On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > > To me, the interesting question is whether this allows us to turn on
> > > i_version updates by default on xfs and ext4.
> > 
> > XFS v5 file systems have it on by default.
> 
> Great, thanks.
> 
> > Although we'll still need to agree on the exact semantics of i_version
> > before it's going to be useful.
> 
> Once it's figured out maybe we should write it up for a manpage that
> could be used if statx starts exposing it to userspace.
> 
> A first attempt:
> 
> - It's a u64.
> 
> - It works for regular files and directories.  (What about symlinks or
>   other special types?)
> 
> - It changes between two checks if and only if there were intervening
>   data or metadata changes.  The change will always be an increase, but
>   the amount of the increase is meaningless.
>   - NFS doesn't actually require that it increases, but I think it
> should.  I assume 64 bits means we don't need a discussion of
> wraparound.

I thought the NFS spec required that you be able to recognize old change
attributes so that they can be discarded. I could be wrong here though.
I'd have to go back and look through the spec to be sure.

>   - AFS wants an actual counter: if you get i_version X, then
> write twice, then get i_version X+2, you're allowed to assume
> your writes were the only modifications.  Let's ignore this
> for now.  In the future if someone explains how to count
> operations, then we can extend the interface to tell the
> caller it can get those extra semantics.
> 
> - It's durable; the above comparison still works if there were reboots
>   between the two i_version checks.
>   - I don't know how realistic this is--we may need to figure out
> if there's a weaker guarantee that's still useful.  Do
> filesystems actually make ctime/mtime/i_version changes
> atomically with the changes that caused them?  What if a
> change attribute is exposed to an NFS client but doesn't make
> it to disk, and then that value is reused after reboot?
> 

Yeah, there could be some atomicity there. If we bump i_version, we'll
mark the inode dirty, and I think that will end up with the new
i_version at least being journalled before __mark_inode_dirty returns.

That said, I suppose it is possible for us to bump the counter, hand
that new counter value out to a NFS client and then the box crashes
before it makes it to the journal.

Not sure how big a problem that really is.

> Am I missing any issues?
> 

No, I think you have it covered, and that's pretty much exactly what I
had in mind as far as semantics go. Thanks for writing it up!

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread J. Bruce Fields
On Tue, Mar 21, 2017 at 06:45:00AM -0700, Christoph Hellwig wrote:
> On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> > To me, the interesting question is whether this allows us to turn on
> > i_version updates by default on xfs and ext4.
> 
> XFS v5 file systems have it on by default.

Great, thanks.

> Although we'll still need to agree on the exact semantics of i_version
> before it's going to be useful.

Once it's figured out maybe we should write it up for a manpage that
could be used if statx starts exposing it to userspace.

A first attempt:

- It's a u64.

- It works for regular files and directories.  (What about symlinks or
  other special types?)

- It changes between two checks if and only if there were intervening
  data or metadata changes.  The change will always be an increase, but
  the amount of the increase is meaningless.
- NFS doesn't actually require that it increases, but I think it
  should.  I assume 64 bits means we don't need a discussion of
  wraparound.
- AFS wants an actual counter: if you get i_version X, then
  write twice, then get i_version X+2, you're allowed to assume
  your writes were the only modifications.  Let's ignore this
  for now.  In the future if someone explains how to count
  operations, then we can extend the interface to tell the
  caller it can get those extra semantics.  (A sketch of this
  check follows the list.)

- It's durable; the above comparison still works if there were reboots
  between the two i_version checks.
- I don't know how realistic this is--we may need to figure out
  if there's a weaker guarantee that's still useful.  Do
  filesystems actually make ctime/mtime/i_version changes
  atomically with the changes that caused them?  What if a
  change attribute is exposed to an NFS client but doesn't make
  it to disk, and then that value is reused after reboot?
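
A minimal sketch of the AFS-style check mentioned in the counter item
above, with invented stand-in helpers; the point is just that it needs
a strict one-bump-per-operation counter, which the semantics above do
not promise:

8<--
#include <stdint.h>

static uint64_t version;        /* stand-in for the file's counter */

static uint64_t get_version(void) { return version; }
static void write_once(void) { version++; }     /* one op, one bump */

int main(void)
{
        uint64_t before = get_version();

        write_once();
        write_once();

        /*
         * Only a strict per-operation counter lets X+2 prove that our
         * two writes were the only modifications in between.
         */
        return get_version() == before + 2 ? 0 : 1;
}
8<--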

Am I missing any issues?

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-21 Thread Christoph Hellwig
On Mon, Mar 20, 2017 at 05:43:27PM -0400, J. Bruce Fields wrote:
> To me, the interesting question is whether this allows us to turn on
> i_version updates by default on xfs and ext4.

XFS v5 file systems have it on by default.  Although we'll still need
to agree on the exact semantics of i_version before it's going to be
useful.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-20 Thread J. Bruce Fields
On Thu, Dec 22, 2016 at 09:42:04AM -0500, Jeff Layton wrote:
> On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > > 
> > > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > > this, these filesystems must log the inode to disk whenever the
> > > i_version counter changes. That has a non-zero performance impact,
> > > especially on write-heavy workloads, because we end up dirtying the
> > > inode metadata on every write, not just when the times change. [1]
> > 
> > Do you have numbers to justify these changes?
> 
> I have numbers. As to whether they justify the changes, I'm not sure.
> This helps a lot on a (admittedly nonsensical) 1-byte write workload. On
> XFS, with this fio jobfile:

To me, the interesting question is whether this allows us to turn on
i_version updates by default on xfs and ext4.

When Josef looked at doing that previously he withdrew the patch due to
performance regressions.  I think the most useful thread started here:


http://lkml.kernel.org/r/1337092396-3272-1-git-send-email-jo...@redhat.com

Skimming quickly, I think the regression was also in the small-write
case.  So apparently that was thought to reveal a real problem?

So if you've mostly eliminated that regression, then that's good
motivation for your patches.  (Though I think in addition to comparing
the patched and unpatched i_version case, we need to compare to the
unpatched not-i_version case.  I'm not clear whether you did that.)

--b.

> 
> 8<--
> [global]
> direct=0
> size=2g
> filesize=512m
> bsrange=1-1
> timeout=60
> numjobs=1
> directory=/mnt/scratch
> 
> [f1]
> filename=randwrite
> rw=randwrite
> 8<--
> 
> Unpatched kernel:
>   WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, 
> mint=6msec, maxt=6msec
> 
> Patched kernel:
>   WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, 
> mint=6msec, maxt=6msec
> 
> So quite a difference there and it's pretty consistent across runs. If I
> change the jobfile to have "direct=1" and "bsrange=4k-4k", then any
> variation between the two doesn't seem to be significant (numbers vary
> as much between runs on the same kernels and are roughly the same).
> 
> Playing with buffered I/O sizes between 1 byte and 4k shows that as the
> I/O sizes get larger, this makes less difference (which is what I'd
> expect).
> 
> Previous testing with ext4 shows roughly the same results. btrfs shows
> some benefit here but significantly less than with ext4 or xfs. Not sure
> why that is yet -- maybe CoW effects?
> 
> That said, I don't have a great test rig for this. I'm using VMs with a
> dedicated LVM volume that's on a random SSD I had laying around. It
> could use testing on a wider set of configurations and workloads.
> 
> I was also hoping that others may have workloads that they think might
> be (postively or negatively) affected by these changes. If you can think
> of any in particular, then I'm interested to hear about them.
> 
> -- 
> Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-08 Thread J. Bruce Fields
On Fri, Mar 03, 2017 at 07:53:57PM -0500, Jeff Layton wrote:
> On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> > On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > > counter, by exploiting the fact that we don't need to increment it
> > > if no one is looking at it. We can also clean up the code to prepare
> > > to eventually expose this value via statx().
> > > 
> > > The inode->i_version field is supposed to be a value that changes
> > > whenever there is any data or metadata change to the inode. Some
> > > filesystems use it internally to detect directory changes during
> > > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > > set. IMA will also use it (though it's not clear to me that that
> > > works 100% -- no check for MS_I_VERSION there).
> > 
> > I'm a little confused about the internal uses for directories.  Is it
> > necessarily kept on disk?
> 
> No, they aren't necessarily stored on disk, and in fact they aren't on
> most (maybe all?) of those filesystems. It's just a convenient place to
> store a dir change value that is subsequently checked in readdir
> operations.
> 
> That's part of the "fun" of untangling this mess. ;)
> 
> > In cases it's not, then there's not the same
> > performance issue, right? 
> 
> Right, I don't think it's really a big deal for most of those and I'm
> not terribly concerned about the i_version counter on directory change
> operations. An extra atomic op on a directory change seems unlikely to
> hurt anything.
> 
> The main purpose of this is to improve the situation with small writes.
> This should also help pave the way for filesystems like NFS and Ceph that
> implement a version counter but may not necessarily bump it on every
> change.
> 
> I think once we have things more consistent, we'll be able to consider
> exposing the i_version counter to userland via statx.
> 
> > Is there any risk these patches make
> > performance slightly worse in that case?
> > 
> 
> Maybe, but I think that risk is pretty low. Note that I haven't measured
> that specifically here, so I could be completely wrong.
> 
> If it is a problem, we could consider unioning this thing with a non-
> atomic field for the directory change cases, but that would add some
> complexity and I'm not sure it's worth it. I'd want to measure it first.

Makes sense to me.  I agree that it's unlikely to be a problem, just
wanted to make sure I understood.

Anyway, I'm a great fan of the basic idea, I hope we can get this done.

--b.


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-03 Thread Jeff Layton
On Fri, 2017-03-03 at 18:00 -0500, J. Bruce Fields wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > tl;dr: I think we can greatly reduce the cost of the inode->i_version
> > counter, by exploiting the fact that we don't need to increment it
> > if no one is looking at it. We can also clean up the code to prepare
> > to eventually expose this value via statx().
> > 
> > The inode->i_version field is supposed to be a value that changes
> > whenever there is any data or metadata change to the inode. Some
> > filesystems use it internally to detect directory changes during
> > readdir. knfsd will use it if the filesystem has MS_I_VERSION
> > set. IMA will also use it (though it's not clear to me that that
> > works 100% -- no check for MS_I_VERSION there).
> 
> I'm a little confused about the internal uses for directories.  Is it
> necessarily kept on disk?

No, they aren't necessarily stored on disk, and in fact they aren't on
most (maybe all?) of those filesystems. It's just a convenient place to
store a dir change value that is subsequently checked in readdir
operations.
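
The in-memory pattern being described looks roughly like this
simplified userspace model; real filesystems keep the equivalent logic
in their readdir implementations, with the sampled value stored in the
open file's f_version:

8<--
#include <stdint.h>
#include <stdio.h>

/* Simplified model of the in-kernel pattern, not real fs code. */
struct inode { uint64_t i_version; };
struct file { struct inode *f_inode; uint64_t f_version; };

static void dir_modified(struct inode *dir)
{
        dir->i_version++;       /* bumped by create/unlink/rename */
}

static void do_readdir(struct file *f)
{
        if (f->f_version != f->f_inode->i_version) {
                /*
                 * The directory changed since our last pass, so the
                 * saved position may be stale; revalidate it before
                 * emitting more entries.
                 */
                printf("dir changed, revalidating position\n");
                f->f_version = f->f_inode->i_version;
        }
        /* ... emit entries from the (possibly revalidated) position ... */
}

int main(void)
{
        struct inode dir = { .i_version = 1 };
        struct file f = { .f_inode = &dir, .f_version = dir.i_version };

        do_readdir(&f);         /* unchanged: fast path */
        dir_modified(&dir);
        do_readdir(&f);         /* detects the change */
        return 0;
}
8<--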

That's part of the "fun" of untangling this mess. ;)

> In cases it's not, then there's not the same
> performance issue, right? 

Right, I don't think it's really a big deal for most of those and I'm
not terribly concerned about the i_version counter on directory change
operations. An extra atomic op on a directory change seems unlikely to
hurt anything.

The main purpose of this is to improve the situation with small writes.
This should also help pave the way for filesystems like NFS and Ceph that
implement a version counter but may not necessarily bump it on every
change.

I think once we have things more consistent, we'll be able to consider
exposing the i_version counter to userland via statx.

> Is there any risk these patches make
> performance slightly worse in that case?
> 

Maybe, but I think that risk is pretty low. Note that I haven't measured
that specifically here, so I could be completely wrong.

If it is a problem, we could consider unioning this thing with a non-
atomic field for the directory change cases, but that would add some
complexity and I'm not sure it's worth it. I'd want to measure it first.
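
For what it's worth, a rough sketch of that union idea, purely as
illustration and not from the patchset; C11 atomics stand in for the
kernel's atomic64_t:

8<--
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch of the idea only: in-memory-only users (directory change
 * detection) bump a plain field, while filesystems that persist
 * i_version use the atomic one.
 */
union iversion {
        uint64_t dir_cookie;            /* in-memory-only users */
        _Atomic uint64_t counter;       /* persistent users */
};

int main(void)
{
        union iversion a = { .dir_cookie = 0 };
        union iversion b;

        a.dir_cookie++;                         /* cheap, non-atomic */
        atomic_init(&b.counter, 0);
        atomic_fetch_add(&b.counter, 1);        /* atomic bump */
        return 0;
}
8<--

Each inode would stick to one discipline, chosen by the filesystem, so
the cheap bump never races with the atomic one.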

> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes.
> 
> On those filesystems that's done for both files and directories, right?
> 

Yes.

> > That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
> > 
> > It turns out though that none of these users of i_version require that
> > i_version change on every change to the file. The only real requirement
> > is that it be different if _something_ changed since the last time we
> > queried for it. [2]
> > 
> > So, if we simply keep track of when something queries the value, we
> > can avoid bumping the counter and that metadata update.
> > 
> > This patchset implements this:
> > 
> > It starts with some small cleanup patches to just remove any mention of
> > the i_version counter in filesystems that don't actually use it.
> > 
> > Then, we add a new set of static inlines for managing the counter. The
> > initial version should work identically to what we have now. Then, all
> > of the remaining filesystems that use i_version are converted to the new
> > inlines.
> > 
> > Once that's in place, we switch to a new implementation that allows us
> > to track readers of i_version counter, and only bump it when it's
> > necessary or convenient (i.e. we're going to disk anyway).
> > 
> > The final patch switches from a scheme that uses the i_lock to serialize
> > the counter updates during write to an atomic64_t. That's a wash
> > performance-wise in my testing, but I like not having to take the i_lock
> > down where it could end up nested inside other locks.
> > 
> > With this, we reduce inode metadata updates across all 3 filesystems
> > down to roughly the frequency of the timestamp granularity, particularly
> > when it's not being queried (the vastly common case).
> > 
> > The pessimal workload here is 1 byte writes, and it helps that
> > significantly. Of course, that's not a real-world workload.
> > 
> > A tiobench-example.fio workload also shows some modest performance
> > gains, and I've gotten mails from the kernel test robot that show some
> > significant performance gains on some microbenchmarks (case-msync-mt in
> > the vm-scalability testsuite to be specific).
> > 
> > I'm happy to run other workloads if anyone can suggest them.
> > 
> > At this point, the patchset works and does what it's expected to do in
> > my own testing. It seems like it's at least a modest performance win
> > across all 3 major disk-based filesystems. It may also encourage others
> > to implement i_version as well since it reduces that 

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2017-03-03 Thread J. Bruce Fields
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> tl;dr: I think we can greatly reduce the cost of the inode->i_version
> counter, by exploiting the fact that we don't need to increment it
> if no one is looking at it. We can also clean up the code to prepare
> to eventually expose this value via statx().
> 
> The inode->i_version field is supposed to be a value that changes
> whenever there is any data or metadata change to the inode. Some
> filesystems use it internally to detect directory changes during
> readdir. knfsd will use it if the filesystem has MS_I_VERSION
> set. IMA will also use it (though it's not clear to me that that
> works 100% -- no check for MS_I_VERSION there).

I'm a little confused about the internal uses for directories.  Is it
necessarily kept on disk?  In cases it's not, then there's not the same
performance issue, right?  Is there any risk these patches make
performance slightly worse in that case?

> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes.

On those filesystems that's done for both files and directories, right?

--b.

> That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]
> 
> It turns out though that none of these users of i_version require that
> i_version change on every change to the file. The only real requirement
> is that it be different if _something_ changed since the last time we
> queried for it. [2]
> 
> So, if we simply keep track of when something queries the value, we
> can avoid bumping the counter and that metadata update.
> 
> This patchset implements this:
> 
> It starts with some small cleanup patches to just remove any mention of
> the i_version counter in filesystems that don't actually use it.
> 
> Then, we add a new set of static inlines for managing the counter. The
> initial version should work identically to what we have now. Then, all
> of the remaining filesystems that use i_version are converted to the new
> inlines.
> 
> Once that's in place, we switch to a new implementation that allows us
> to track readers of i_version counter, and only bump it when it's
> necessary or convenient (i.e. we're going to disk anyway).
> 
> The final patch switches from a scheme that uses the i_lock to serialize
> the counter updates during write to an atomic64_t. That's a wash
> performance-wise in my testing, but I like not having to take the i_lock
> down where it could end up nested inside other locks.
> 
> With this, we reduce inode metadata updates across all 3 filesystems
> down to roughly the frequency of the timestamp granularity, particularly
> when it's not being queried (the vastly common case).
> 
> The pessimal workload here is 1 byte writes, and it helps that
> significantly. Of course, that's not a real-world workload.
> 
> A tiobench-example.fio workload also shows some modest performance
> gains, and I've gotten mails from the kernel test robot that show some
> significant performance gains on some microbenchmarks (case-msync-mt in
> the vm-scalability testsuite to be specific).
> 
> I'm happy to run other workloads if anyone can suggest them.
> 
> At this point, the patchset works and does what it's expected to do in
> my own testing. It seems like it's at least a modest performance win
> across all 3 major disk-based filesystems. It may also encourage others
> to implement i_version as well since it reduces that cost.
> 
> Is this an avenue that's worthwhile to pursue?
> 
> Note that I think we may have other changes coming in the future that
> will make this sort of cleanup necessary anyway. I'd like to plug in the
> Ceph change attribute here eventually, and that will require something
> like this anyway.
> 
> Thoughts, comments and suggestions are welcome...
> 
> ---
> 
> [1]: On ext4 it must be turned on with the i_version mount option,
>  mostly due to fears of incurring this impact, AFAICT.
> 
> [2]: NFS also recommends that it appear to increase in value over time, so
>  that clients can discard metadata updates that are older than ones
>  they've already seen.
> 
> Jeff Layton (30):
>   lustre: don't set f_version in ll_readdir
>   ecryptfs: remove unnecessary i_version bump
>   ceph: remove the bump of i_version
>   f2fs: don't bother setting i_version
>   hpfs: don't bother with the i_version counter
>   jfs: remove initialization of i_version counter
>   nilfs2: remove inode->i_version initialization
>   orangefs: remove initialization of i_version
>   reiserfs: remove unneeded i_version bump
>   ntfs: remove i_version handling
>   fs: new API for handling i_version
>   fat: convert to new i_version API
>   affs: convert to new i_version API
>   afs: convert to new i_version API
>   btrfs: convert to new i_version API
>   exofs: switch to new i_version API
>   

Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2016-12-22 Thread Jeff Layton
On Thu, 2016-12-22 at 00:45 -0800, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> > 
> > Only btrfs, ext4, and xfs implement it for data changes. Because of
> > this, these filesystems must log the inode to disk whenever the
> > i_version counter changes. That has a non-zero performance impact,
> > especially on write-heavy workloads, because we end up dirtying the
> > inode metadata on every write, not just when the times change. [1]
> 
> Do you have numbers to justify these changes?

I have numbers. As to whether they justify the changes, I'm not sure.
This helps a lot on a (admittedly nonsensical) 1-byte write workload. On
XFS, with this fio jobfile:

8<--
[global]
direct=0
size=2g
filesize=512m
bsrange=1-1
timeout=60
numjobs=1
directory=/mnt/scratch

[f1]
filename=randwrite
rw=randwrite
8<--

Unpatched kernel:
  WRITE: io=7707KB, aggrb=128KB/s, minb=128KB/s, maxb=128KB/s, mint=6msec, 
maxt=6msec

Patched kernel:
  WRITE: io=12701KB, aggrb=211KB/s, minb=211KB/s, maxb=211KB/s, mint=6msec, 
maxt=6msec

So quite a difference there and it's pretty consistent across runs. If I
change the jobfile to have "direct=1" and "bsrange=4k-4k", then any
variation between the two doesn't seem to be significant (numbers vary
as much between runs on the same kernels and are roughly the same).

Playing with buffered I/O sizes between 1 byte and 4k shows that as the
I/O sizes get larger, this makes less difference (which is what I'd
expect).

Previous testing with ext4 shows roughly the same results. btrfs shows
some benefit here but significantly less than with ext4 or xfs. Not sure
why that is yet -- maybe CoW effects?

That said, I don't have a great test rig for this. I'm using VMs with a
dedicated LVM volume that's on a random SSD I had laying around. It
could use testing on a wider set of configurations and workloads.

I was also hoping that others may have workloads that they think might
be (positively or negatively) affected by these changes. If you can think
of any in particular, then I'm interested to hear about them.

-- 
Jeff Layton 


Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2016-12-22 Thread Christoph Hellwig
On Wed, Dec 21, 2016 at 12:03:17PM -0500, Jeff Layton wrote:
> Only btrfs, ext4, and xfs implement it for data changes. Because of
> this, these filesystems must log the inode to disk whenever the
> i_version counter changes. That has a non-zero performance impact,
> especially on write-heavy workloads, because we end up dirtying the
> inode metadata on every write, not just when the times change. [1]

Do you have numbers to justify these changes?


[RFC PATCH v1 00/30] fs: inode->i_version rework and optimization

2016-12-21 Thread Jeff Layton
tl;dr: I think we can greatly reduce the cost of the inode->i_version
counter, by exploiting the fact that we don't need to increment it
if no one is looking at it. We can also clean up the code to prepare
to eventually expose this value via statx().

The inode->i_version field is supposed to be a value that changes
whenever there is any data or metadata change to the inode. Some
filesystems use it internally to detect directory changes during
readdir. knfsd will use it if the filesystem has MS_I_VERSION
set. IMA will also use it (though it's not clear to me that that
works 100% -- no check for MS_I_VERSION there).

Only btrfs, ext4, and xfs implement it for data changes. Because of
this, these filesystems must log the inode to disk whenever the
i_version counter changes. That has a non-zero performance impact,
especially on write-heavy workloads, because we end up dirtying the
inode metadata on every write, not just when the times change. [1]

It turns out though that none of these users of i_version require that
i_version change on every change to the file. The only real requirement
is that it be different if _something_ changed since the last time we
queried for it. [2]

So, if we simply keep track of when something queries the value, we
can avoid bumping the counter and that metadata update.

This patchset implements this:

It starts with some small cleanup patches to just remove any mention of
the i_version counter in filesystems that don't actually use it.

Then, we add a new set of static inlines for managing the counter. The
initial version should work identically to what we have now. Then, all
of the remaining filesystems that use i_version are converted to the new
inlines.

Once that's in place, we switch to a new implementation that allows us
to track readers of i_version counter, and only bump it when it's
necessary or convenient (i.e. we're going to disk anyway).

The final patch switches from a scheme that uses the i_lock to serialize
the counter updates during write to an atomic64_t. That's a wash
performance-wise in my testing, but I like not having to take the i_lock
down where it could end up nested inside other locks.
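
One plausible shape for that scheme, sketched with C11 atomics purely
for illustration (the actual patches may well differ): reserve the low
bit of the counter as a "queried" flag, so the write path can skip the
bump when nobody has sampled the value since the last change.

8<--
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define I_VERSION_QUERIED 1ULL  /* low bit: "someone sampled this" */

static _Atomic uint64_t i_version;      /* models one inode's field */

/* Called on a change; returns true if the stored value moved. */
static bool inode_maybe_inc_iversion(bool force)
{
        uint64_t cur = atomic_load(&i_version);
        uint64_t newval;

        do {
                if (!force && !(cur & I_VERSION_QUERIED))
                        return false;   /* nobody looked: skip it */
                /* bump the counter (upper bits), clear the flag */
                newval = (cur + 2) & ~I_VERSION_QUERIED;
        } while (!atomic_compare_exchange_weak(&i_version, &cur, newval));
        return true;
}

/* Called when handing the value out (e.g. to nfsd or statx). */
static uint64_t inode_query_iversion(void)
{
        uint64_t cur = atomic_load(&i_version);

        while (!(cur & I_VERSION_QUERIED)) {
                if (atomic_compare_exchange_weak(&i_version, &cur,
                                                 cur | I_VERSION_QUERIED))
                        break;
        }
        return cur >> 1;        /* the counter lives above the flag bit */
}

int main(void)
{
        uint64_t v;

        inode_maybe_inc_iversion(false);        /* skipped: not queried */
        v = inode_query_iversion();             /* marks it queried */
        inode_maybe_inc_iversion(false);        /* bumps this time */
        return inode_query_iversion() == v + 1 ? 0 : 1;
}
8<--

The force case would correspond to updates where the inode is being
dirtied anyway; compare the "add a 'force' parameter to
inode_inc_iversion" entry in the series list below.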

With this, we reduce inode metadata updates across all 3 filesystems
down to roughly the frequency of the timestamp granularity, particularly
when it's not being queried (the vastly common case).

The pessimal workload here is 1 byte writes, and it helps that
significantly. Of course, that's not a real-world workload.

A tiobench-example.fio workload also shows some modest performance
gains, and I've gotten mails from the kernel test robot that show some
significant performance gains on some microbenchmarks (case-msync-mt in
the vm-scalability testsuite to be specific).

I'm happy to run other workloads if anyone can suggest them.

At this point, the patchset works and does what it's expected to do in
my own testing. It seems like it's at least a modest performance win
across all 3 major disk-based filesystems. It may also encourage others
to implement i_version as well since it reduces that cost.

Is this an avenue that's worthwhile to pursue?

Note that I think we may have other changes coming in the future that
will make this sort of cleanup necessary anyway. I'd like to plug in the
Ceph change attribute here eventually, and that will require something
like this anyway.

Thoughts, comments and suggestions are welcome...

---

[1]: On ext4 it must be turned on with the i_version mount option,
 mostly due to fears of incurring this impact, AFAICT.

[2]: NFS also recommends that it appear to increase in value over time, so
 that clients can discard metadata updates that are older than ones
 they've already seen.

Jeff Layton (30):
  lustre: don't set f_version in ll_readdir
  ecryptfs: remove unnecessary i_version bump
  ceph: remove the bump of i_version
  f2fs: don't bother setting i_version
  hpfs: don't bother with the i_version counter
  jfs: remove initialization of i_version counter
  nilfs2: remove inode->i_version initialization
  orangefs: remove initialization of i_version
  reiserfs: remove unneeded i_version bump
  ntfs: remove i_version handling
  fs: new API for handling i_version
  fat: convert to new i_version API
  affs: convert to new i_version API
  afs: convert to new i_version API
  btrfs: convert to new i_version API
  exofs: switch to new i_version API
  ext2: convert to new i_version API
  ext4: convert to new i_version API
  nfs: convert to new i_version API
  nfsd: convert to new i_version API
  ocfs2: convert to new i_version API
  ufs: use new i_version API
  xfs: convert to new i_version API
  IMA: switch IMA over to new i_version API
  fs: add a "force" parameter to inode_inc_iversion
  fs: only set S_VERSION when updating times if it has been queried
  xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need
incrementing
  btrfs: only dirty the inode in btrfs_update_time if something was
changed
  fs: track whether the 
