Re: [PATCH 14/19] xfs: convert to new i_version API

2017-12-14 Thread Jeff Layton
On Thu, 2017-12-14 at 13:17 +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 07:10:22PM -0500, Jeff Layton wrote:
> > On Thu, 2017-12-14 at 10:25 +1100, Dave Chinner wrote:
> > > So now I've looked at the last patch.
> > > 
> > > On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > > > > From: Jeff Layton 
> > > > > 
> > > > > Signed-off-by: Jeff Layton 
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> > > > >  fs/xfs/xfs_icache.c           | 4 ++--
> > > > >  fs/xfs/xfs_inode.c            | 2 +-
> > > > >  fs/xfs/xfs_inode_item.c       | 2 +-
> > > > >  fs/xfs/xfs_trans_inode.c      | 2 +-
> > > > >  5 files changed, 8 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > index 6b7989038d75..6b47de201391 100644
> > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > > > >   to->di_flags= be16_to_cpu(from->di_flags);
> > > > >  
> > > > >   if (to->di_version == 3) {
> > > > > - inode->i_version = be64_to_cpu(from->di_changecount);
> > > > > + inode_set_iversion_queried(inode,
> > > > > +     be64_to_cpu(from->di_changecount));
> > > > 
> > > > So we use the "kernel managed" (really not sure what that means)
> > > > set function here to read it off disk, but...
> > > 
> > > This stores the value from disk in the incore inode as "val << 1",
> > > then sets the lowest bit to indicate that it has been "queried"
> > > so that it will be incremented on the first modification.
> > > 
> > > Why do we initialise values read from disk as "queried"? This means
> > > the i_version will change once every time it's brought into memory
> > > and modified, regardless of whether anyone is looking at it. What
> > > purpose does this serve?
> > > 
> > 
> > I don't think we want to store the QUERIED bit.
> > 
> > It's always possible that we crash at an inopportune time and a query
> > happened vs. this value before this thing hit the backing store.
> > 
> > If we always set the queried bit when we load it from disk, then we know
> > that that scenario is harmless, at the negligible expense of having to
> > bump it on the first write.
> 
> Reasonable. Needs documentation.
> 

Will do.

FWIW, there's another reason to do it this way too: backward
compatibility. If we don't try to store the queried bit then we should
be able to go back and forth between legacy kernels and the ones with
the new i_version handling without any trouble. The older kernels will
just bump the count more frequently.
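
To make the crash argument concrete, here is a rough user-space model. It is not kernel code: the model_* names are invented for illustration, the real helpers work atomically on the inode, and the encoding (counter in the upper bits, "queried" flag in bit 0, bump only when the flag is set) is assumed from the description earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the i_version word in struct inode. */
struct model_inode {
	uint64_t raw;	/* (counter << 1) | queried flag */
};

/* Load an on-disk counter; the flag itself is never read from disk. */
static inline void model_load(struct model_inode *ino, uint64_t on_disk,
			      bool mark_queried)
{
	ino->raw = (on_disk << 1) | (mark_queried ? 1ULL : 0ULL);
}

/* On modification, bump only if the current value was handed out. */
static inline bool model_maybe_inc(struct model_inode *ino)
{
	if (!(ino->raw & 1ULL))
		return false;
	ino->raw = (ino->raw & ~1ULL) + 2;	/* clear flag, counter += 1 */
	return true;
}

static inline uint64_t model_peek(const struct model_inode *ino)
{
	return ino->raw >> 1;
}

/*
 * A client saw counter 'seen', then the machine crashed and the
 * in-memory flag recording that query was lost. After reboot the file
 * is modified once. Returns the counter the client would see next.
 */
static inline uint64_t model_after_crash_and_write(uint64_t seen,
						   bool mark_queried)
{
	struct model_inode ino;

	model_load(&ino, seen, mark_queried);
	model_maybe_inc(&ino);
	return model_peek(&ino);
}

/* Self-check of the two load policies. */
static inline void model_check(void)
{
	/* Flag clear on load: the change stays invisible to the client. */
	assert(model_after_crash_and_write(10, false) == 10);
	/* Flag set on load: the first write bumps the counter. */
	assert(model_after_crash_and_write(10, true) == 11);
}
```

Since the flag never reaches the disk in this model, the on-disk field stays a plain counter, which is what lets older and newer kernels share it.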

> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 801274126648..be6d87980dd5 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -833,7 +833,7 @@ xfs_ialloc(
> > > > >   ip->i_d.di_flags = 0;
> > > > >  
> > > > >   if (ip->i_d.di_version == 3) {
> > > > > - inode->i_version = 1;
> > > > > + inode_set_iversion(inode, 1);
> > > > 
> > > > But here you are using the "filesystem managed" mode to set the
> > > > new value. Why? How is this any different from reading the value
> > > > off disk and setting it?
> > > 
> > > Still don't understand why this is different to reading the inode
> > > from disk
> > 
> > > This is allocating a brand new, never-before-seen inode. There's no
> > way this i_version could have ever been seen, so there's no need to flag
> > it as queried.
> 
> More documentation. People are going to need to know this stuff to
> be able to implement/maintain this stuff in working order - it's no
> longer a simple, obvious "just increment the counter on
> modification" variable and that has potential ramifications for
> filesystems that store this on disk.
> 
> 

Definitely. I'm finding that documenting this has been the hardest part.

Thanks for the review so far!
-- 
Jeff Layton 


Re: [PATCH 14/19] xfs: convert to new i_version API

2017-12-13 Thread Dave Chinner
On Wed, Dec 13, 2017 at 07:10:22PM -0500, Jeff Layton wrote:
> On Thu, 2017-12-14 at 10:25 +1100, Dave Chinner wrote:
> > So now I've looked at the last patch.
> > 
> > On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > > > From: Jeff Layton 
> > > > 
> > > > Signed-off-by: Jeff Layton 
> > > > ---
> > > >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> > > >  fs/xfs/xfs_icache.c           | 4 ++--
> > > >  fs/xfs/xfs_inode.c            | 2 +-
> > > >  fs/xfs/xfs_inode_item.c       | 2 +-
> > > >  fs/xfs/xfs_trans_inode.c      | 2 +-
> > > >  5 files changed, 8 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > index 6b7989038d75..6b47de201391 100644
> > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > > > to->di_flags= be16_to_cpu(from->di_flags);
> > > >  
> > > > if (to->di_version == 3) {
> > > > -   inode->i_version = be64_to_cpu(from->di_changecount);
> > > > +   inode_set_iversion_queried(inode,
> > > > +       be64_to_cpu(from->di_changecount));
> > > 
> > > So we use the "kernel managed" (really not sure what that means)
> > > set function here to read it off disk, but...
> > 
> > This stores the value from disk in the incore inode as "val << 1",
> > then sets the lowest bit to indicate that it has been "queried"
> > so that it will be incremented on the first modification.
> > 
> > Why do we initialise values read from disk as "queried"? This means
> > the i_version will change once every time it's brought into memory
> > and modified, regardless of whether anyone is looking at it. What
> > purpose does this serve?
> > 
> 
> I don't think we want to store the QUERIED bit.
> 
> It's always possible that we crash at an inopportune time and a query
> happened vs. this value before this thing hit the backing store.
> 
> If we always set the queried bit when we load it from disk, then we know
> that that scenario is harmless, at the negligible expense of having to
> bump it on the first write.

Reasonable. Needs documentation.

> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 801274126648..be6d87980dd5 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -833,7 +833,7 @@ xfs_ialloc(
> > > > ip->i_d.di_flags = 0;
> > > >  
> > > > if (ip->i_d.di_version == 3) {
> > > > -   inode->i_version = 1;
> > > > +   inode_set_iversion(inode, 1);
> > > 
> > > But here you are using the "filesystem managed" mode to set the
> > > new value. Why? How is this any different from reading the value
> > > off disk and setting it?
> > 
> > Still don't understand why this is different to reading the inode
> > from disk
> 
> This is allocating a brand new, never-before-seen inode. There's no
> way this i_version could have ever been seen, so there's no need to flag
> it as queried.

More documentation. People are going to need to know this stuff to
be able to implement/maintain this stuff in working order - it's no
longer a simple, obvious "just increment the counter on
modification" variable and that has potential ramifications for
filesystems that store this on disk.
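
As a rough illustration of why this is no longer "just increment on modification", the sketch below models the lazy bump in user space. The model_* names are hypothetical, the real helpers are atomic, and the rule that a bump clears the flag is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the i_version word in struct inode. */
struct model_inode {
	uint64_t raw;	/* (counter << 1) | queried flag */
};

/* Brand new inode: nobody can have seen the value, so the flag is clear. */
static inline void model_set_iversion(struct model_inode *ino, uint64_t val)
{
	ino->raw = val << 1;
}

/* A reader fetches the counter and records that it has been seen. */
static inline uint64_t model_query_iversion(struct model_inode *ino)
{
	ino->raw |= 1ULL;
	return ino->raw >> 1;
}

/* A writer bumps the counter only if the current value was seen. */
static inline bool model_maybe_inc_iversion(struct model_inode *ino)
{
	if (!(ino->raw & 1ULL))
		return false;
	ino->raw = (ino->raw & ~1ULL) + 2;
	return true;
}

/* Walk through a new inode's first few writes and one query. */
static inline void model_check(void)
{
	struct model_inode ino;

	model_set_iversion(&ino, 1);
	assert(!model_maybe_inc_iversion(&ino));	/* unseen: no bump */
	assert(!model_maybe_inc_iversion(&ino));
	assert(model_query_iversion(&ino) == 1);	/* value handed out */
	assert(model_maybe_inc_iversion(&ino));		/* next write bumps */
	assert(!model_maybe_inc_iversion(&ino));	/* new value unseen */
	assert(model_query_iversion(&ino) == 2);
}
```

In this model four writes and two queries move the counter from 1 to 2, which is the behaviour a filesystem author would need documented before deciding what to put on disk.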

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 14/19] xfs: convert to new i_version API

2017-12-13 Thread Jeff Layton
On Thu, 2017-12-14 at 10:25 +1100, Dave Chinner wrote:
> So now I've looked at the last patch.
> 
> On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> > On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > > From: Jeff Layton 
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> > >  fs/xfs/xfs_icache.c           | 4 ++--
> > >  fs/xfs/xfs_inode.c            | 2 +-
> > >  fs/xfs/xfs_inode_item.c       | 2 +-
> > >  fs/xfs/xfs_trans_inode.c      | 2 +-
> > >  5 files changed, 8 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 6b7989038d75..6b47de201391 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > >   to->di_flags= be16_to_cpu(from->di_flags);
> > >  
> > >   if (to->di_version == 3) {
> > > - inode->i_version = be64_to_cpu(from->di_changecount);
> > > + inode_set_iversion_queried(inode,
> > > +be64_to_cpu(from->di_changecount));
> > 
> > So we use the "kernel managed" (really not sure what that means)
> > set function here to read it off disk, but...
> 
> This stores the value from disk in the incore inode as "val << 1",
> then sets the lowest bit to indicate that it has been "queried"
> so that it will be incremented on the first modification.
> 
> Why do we initialise values read from disk as "queried"? This means
> the i_version will change once every time it's brought into memory
> and modified, regardless of whether anyone is looking at it. What
> purpose does this serve?
> 

I don't think we want to store the QUERIED bit.

It's always possible that we crash at an inopportune time and a query
happened vs. this value before this thing hit the backing store.

If we always set the queried bit when we load it from disk, then we know
that that scenario is harmless, at the negligible expense of having to
bump it on the first write.

> > >   to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> > >   to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> > >   to->di_flags2 = be64_to_cpu(from->di_flags2);
> > > @@ -314,7 +315,7 @@ xfs_inode_to_disk(
> > >   to->di_flags = cpu_to_be16(from->di_flags);
> > >  
> > >   if (from->di_version == 3) {
> > > - to->di_changecount = cpu_to_be64(inode->i_version);
> > > + to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));
> > 
> > ... use the raw access mode to put it back on disk.
> 
> This writes the current inode->i_version value directly to disk,
> including the "queried" flag.
> 
> Hence every time this inode cycles through memory and is modified,
> we essentially shift the on-disk i_version value upwards by 1 slot
> (i.e. double its value) when we read it back in from disk.
> 
> Seems like a bug - this is not a monotonically increasing counter
> anymore - after ~60 modification cycles through memory it's going to
> have a practically random value when pulled in off disk, not a
> slowly increasing value.
> 

Good catch. That's definitely a bug. I'll fix it and test again. This
new API went through several iterations. I'll go back through it in more
detail.

I don't expect this to affect performance, but I'll test again
to be sure.

> > >   to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> > >   to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > >   to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > index 43005fbe8b1e..4838462616fd 100644
> > > --- a/fs/xfs/xfs_icache.c
> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -293,14 +293,14 @@ xfs_reinit_inode(
> > >   int error;
> > >   uint32_t nlink = inode->i_nlink;
> > >   uint32_t generation = inode->i_generation;
> > > - uint64_t version = inode->i_version;
> > > + uint64_t version = inode_peek_iversion_raw(inode);
> > >   umode_t mode = inode->i_mode;
> > >  
> > >   error = inode_init_always(mp->m_super, inode);
> > >  
> > >   set_nlink(inode, nlink);
> > >   inode->i_generation = generation;
> > > - inode->i_version = version;
> > > + inode_set_iversion_queried(inode, version);
> > 
> > Again - raw mode to read, kernel managed to set.
> 
> This, again, will double the i_version value. Shouldn't all the XFS
> code just be using inode_peek_iversion(), not the _raw variant?
> 

Yes, indeed. Will fix.
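
For the record, the mismatch in the xfs_reinit_inode() hunk can be reproduced in a small user-space model. This is only a sketch: the model_* helpers are invented stand-ins that mirror the shift-and-flag encoding discussed in this thread, and zeroing the word stands in for inode_init_always().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the i_version word in struct inode. */
struct model_inode {
	uint64_t raw;	/* (counter << 1) | queried flag */
};

static inline void model_set_iversion_queried(struct model_inode *ino,
					      uint64_t val)
{
	ino->raw = (val << 1) | 1ULL;
}

/* Whole word, flag bit included. */
static inline uint64_t model_peek_iversion_raw(const struct model_inode *ino)
{
	return ino->raw;
}

/* Counter only, flag bit shifted out. */
static inline uint64_t model_peek_iversion(const struct model_inode *ino)
{
	return ino->raw >> 1;
}

/* Save/restore as posted: raw read paired with a shifting setter. */
static inline void model_reinit_mismatched(struct model_inode *ino)
{
	uint64_t version = model_peek_iversion_raw(ino);

	ino->raw = 0;	/* stand-in for re-initialising the inode */
	model_set_iversion_queried(ino, version);
}

/* Save/restore with accessors that agree on the encoding. */
static inline void model_reinit_matched(struct model_inode *ino)
{
	uint64_t version = model_peek_iversion(ino);

	ino->raw = 0;
	model_set_iversion_queried(ino, version);
}

/* Compare both variants starting from counter value 7. */
static inline void model_check(void)
{
	struct model_inode a, b;

	model_set_iversion_queried(&a, 7);
	model_set_iversion_queried(&b, 7);
	model_reinit_mismatched(&a);
	model_reinit_matched(&b);
	assert(model_peek_iversion(&a) == 15);	/* (7 << 1) | 1 */
	assert(model_peek_iversion(&b) == 7);	/* preserved */
}
```

The general rule this suggests is to pair a shifting setter with a shifting getter, and to keep the raw accessors for code that deliberately wants the flag bit.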

> > 
> > >   inode->i_mode = mode;
> > >   return error;
> > >  }
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 801274126648..be6d87980dd5 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -833,7 +833,7 @@ xfs_ialloc(
> > >   ip->i_d.di_flags = 0;
> > >  
> > >   if (ip->i_d.di_version == 

Re: [PATCH 14/19] xfs: convert to new i_version API

2017-12-13 Thread Dave Chinner

So now I've looked at the last patch.

On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> >  fs/xfs/xfs_icache.c           | 4 ++--
> >  fs/xfs/xfs_inode.c            | 2 +-
> >  fs/xfs/xfs_inode_item.c       | 2 +-
> >  fs/xfs/xfs_trans_inode.c      | 2 +-
> >  5 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 6b7989038d75..6b47de201391 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > to->di_flags= be16_to_cpu(from->di_flags);
> >  
> > if (to->di_version == 3) {
> > -   inode->i_version = be64_to_cpu(from->di_changecount);
> > +   inode_set_iversion_queried(inode,
> > +  be64_to_cpu(from->di_changecount));
> 
> So we use the "kernel managed" (really not sure what that means)
> set function here to read it off disk, but...

This stores the value from disk in the incore inode as "val << 1",
then sets the lowest bit to indicate that it has been "queried"
so that it will be incremented on the first modification.
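
A minimal sketch of that encoding, as a user-space model rather than the kernel helpers themselves (the model_* names are made up here, and the real code operates atomically on the inode):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the i_version word in struct inode. */
struct model_inode {
	uint64_t raw;
};

/* Store val in the upper bits and set bit 0 as the "queried" flag. */
static inline void model_set_iversion_queried(struct model_inode *ino,
					      uint64_t val)
{
	ino->raw = (val << 1) | 1ULL;
}

/* Return the stored word untouched, flag bit included. */
static inline uint64_t model_peek_iversion_raw(const struct model_inode *ino)
{
	return ino->raw;
}

/* Return only the counter, with the flag bit shifted out. */
static inline uint64_t model_peek_iversion(const struct model_inode *ino)
{
	return ino->raw >> 1;
}

/* Worked example: 5 is stored as (5 << 1) | 1 == 11. */
static inline void model_check(void)
{
	struct model_inode ino;

	model_set_iversion_queried(&ino, 5);
	assert(model_peek_iversion_raw(&ino) == 11);
	assert(model_peek_iversion(&ino) == 5);
}
```

The raw word and the counter differ by a shift and a flag bit, so the two accessors are not interchangeable.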

Why do we initialise values read from disk as "queried"? This means
the i_version will change once every time it's brought into memory
and modified, regardless of whether anyone is looking at it. What
purpose does this serve?

> > to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> > to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> > to->di_flags2 = be64_to_cpu(from->di_flags2);
> > @@ -314,7 +315,7 @@ xfs_inode_to_disk(
> > to->di_flags = cpu_to_be16(from->di_flags);
> >  
> > if (from->di_version == 3) {
> > -   to->di_changecount = cpu_to_be64(inode->i_version);
> > +   to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));
> 
> ... use the raw access mode to put it back on disk.

This writes the current inode->i_version value directly to disk,
including the "queried" flag.

Hence every time this inode cycles through memory and is modified,
we essentially shift the on-disk i_version value upwards by 1 slot
(i.e. double its value) when we read it back in from disk.

Seems like a bug - this is not a monotonically increasing counter
anymore - after ~60 modification cycles through memory it's going to
have a practically random value when pulled in off disk, not a
slowly increasing value.
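
The growth can be worked through with a small model of one load/modify/write-back cycle. The functions below are illustrative only (the model_* names are invented), and they assume that a bump clears the flag and adds one to the counter, which is not spelled out in the patch description.

```c
#include <assert.h>
#include <stdint.h>

/* One cycle as posted: the raw word, flag bit included, goes to disk. */
static inline uint64_t model_buggy_cycle(uint64_t on_disk)
{
	uint64_t raw = (on_disk << 1) | 1ULL;	/* load, marked queried */

	raw = (raw & ~1ULL) + 2;		/* first modification */
	return raw;				/* written back unshifted */
}

/* Same cycle, but the flag bit is shifted out before write-back. */
static inline uint64_t model_fixed_cycle(uint64_t on_disk)
{
	uint64_t raw = (on_disk << 1) | 1ULL;

	raw = (raw & ~1ULL) + 2;
	return raw >> 1;
}

/* Count buggy cycles until the next load would shift a set bit out. */
static inline unsigned int model_cycles_until_overflow(uint64_t on_disk)
{
	unsigned int n = 0;

	while (on_disk <= (UINT64_MAX >> 1)) {
		on_disk = model_buggy_cycle(on_disk);
		n++;
	}
	return n;
}

/* Starting from 1: buggy gives 4, 10, 22; fixed gives 2, 3. */
static inline void model_check(void)
{
	assert(model_buggy_cycle(1) == 4);
	assert(model_buggy_cycle(4) == 10);
	assert(model_buggy_cycle(10) == 22);
	assert(model_fixed_cycle(1) == 2);
	assert(model_fixed_cycle(2) == 3);
}
```

In this model each buggy cycle maps d to 2d + 2, so starting from 1 the on-disk value after n cycles is 3 * 2^n - 2, and model_cycles_until_overflow(1) comes out at 62, in line with the ~60 estimate.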

> > to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> > to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > to->di_flags2 = cpu_to_be64(from->di_flags2);
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 43005fbe8b1e..4838462616fd 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -293,14 +293,14 @@ xfs_reinit_inode(
> > int error;
> > uint32_t nlink = inode->i_nlink;
> > uint32_t generation = inode->i_generation;
> > -   uint64_t version = inode->i_version;
> > +   uint64_t version = inode_peek_iversion_raw(inode);
> > umode_t mode = inode->i_mode;
> >  
> > error = inode_init_always(mp->m_super, inode);
> >  
> > set_nlink(inode, nlink);
> > inode->i_generation = generation;
> > -   inode->i_version = version;
> > +   inode_set_iversion_queried(inode, version);
> 
> Again - raw mode to read, kernel managed to set.

This, again, will double the i_version value. Shouldn't all the XFS
code just be using inode_peek_iversion(), not the _raw variant?

> 
> > inode->i_mode = mode;
> > return error;
> >  }
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 801274126648..be6d87980dd5 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -833,7 +833,7 @@ xfs_ialloc(
> > ip->i_d.di_flags = 0;
> >  
> > if (ip->i_d.di_version == 3) {
> > -   inode->i_version = 1;
> > +   inode_set_iversion(inode, 1);
> 
> But here you are using the "filesystem managed" mode to set the
> new value. Why? How is this any different from reading the value
> off disk and setting it?

Still don't understand why this is different to reading the inode
from disk

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 14/19] xfs: convert to new i_version API

2017-12-13 Thread Dave Chinner

So now I've looked at the last patch .

On Thu, Dec 14, 2017 at 09:48:37AM +1100, Dave Chinner wrote:
> On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> > From: Jeff Layton 
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
> >  fs/xfs/xfs_icache.c   | 4 ++--
> >  fs/xfs/xfs_inode.c| 2 +-
> >  fs/xfs/xfs_inode_item.c   | 2 +-
> >  fs/xfs/xfs_trans_inode.c  | 2 +-
> >  5 files changed, 8 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 6b7989038d75..6b47de201391 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -264,7 +264,8 @@ xfs_inode_from_disk(
> > to->di_flags= be16_to_cpu(from->di_flags);
> >  
> > if (to->di_version == 3) {
> > -   inode->i_version = be64_to_cpu(from->di_changecount);
> > +   inode_set_iversion_queried(inode,
> > +  be64_to_cpu(from->di_changecount));
> 
> So we use the "kernel managed" (really not sure what that means)
> set function here to read it off disk, but...

This stores the value from disk in the incore inode as "val << 1",
then sets the lowest bit to indicate that it has been "queried"
so that it will be incremented on the first modification.

Why do we initialise values read from disk as "queried"? This means
the i_version will change once every time it's brought into memory
and modified, regardless of whether anyone is looking at it. What
purpose does this serve?

> > to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
> > to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
> > to->di_flags2 = be64_to_cpu(from->di_flags2);
> > @@ -314,7 +315,7 @@ xfs_inode_to_disk(
> > to->di_flags = cpu_to_be16(from->di_flags);
> >  
> > if (from->di_version == 3) {
> > -   to->di_changecount = cpu_to_be64(inode->i_version);
> > +   to->di_changecount = 
> > cpu_to_be64(inode_peek_iversion_raw(inode));
> 
> ... use the raw access mode to put it back on disk.

This writes the current inode->i_version value directly to disk,
including the "queried" flag.

Hence every time this inode cycles through memory and is modified,
we essentially shift the on-disk i_version value upwards by 1 slot
(i.e. double it's value) when we read it back in from disk.

Seems like a bug - this is not a monotonically increasing counter
anymore - after ~60 modification cycles through memory it's going to
have an practically random value when pulled in off disk, not a
slowly increasing value.

> > to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
> > to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > to->di_flags2 = cpu_to_be64(from->di_flags2);
> > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > index 43005fbe8b1e..4838462616fd 100644
> > --- a/fs/xfs/xfs_icache.c
> > +++ b/fs/xfs/xfs_icache.c
> > @@ -293,14 +293,14 @@ xfs_reinit_inode(
> > int error;
> > uint32_t nlink = inode->i_nlink;
> > uint32_t generation = inode->i_generation;
> > -   uint64_t version = inode->i_version;
> > +   uint64_t version = inode_peek_iversion_raw(inode);
> > umode_t mode = inode->i_mode;
> >  
> > error = inode_init_always(mp->m_super, inode);
> >  
> > set_nlink(inode, nlink);
> > inode->i_generation = generation;
> > -   inode->i_version = version;
> > +   inode_set_iversion_queried(inode, version);
> 
> Again - raw mode to read, kernel managed to set.

This, again, will double the i_version value. Shouldn't all the XFS
code just be using inode_peek_iversion(), not the _raw variant?
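
A small sketch of why the choice of accessor matters for a save/restore pair like the one above. Again, this is a userspace model under the assumptions stated in this thread (value stored as "val << 1" plus a low "queried" bit). The `model_*` helpers are hypothetical stand-ins and not the real functions from the patch series.

```c
#include <stdint.h>

/* Setter: shift the value up and set the queried bit. */
static uint64_t model_set_queried(uint64_t val)
{
	return (val << 1) | 1;
}

/* Raw read: return the stored bits unchanged, flag included. */
static uint64_t model_peek_raw(uint64_t raw)
{
	return raw;
}

/* Non-raw read: drop the flag bit and return only the counter. */
static uint64_t model_peek(uint64_t raw)
{
	return raw >> 1;
}

/* Save with the raw accessor, restore with the shifting setter. */
static uint64_t model_reinit_raw(uint64_t raw)
{
	return model_set_queried(model_peek_raw(raw));
}

/* Save with the non-raw accessor, restore with the shifting setter. */
static uint64_t model_reinit_peek(uint64_t raw)
{
	return model_set_queried(model_peek(raw));
}
```

With a counter of 7 stored as 15, model_reinit_peek() returns 15 again, whereas model_reinit_raw() returns 31, i.e. a counter of 15. In this model, pairing a raw read with a shifting setter is what inflates the value.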

> 
> > inode->i_mode = mode;
> > return error;
> >  }
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 801274126648..be6d87980dd5 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -833,7 +833,7 @@ xfs_ialloc(
> > ip->i_d.di_flags = 0;
> >  
> > if (ip->i_d.di_version == 3) {
> > -   inode->i_version = 1;
> > +   inode_set_iversion(inode, 1);
> 
> But here you are using the "filesystem managed" mode to set the
> new value. Why? How is this any different from reading the value
> off disk and setting it?

I still don't understand why this is different to reading the inode
from disk.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 14/19] xfs: convert to new i_version API

2017-12-13 Thread Dave Chinner
On Wed, Dec 13, 2017 at 09:20:12AM -0500, Jeff Layton wrote:
> From: Jeff Layton 
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_inode_buf.c | 5 +++--
>  fs/xfs/xfs_icache.c   | 4 ++--
>  fs/xfs/xfs_inode.c| 2 +-
>  fs/xfs/xfs_inode_item.c   | 2 +-
>  fs/xfs/xfs_trans_inode.c  | 2 +-
>  5 files changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6b7989038d75..6b47de201391 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -264,7 +264,8 @@ xfs_inode_from_disk(
>   to->di_flags = be16_to_cpu(from->di_flags);
>  
>   if (to->di_version == 3) {
> - inode->i_version = be64_to_cpu(from->di_changecount);
> + inode_set_iversion_queried(inode,
> +be64_to_cpu(from->di_changecount));

So we use the "kernel managed" (really not sure what that means)
set function here to read it off disk, but...

>   to->di_crtime.t_sec = be32_to_cpu(from->di_crtime.t_sec);
>   to->di_crtime.t_nsec = be32_to_cpu(from->di_crtime.t_nsec);
>   to->di_flags2 = be64_to_cpu(from->di_flags2);
> @@ -314,7 +315,7 @@ xfs_inode_to_disk(
>   to->di_flags = cpu_to_be16(from->di_flags);
>  
>   if (from->di_version == 3) {
> - to->di_changecount = cpu_to_be64(inode->i_version);
> + to->di_changecount = cpu_to_be64(inode_peek_iversion_raw(inode));

... use the raw access mode to put it back on disk.

>   to->di_crtime.t_sec = cpu_to_be32(from->di_crtime.t_sec);
>   to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>   to->di_flags2 = cpu_to_be64(from->di_flags2);
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 43005fbe8b1e..4838462616fd 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -293,14 +293,14 @@ xfs_reinit_inode(
>   int error;
>   uint32_t nlink = inode->i_nlink;
>   uint32_t generation = inode->i_generation;
> - uint64_t version = inode->i_version;
> + uint64_t version = inode_peek_iversion_raw(inode);
>   umode_t mode = inode->i_mode;
>  
>   error = inode_init_always(mp->m_super, inode);
>  
>   set_nlink(inode, nlink);
>   inode->i_generation = generation;
> - inode->i_version = version;
> + inode_set_iversion_queried(inode, version);

Again - raw mode to read, kernel managed to set.

>   inode->i_mode = mode;
>   return error;
>  }
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 801274126648..be6d87980dd5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -833,7 +833,7 @@ xfs_ialloc(
>   ip->i_d.di_flags = 0;
>  
>   if (ip->i_d.di_version == 3) {
> - inode->i_version = 1;
> + inode_set_iversion(inode, 1);

But here you are using the "filesystem managed" mode to set the
new value. Why? How is this any different from reading the value
off disk and setting it?

> +++ b/fs/xfs/xfs_trans_inode.c
> @@ -117,7 +117,7 @@ xfs_trans_log_inode(
>*/
>   if (!(ip->i_itemp->ili_item.li_desc->lid_flags & XFS_LID_DIRTY) &&
>   IS_I_VERSION(VFS_I(ip))) {
> - VFS_I(ip)->i_version++;
> + inode_inc_iversion(VFS_I(ip));
>   flags |= XFS_ILOG_CORE;
>   }

And isn't this a case of "filesystem managed" iversion behaviour?

Basically, I can't make head or tail of why the different API
functions are used here.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

