Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-20 Thread Jeff Layton
On Wed, 2023-09-20 at 16:53 +0200, Christian Brauner wrote:
> > You could put it behind an EXPERIMENTAL Kconfig option so that the
> > code stays in and can be used by the brave or foolish while it is
> > still being refined.
> 
> Given that the discussion has now fully gone back to the drawing board
> and this is a regression the honest thing to do is to revert the five
> patches that introduce the infrastructure:
> 
> ffb6cf19e063 ("fs: add infrastructure for multigrain timestamps")
> d48c33972916 ("tmpfs: add support for multigrain timestamps")
> e44df2664746 ("xfs: switch to multigrain timestamps")
> 0269b585868e ("ext4: switch to multigrain timestamps")
> 50e9ceef1d4f ("btrfs: convert to multigrain timestamps")
> 
> The conversion to helpers and cleanups are sane and should stay and can
> be used for any solution that gets built on top of it.
> 
> I'd appreciate a look at the branch here:
> git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git vfs.ctime.revert
> 
> survives xfstests.

I think that's probably the wisest course of action. I need some time to
ponder the options for this series anyway, and another cycle in next
wouldn't hurt.

The branch itself looks fine, but you might want to reverse the order of
the patches in case someone lands there in the middle of a bisect. IOW,
I think you want to revert the "convert to multigrain" patches before
you revert the infrastructure. 
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-20 Thread Jeff Layton
On Wed, 2023-09-20 at 14:48 +0200, Jan Kara wrote:
> On Wed 20-09-23 06:35:18, Jeff Layton wrote:
> > On Wed, 2023-09-20 at 12:17 +0200, Jan Kara wrote:
> > > If I were a sysadmin, I'd rather opt for something like
> > > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > > functionality). That should avoid the IO overhead of finegrained timestamps
> > > as well and I'd know I can have problems with timestamps only after a
> > > system crash.
> > 
> > > I've just got another idea how we could solve the problem: Couldn't we
> > > always just report coarsegrained timestamp to userspace and provide access
> > > to finegrained value only to NFS which should know what it's doing?
> > > 
> > 
> > I think that'd be hard. First of all, where would we store the second
> > timestamp? We can't just truncate the fine-grained ones to come up with
> > a coarse-grained one. It might also be confusing having nfsd and local
> > filesystems present different attributes.
> 
> So what I had in mind (and I definitely miss all the NFS intricacies so the
> idea may be bogus) was that inode->i_ctime would be maintained exactly as
> is now. There will be new (kernel internal at least for now) STATX flag
> STATX_MULTIGRAIN_TS. fill_mg_cmtime() will return timestamp truncated to
> sb->s_time_gran unless STATX_MULTIGRAIN_TS is set. Hence unless you set
> STATX_MULTIGRAIN_TS, there is no difference in the returned timestamps
> compared to the state before multigrain timestamps were introduced. With
> STATX_MULTIGRAIN_TS we return full precision timestamp as stored in the
> inode. Then NFS in fh_fill_pre_attrs() and fh_fill_post_attrs() needs to
> make sure STATX_MULTIGRAIN_TS is set when calling vfs_getattr() to get
> multigrain time.

> I agree nfsd may now be presenting slightly different timestamps than user
> is able to see with stat(2) directly on the filesystem. But is that a
> problem? Essentially it is a similar solution as the mgtime mount option
> but now sysadmin doesn't have to decide on filesystem mount how to report
> timestamps but the stat caller knowingly opts into possibly inconsistent
> (among files) but high precision timestamps. And in the particular NFS
> usecase where stat is called all the time anyway, timestamps will likely
> even be consistent among files.
> 

I like this idea...

Would we also need to raise sb->s_time_gran to something corresponding
to HZ on these filesystems? If we truncate the timestamps at a
granularity corresponding to HZ before presenting them via statx and the
like then that should work around the problem with programs that compare
timestamps between inodes.
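
To make the truncation idea concrete, here is a minimal userspace sketch. It is not kernel code: DEMO_WANT_FINE_TS is a hypothetical stand-in for the proposed STATX_MULTIGRAIN_TS flag, report_ts() is an invented helper, and the 4 ms granularity assumes HZ=250. It only shows that two fine-grained stamps taken inside one jiffy compare equal once truncated, while a caller that opts in still sees the full value.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical opt-in bit standing in for the proposed STATX_MULTIGRAIN_TS. */
#define DEMO_WANT_FINE_TS 0x1u

/* Round ts down to a multiple of gran_ns.
 * Assumes gran_ns divides one second evenly (e.g. 4000000 for HZ=250). */
static struct timespec truncate_ts(struct timespec ts, long gran_ns)
{
	assert(gran_ns > 0 && gran_ns <= 1000000000L);
	ts.tv_nsec -= ts.tv_nsec % gran_ns;
	return ts;
}

/* Report a stored timestamp: full precision only when the caller opts in. */
static struct timespec report_ts(struct timespec stored, unsigned int mask,
				 long gran_ns)
{
	if (mask & DEMO_WANT_FINE_TS)
		return stored;
	return truncate_ts(stored, gran_ns);
}

/* Three-way comparison, in the spirit of timespec64_compare(). */
static int ts_cmp(struct timespec a, struct timespec b)
{
	if (a.tv_sec != b.tv_sec)
		return a.tv_sec < b.tv_sec ? -1 : 1;
	if (a.tv_nsec != b.tv_nsec)
		return a.tv_nsec < b.tv_nsec ? -1 : 1;
	return 0;
}
```

With a granularity of 4000000 ns, stamps of 100.008000300 and 100.008000000 are reported identically to an ordinary caller, so a cross-inode comparison can no longer see one as "earlier" because of mixed precision.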

With NFSv4, when a filesystem doesn't report a STATX_CHANGE_COOKIE, nfsd
will fake one up using the ctime. It's fine for that to use a full fine-
grained timestamp since we don't expect to be able to compare that value
with one of a different inode.

I think we'd want nfsd to present the mtime/ctime values as truncated,
just like we would with a local fs. We could hit the same problem of an
earlier-looking timestamp with NFS if we try to present the actual fine-
grained values to the clients. IOW, I'm convinced that we need to avoid
this behavior in most situations.

If we do this, then we technically don't need the mount option either.
We could still add it though, and have it govern whether fill_mg_cmtime
truncates the timestamps before storing them in the kstat.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-20 Thread Jeff Layton
On Wed, 2023-09-20 at 14:08 +0200, Christian Brauner wrote:
> > I wasn't proposing to do that work for v6.6. For that, we absolutely
> > either need the mount option or to just revert the mgtime conversions.
> 
> This sounds like you want me to do a full-on revert of your series but
> why? The conversion and changes support an actual use-case and are fine.
> It's a matter of whether we unconditionally expose it to users or not.
> 

I don't, actually. I'm just mentioning that it's possible if we find the
mount option to be unpalatable.

> @Jan, what do you think?
> 
> > My plan was to take a stab at doing this for a later kernel release.
> 
> Ok.

If it works out, then we may be able to eventually remove the mount
option, but that is a separate project altogether.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-20 Thread Jeff Layton
On Wed, 2023-09-20 at 13:48 +0200, Christian Brauner wrote:
> > > > While we initially thought we can do this unconditionally it turns out
> > > > that this might break existing workloads that rely on timestamps in very
> > > > specific ways and we always knew this was a possibility. Move
> > > > multi-grain timestamps behind a vfs mount option.
> > > 
> > > > Surely this is a safe choice as it moves the responsibility to the sysadmin
> > > and the cases where finegrained timestamps are required. But I kind of
> > > wonder how is the sysadmin going to decide whether mgtime is safe for his
> > > system or not? Because the possible breakage needn't be obvious at the
> > > first sight...
> > > 
> > 
> > That's the main reason I really didn't want to go with a mount option.
> > Documenting that may be difficult. While there is some pessimism around
> > it, I may still take a stab at just advancing the coarse clock whenever
> > we fetch a fine-grained timestamp. It'd be nice to remove this option in
> > the future if that turns out to be feasible.
> > 
> > > If I were a sysadmin, I'd rather opt for something like
> > > finegrained timestamps + lazytime (if I needed the finegrained timestamps
> > > functionality). That should avoid the IO overhead of finegrained timestamps
> > > as well and I'd know I can have problems with timestamps only after a
> > > system crash.
> > 
> > > I've just got another idea how we could solve the problem: Couldn't we
> > > always just report coarsegrained timestamp to userspace and provide access
> > > to finegrained value only to NFS which should know what it's doing?
> > > 
> > 
> > I think that'd be hard. First of all, where would we store the second
> > timestamp? We can't just truncate the fine-grained ones to come up with
> > a coarse-grained one. It might also be confusing having nfsd and local
> > filesystems present different attributes.
> 
> As far as I can tell we have two options. The first one is to make this
> into a mount option which I really think isn't a big deal and lets us
> avoid this whole problem while allowing filesystems exposed via NFS to
> make use of this feature for change tracking.
> 
> The second option is that we turn off fine-grained timestamps for v6.6
> and you get to explore other options.
> 
> It isn't a big deal; regressions like this were always to be expected, but
> v6.6 needs to stabilize so anything that requires more significant work
> is not an option.

Oh, absolutely.

I wasn't proposing to do that work for v6.6. For that, we absolutely
either need the mount option or to just revert the mgtime conversions.

My plan was to take a stab at doing this for a later kernel release.
This is very much a "back to the drawing board" idea. It may not pan out
after all, but if it does then we could consider removing the mount
option at that point.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-20 Thread Jeff Layton
 @@ -44,6 +44,7 @@ static const struct constant_table common_set_sb_flag[] = 
> > {
> > { "mand",   SB_MANDLOCK },
> > { "ro", SB_RDONLY },
> > { "sync",   SB_SYNCHRONOUS },
> > +   { "mgtime", SB_MGTIME },
> > { },
> >  };
> >  
> > @@ -52,18 +53,32 @@ static const struct constant_table 
> > common_clear_sb_flag[] = {
> > { "nolazytime", SB_LAZYTIME },
> > { "nomand", SB_MANDLOCK },
> > { "rw", SB_RDONLY },
> > +   { "nomgtime",   SB_MGTIME },
> > { },
> >  };
> >  
> > +static inline int check_mgtime(unsigned int token, const struct fs_context *fc)
> > +{
> > +   if (token != SB_MGTIME)
> > +           return 0;
> > +   if (!(fc->fs_type->fs_flags & FS_MGTIME))
> > +           return invalf(fc, "Filesystem doesn't support multi-grain timestamps");
> > +   return 0;
> > +}
> > +
> >  /*
> >   * Check for a common mount option that manipulates s_flags.
> >   */
> >  static int vfs_parse_sb_flag(struct fs_context *fc, const char *key)
> >  {
> > unsigned int token;
> > +   int ret;
> >  
> > token = lookup_constant(common_set_sb_flag, key, 0);
> > if (token) {
> > +   ret = check_mgtime(token, fc);
> > +   if (ret)
> > +   return ret;
> > fc->sb_flags |= token;
> > fc->sb_flags_mask |= token;
> > return 0;
> > @@ -71,6 +86,9 @@ static int vfs_parse_sb_flag(struct fs_context *fc, const 
> > char *key)
> >  
> > token = lookup_constant(common_clear_sb_flag, key, 0);
> > if (token) {
> > +   ret = check_mgtime(token, fc);
> > +   if (ret)
> > +   return ret;
> > fc->sb_flags &= ~token;
> > fc->sb_flags_mask |= token;
> > return 0;
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 54237f4242ff..fd1a2390aaa3 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2141,7 +2141,7 @@ EXPORT_SYMBOL(current_mgtime);
> >  
> >  static struct timespec64 current_ctime(struct inode *inode)
> >  {
> > -   if (is_mgtime(inode))
> > +   if (IS_MGTIME(inode))
> > return current_mgtime(inode);
> > return current_time(inode);
> >  }
> > @@ -2588,7 +2588,7 @@ struct timespec64 inode_set_ctime_current(struct 
> > inode *inode)
> > now = current_time(inode);
> >  
> > /* Just copy it into place if it's not multigrain */
> > -   if (!is_mgtime(inode)) {
> > +   if (!IS_MGTIME(inode)) {
> > inode_set_ctime_to_ts(inode, now);
> > return now;
> > }
> > diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> > index 250eb5bf7b52..08f5bf4d2c6c 100644
> > --- a/fs/proc_namespace.c
> > +++ b/fs/proc_namespace.c
> > @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct 
> > super_block *sb)
> > { SB_DIRSYNC, ",dirsync" },
> > { SB_MANDLOCK, ",mand" },
> > { SB_LAZYTIME, ",lazytime" },
> > +   { SB_MGTIME, ",mgtime" },
> > { 0, NULL }
> > };
> > const struct proc_fs_opts *fs_infop;
> > diff --git a/fs/stat.c b/fs/stat.c
> > index 6e60389d6a15..2f18dd5de18b 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -90,7 +90,7 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 
> > request_mask,
> > stat->size = i_size_read(inode);
> > stat->atime = inode->i_atime;
> >  
> > -   if (is_mgtime(inode)) {
> > +   if (IS_MGTIME(inode)) {
> > fill_mg_cmtime(stat, request_mask, inode);
> > } else {
> > stat->mtime = inode->i_mtime;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 4aeb3fa11927..03e415fb3a7c 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1114,6 +1114,7 @@ extern int send_sigurg(struct fown_struct *fown);
> >  #define SB_NODEV    BIT(2) /* Disallow access to device special files */
> >  #define SB_NOEXEC   BIT(3) /* Disallow program execution */
> >  #define SB_SYNCHRONOUS  BIT(4) /* Writes are synced at once */
> > +#define SB_MGTIME   BIT(5) /* Use multi-grain timestamps */
> >  #define SB_MANDLOCK BIT(6) /* Allow mandatory locks on an FS */
> >  #define SB_DIRSYNC  BIT(7) /* Directory modifications are synchronous */
> >  #define SB_NOATIME  BIT(10)/* Do not update access times. */
> > @@ -2105,6 +2106,7 @@ static inline bool sb_rdonly(const struct super_block 
> > *sb) { return sb->s_flags
> > ((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
> >  #define IS_MANDLOCK(inode) __IS_FLG(inode, SB_MANDLOCK)
> >  #define IS_NOATIME(inode)  __IS_FLG(inode, SB_RDONLY|SB_NOATIME)
> > +#define IS_MGTIME(inode)   __IS_FLG(inode, SB_MGTIME)
> >  #define IS_I_VERSION(inode) __IS_FLG(inode, SB_I_VERSION)
> >  
> >  #define IS_NOQUOTA(inode)  ((inode)->i_flags & S_NOQUOTA)
> > @@ -2366,7 +2368,7 @@ struct file_system_type {
> >   */
> >  static inline bool is_mgtime(const struct inode *inode)
> >  {
> > -   return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> > +   return inode->i_sb->s_flags & SB_MGTIME;
> >  }
> >  
> >  extern struct dentry *mount_bdev(struct file_system_type *fs_type,
> > -- 
> > 2.34.1
> > 

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-20 Thread Jeff Layton
;
>   fc->sb_flags &= ~token;
>   fc->sb_flags_mask |= token;
>   return 0;
> diff --git a/fs/inode.c b/fs/inode.c
> index 54237f4242ff..fd1a2390aaa3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2141,7 +2141,7 @@ EXPORT_SYMBOL(current_mgtime);
>  
> 
>  static struct timespec64 current_ctime(struct inode *inode)
>  {
> - if (is_mgtime(inode))
> + if (IS_MGTIME(inode))
>   return current_mgtime(inode);
>   return current_time(inode);
>  }
> @@ -2588,7 +2588,7 @@ struct timespec64 inode_set_ctime_current(struct inode 
> *inode)
>   now = current_time(inode);
>  
> 
>   /* Just copy it into place if it's not multigrain */
> - if (!is_mgtime(inode)) {
> + if (!IS_MGTIME(inode)) {
>   inode_set_ctime_to_ts(inode, now);
>   return now;
>   }
> diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c
> index 250eb5bf7b52..08f5bf4d2c6c 100644
> --- a/fs/proc_namespace.c
> +++ b/fs/proc_namespace.c
> @@ -49,6 +49,7 @@ static int show_sb_opts(struct seq_file *m, struct 
> super_block *sb)
>   { SB_DIRSYNC, ",dirsync" },
>   { SB_MANDLOCK, ",mand" },
>   { SB_LAZYTIME, ",lazytime" },
> + { SB_MGTIME, ",mgtime" },
>   { 0, NULL }
>   };
>   const struct proc_fs_opts *fs_infop;
> diff --git a/fs/stat.c b/fs/stat.c
> index 6e60389d6a15..2f18dd5de18b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -90,7 +90,7 @@ void generic_fillattr(struct mnt_idmap *idmap, u32 
> request_mask,
>   stat->size = i_size_read(inode);
>   stat->atime = inode->i_atime;
>  
> 
> - if (is_mgtime(inode)) {
> + if (IS_MGTIME(inode)) {
>   fill_mg_cmtime(stat, request_mask, inode);
>   } else {
>   stat->mtime = inode->i_mtime;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 4aeb3fa11927..03e415fb3a7c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1114,6 +1114,7 @@ extern int send_sigurg(struct fown_struct *fown);
>  #define SB_NODEV    BIT(2)   /* Disallow access to device special files */
>  #define SB_NOEXEC   BIT(3)   /* Disallow program execution */
>  #define SB_SYNCHRONOUS  BIT(4)   /* Writes are synced at once */
> +#define SB_MGTIME   BIT(5)   /* Use multi-grain timestamps */
>  #define SB_MANDLOCK BIT(6)   /* Allow mandatory locks on an FS */
>  #define SB_DIRSYNC  BIT(7)   /* Directory modifications are synchronous */
>  #define SB_NOATIME  BIT(10)  /* Do not update access times. */
> @@ -2105,6 +2106,7 @@ static inline bool sb_rdonly(const struct super_block 
> *sb) { return sb->s_flags
>   ((inode)->i_flags & (S_SYNC|S_DIRSYNC)))
>  #define IS_MANDLOCK(inode)   __IS_FLG(inode, SB_MANDLOCK)
>  #define IS_NOATIME(inode)    __IS_FLG(inode, SB_RDONLY|SB_NOATIME)
> +#define IS_MGTIME(inode) __IS_FLG(inode, SB_MGTIME)
>  #define IS_I_VERSION(inode)  __IS_FLG(inode, SB_I_VERSION)
>  
> 
>  #define IS_NOQUOTA(inode)    ((inode)->i_flags & S_NOQUOTA)
> @@ -2366,7 +2368,7 @@ struct file_system_type {
>   */
>  static inline bool is_mgtime(const struct inode *inode)
>  {
> - return inode->i_sb->s_type->fs_flags & FS_MGTIME;
> + return inode->i_sb->s_flags & SB_MGTIME;
>  }
>  
> 
>  extern struct dentry *mount_bdev(struct file_system_type *fs_type,

The mount option looks reasonable. Thanks for throwing together the
patch. Maybe in the future we can come up with a way to mitigate the
problems and do this unconditionally?

Reviewed-by: Jeff Layton 
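
For readers skimming the diff, the option handling in the quoted patch can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the kernel implementation: every demo_* name and DEMO_* value is invented for illustration, the tables are cut down to two entries each, and -ENOENT merely stands in for "not a common flag". The -EINVAL return mirrors what invalf() is understood to produce.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for SB_* and FS_* bits; values are illustrative. */
#define DEMO_SB_RDONLY (1u << 0)
#define DEMO_SB_MGTIME (1u << 5)
#define DEMO_FS_MGTIME (1u << 6)

struct demo_constant {
	const char *name;
	unsigned int value;
};

static const struct demo_constant demo_set_flags[] = {
	{ "ro", DEMO_SB_RDONLY },
	{ "mgtime", DEMO_SB_MGTIME },
	{ NULL, 0 },
};

static const struct demo_constant demo_clear_flags[] = {
	{ "rw", DEMO_SB_RDONLY },
	{ "nomgtime", DEMO_SB_MGTIME },
	{ NULL, 0 },
};

struct demo_fs_context {
	unsigned int fs_flags;      /* what the filesystem type supports */
	unsigned int sb_flags;      /* flags requested for the superblock */
	unsigned int sb_flags_mask; /* flags that were explicitly mentioned */
};

/* Return the value for key, or 0 if the table has no such entry. */
static unsigned int demo_lookup(const struct demo_constant *tbl, const char *key)
{
	for (; tbl->name; tbl++)
		if (strcmp(tbl->name, key) == 0)
			return tbl->value;
	return 0;
}

/* Reject (no)mgtime when the filesystem type never opted in. */
static int demo_check_mgtime(unsigned int token, const struct demo_fs_context *fc)
{
	if (token != DEMO_SB_MGTIME)
		return 0;
	return (fc->fs_flags & DEMO_FS_MGTIME) ? 0 : -EINVAL;
}

/* 0 on success, -EINVAL if unsupported, -ENOENT if key is not a common flag. */
static int demo_parse_sb_flag(struct demo_fs_context *fc, const char *key)
{
	unsigned int token;
	int ret;

	assert(fc && key);
	token = demo_lookup(demo_set_flags, key);
	if (token) {
		ret = demo_check_mgtime(token, fc);
		if (ret)
			return ret;
		fc->sb_flags |= token;
		fc->sb_flags_mask |= token;
		return 0;
	}
	token = demo_lookup(demo_clear_flags, key);
	if (token) {
		ret = demo_check_mgtime(token, fc);
		if (ret)
			return ret;
		fc->sb_flags &= ~token;
		fc->sb_flags_mask |= token;
		return 0;
	}
	return -ENOENT;
}
```

The point of the check is that "mgtime" fails early and leaves the context untouched on a filesystem type that never set its capability bit, instead of silently setting a superblock flag that nothing honours.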



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-19 Thread Jeff Layton
On Tue, 2023-09-19 at 13:10 -0700, Paul Eggert wrote:
> On 2023-09-19 09:31, Jeff Layton wrote:
> > The typical case for make
> > timestamp comparisons is comparing source files vs. a build target. If
> > those are being written nearly simultaneously, then that could be an
> > issue, but is that a typical behavior?
> 
> I vaguely remember running into problems with 'make' a while ago 
> (perhaps with a BSDish system) when filesystem timestamps were 
> arbitrarily truncated in some cases but not others. These files would 
> look older than they really were, so 'make' would think they were 
> up-to-date when they weren't, and 'make' would omit actions that it 
> should have done, thus screwing up the build.
> 
> File timestamps can be close together with 'make -j' on fast hosts. 
> Sometimes a shell script (or 'make' itself) will run 'make', then modify 
> a file F, then immediately run 'make' again; the latter 'make' won't 
> work if F's timestamp is mistakenly older than targets that depend on it.
> 
> Although 'make'-like apps are the biggest canaries in this coal mine, 
> the issue also affects 'find -newer' (as Bruno mentioned), 'rsync -u', 
> 'mv -u', 'tar -u', Emacs file-newer-than-file-p, and surely many other 
> places. For example, any app that creates a timestamp file, then backs 
> up all files newer than that file, would be at risk.
> 
> 
> > I wonder if it would be feasible to just advance the coarse-grained
> > current_time whenever we end up updating a ctime with a fine-grained
> > timestamp?
> 
> Wouldn't this need to be done globally, that is, not just on a per-file 
> or per-filesystem basis? If so, I don't see how we'd avoid locking 
> performance issues.
> 

Maybe. Another idea might be to introduce a new timekeeper for
multigrain filesystems, but all of those would likely have to share the
same coarse-grained clock source.

So yeah, if you stat an inode and then update it, any inode written on a
multigrain filesystem within the same jiffy-sized window would have to
log an extra transaction to write out the inode. That's what I meant
when I was talking about write amplification.

> 
> PS. Although I'm no expert in the Linux inode code I hope you don't mind 
> my asking a question about this part of inode_set_ctime_current:
> 
>   /*
>* If we've recently updated with a fine-grained timestamp,
>* then the coarse-grained one may still be earlier than the
>* existing ctime. Just keep the existing value if so.
>*/
>   ctime.tv_sec = inode->__i_ctime.tv_sec;
>   if (timespec64_compare(&ctime, &now) > 0)
>   return ctime;
> 
> Suppose root used clock_settime to set the clock backwards. Won't this 
> code incorrectly refuse to update the file's timestamp afterwards? That 
> is, shouldn't the last line be "goto fine_grained;" rather than "return 
> ctime;", with the comment changed from "keep the existing value" to "use 
> a fine-grained value"?

It is a problem, and Linus pointed that out yesterday, which is why I
sent this earlier today:

https://lore.kernel.org/linux-fsdevel/20230919-ctime-v1-1-97b3da92f...@kernel.org/T/#u

Bear in mind that we're dealing with a situation where the value has not
been queried since its last update, so we don't need to use a fine-grained
timestamp there (and really, it's preferable not to do so). A coarse one
should be fine in this case.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-19 Thread Jeff Layton
On Tue, 2023-09-19 at 16:52 +0200, Bruno Haible wrote:
> Jeff Layton wrote:
> > I'm not sure what we can do for this test. The nap() function is making
> > an assumption that the timestamp granularity will be constant, and that
> > isn't necessarily the case now.
> 
> This is only of secondary importance, because the scenario by Jan Kara
> shows a much more fundamental breakage:
> 
> > > The ultimate problem is that a sequence like:
> > > 
> > > write(f1)
> > > stat(f2)
> > > write(f2)
> > > stat(f2)
> > > write(f1)
> > > stat(f1)
> > > 
> > > can result in f1's timestamp being (slightly) lower than the final f2
> > > timestamp because the second write to f1 didn't bother updating the
> > > timestamp. That can indeed be a bit confusing to programs if they compare
> > > timestamps between two files. Jeff?
> > > 
> > 
> > Basically yes.
> 
> f1 was last written to *after* f2 was last written to. If the timestamp of f1
> is then lower than the timestamp of f2, timestamps are fundamentally broken.
> 
> Many things in user-space depend on timestamps, such as build systems
> centered around 'make', but also 'find ... -newer ...'.
> 


What does breakage with make look like in this situation? The "fuzz"
here is going to be on the order of a jiffy. The typical case for make
timestamp comparisons is comparing source files vs. a build target. If
those are being written nearly simultaneously, then that could be an
issue, but is that a typical behavior? It seems like it would be hard to
rely on that anyway, esp. given filesystems like NFS that can do lazy
writeback.

One of the operating principles with this series is that timestamps can
be of varying granularity between different files. Note that Linux
already violates this assumption when you're working across filesystems
of different types.

As to potential fixes if this is a real problem:

I don't really want to put this behind a mount or mkfs option (a'la
relatime, etc.), but that is one possibility.

I wonder if it would be feasible to just advance the coarse-grained
current_time whenever we end up updating a ctime with a fine-grained
timestamp? It might produce some inode write amplification. Files that
were written within the same jiffy could see more inode transactions
logged, but that still might not be _too_ awful.
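
A rough single-threaded model of that idea, for illustration only: the coarse clock keeps a floor equal to the latest fine-grained stamp it has handed out, so a later coarse stamp can never sort before it. All demo_* names are hypothetical, the 4 ms tick assumes HZ=250, and the sketch ignores the locking and global-state cost that a real implementation would have to solve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_JIFFY_NS 4000000ULL /* assumed tick length (HZ=250) */

struct demo_clock {
	uint64_t floor_ns; /* latest fine-grained stamp handed out so far */
};

struct demo_inode {
	uint64_t ctime_ns; /* last recorded timestamp */
	bool queried;      /* set by stat, cleared by the next write */
};

/* Coarse time: last tick boundary, but never earlier than the floor. */
static uint64_t demo_coarse(const struct demo_clock *clk, uint64_t now_ns)
{
	uint64_t tick = now_ns - now_ns % DEMO_JIFFY_NS;

	return tick > clk->floor_ns ? tick : clk->floor_ns;
}

/* Fine time: the exact value, which also advances the coarse floor. */
static uint64_t demo_fine(struct demo_clock *clk, uint64_t now_ns)
{
	if (now_ns > clk->floor_ns)
		clk->floor_ns = now_ns;
	return now_ns;
}

static void demo_stat(struct demo_inode *inode)
{
	inode->queried = true;
}

/* Returns true when the stored timestamp changed, i.e. when the inode
 * would have to be logged again. */
static bool demo_write(struct demo_clock *clk, struct demo_inode *inode,
		       uint64_t now_ns)
{
	uint64_t stamp = inode->queried ? demo_fine(clk, now_ns)
					: demo_coarse(clk, now_ns);
	bool changed = stamp > inode->ctime_ns;

	if (changed)
		inode->ctime_ns = stamp;
	inode->queried = false;
	return changed;
}
```

Replaying the quoted write/stat sequence inside one tick, the second write to f1 now picks up the floor left by f2's fine-grained stamp, so f1 no longer sorts before f2. The cost shows up in the return value: that second write changes f1's timestamp within the same jiffy, which is the extra inode transaction described above.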

I'll keep thinking about it for now.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-09-19 Thread Jeff Layton
On Tue, 2023-09-19 at 13:04 +0200, Jan Kara wrote:
> On Tue 19-09-23 15:05:24, Xi Ruoyao wrote:
> > On Mon, 2023-08-07 at 15:38 -0400, Jeff Layton wrote:
> > > Enable multigrain timestamps, which should ensure that there is an
> > > apparent change to the timestamp whenever it has been written after
> > > being actively observed via getattr.
> > > 
> > > For ext4, we only need to enable the FS_MGTIME flag.
> > 
> > Hi Jeff,
> > 
> > This patch causes a gnulib test failure:
> > 
> > $ ~/sources/lfs/grep-3.11/gnulib-tests/test-stat-time
> > test-stat-time.c:141: assertion 'statinfo[0].st_mtime < 
> > statinfo[2].st_mtime || (statinfo[0].st_mtime == statinfo[2].st_mtime && 
> > (get_stat_mtime_ns (&statinfo[0]) < get_stat_mtime_ns (&statinfo[2])))' 
> > failed
> > Aborted (core dumped)
> > 
> > The source code of the test:
> > https://git.savannah.gnu.org/cgit/gnulib.git/tree/tests/test-stat-time.c
> > 
> > Is this an expected change?
> 
> Kind of yes. The test first tries to estimate filesystem timestamp
> granularity in nap() function - due to this patch, the detected granularity
> will likely be 1 ns so effectively all the test calls will happen
> immediately one after another. But we don't bother setting the timestamps
> with more than 1 jiffy (usually 4 ms) precision unless we think someone is
> watching. So as a result timestamps of all stamp1 and stamp2 files are
> going to be equal which makes the test fail.
> 

That was my take too. The multigrain ctime changes are probably causing
nap() to settle on too small a time delta.

> The ultimate problem is that a sequence like:
> 
> write(f1)
> stat(f2)
> write(f2)
> stat(f2)
> write(f1)
> stat(f1)
>
> can result in f1's timestamp being (slightly) lower than the final f2
> timestamp because the second write to f1 didn't bother updating the
> timestamp. That can indeed be a bit confusing to programs if they compare
> timestamps between two files. Jeff?
> 

Basically yes. When there is no stat() call issued on the file in
between writes, the kernel will use coarse-grained timestamps when
updating it (since no one is watching).
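
The sequence can be replayed with a small userspace model. This is a sketch, not the kernel logic: demo_inode, demo_stat(), demo_write() and demo_replay() are invented names, the "queried" flag is a simplification of the real bookkeeping, and the 4 ms tick assumes HZ=250.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_JIFFY_NS 4000000ULL /* assumed tick length (HZ=250) */

struct demo_inode {
	uint64_t ctime_ns; /* last recorded timestamp */
	bool queried;      /* set by stat, cleared by the next write */
};

/* Coarse clock: the fine time rounded down to the last tick. */
static uint64_t demo_coarse(uint64_t now_ns)
{
	return now_ns - now_ns % DEMO_JIFFY_NS;
}

/* stat(): report the timestamp and remember that someone looked. */
static uint64_t demo_stat(struct demo_inode *inode)
{
	inode->queried = true;
	return inode->ctime_ns;
}

/* write(): use a fine-grained stamp only if the inode was observed
 * since its last update; never move the stored time backwards. */
static void demo_write(struct demo_inode *inode, uint64_t now_ns)
{
	uint64_t stamp = inode->queried ? now_ns : demo_coarse(now_ns);

	if (stamp > inode->ctime_ns)
		inode->ctime_ns = stamp;
	inode->queried = false;
}

/* Replay the sequence inside a single tick. Returns true if f1 ends up
 * with an earlier timestamp than f2 despite being written last. */
static bool demo_replay(uint64_t *f1_out, uint64_t *f2_out)
{
	struct demo_inode f1 = { 0, false }, f2 = { 0, false };
	uint64_t t = 10 * DEMO_JIFFY_NS;

	demo_write(&f1, t + 100); /* unobserved: coarse stamp */
	demo_stat(&f2);
	demo_write(&f2, t + 200); /* observed: fine stamp */
	demo_stat(&f2);
	demo_write(&f1, t + 300); /* still unobserved: coarse again */
	*f1_out = demo_stat(&f1);
	*f2_out = f2.ctime_ns;
	return *f1_out < *f2_out;
}
```

In this model f1 ends at the tick boundary while f2 keeps its fine-grained value 200 ns later, even though f1 was the last file written.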


I'm not sure what we can do for this test. The nap() function is making
an assumption that the timestamp granularity will be constant, and that
isn't necessarily the case now.

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH 6/7] dlm: use FL_SLEEP to determine blocking vs non-blocking

2023-08-30 Thread Jeff Layton
On Wed, 2023-08-30 at 08:38 -0400, Alexander Aring wrote:
> Hi,
> 
> On Fri, Aug 25, 2023 at 2:18 PM Jeff Layton  wrote:
> > 
> > On Wed, 2023-08-23 at 17:33 -0400, Alexander Aring wrote:
> > > This patch uses the FL_SLEEP flag in struct file_lock to determine if
> > > > the lock request is a blocking or non-blocking request. Previously dlm
> > > > used IS_SETLKW(), which is not usable for lock requests
> > > coming from lockd when EXPORT_OP_SAFE_ASYNC_LOCK inside the export flags
> > > is set.
> > > 
> > > Signed-off-by: Alexander Aring 
> > > ---
> > >  fs/dlm/plock.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > index 0094fa4004cc..0c6ed5eeb840 100644
> > > --- a/fs/dlm/plock.c
> > > +++ b/fs/dlm/plock.c
> > > @@ -140,7 +140,7 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> > > number, struct file *file,
> > >   op->info.optype = DLM_PLOCK_OP_LOCK;
> > >   op->info.pid= fl->fl_pid;
> > >   op->info.ex = (fl->fl_type == F_WRLCK);
> > > - op->info.wait   = IS_SETLKW(cmd);
> > > + op->info.wait   = !!(fl->fl_flags & FL_SLEEP);
> > >   op->info.fsid   = ls->ls_global_id;
> > >   op->info.number = number;
> > >   op->info.start  = fl->fl_start;
> > 
> > Not sure you really need the !!, but ok...
> > 
> 
> The wait is a byte value and FL_SLEEP doesn't fit into it; I already
> ran into problems with it. I don't think anybody does an if (foo->wait
> == 1), but it should be set to 1 or 0.
> 

AIUI, any halfway decent compiler should take the result of the &, and
implicitly cast that properly to bool. Basically, any value other than 0
should be true.

If the compiler just blindly casts the lowest byte though, then you do
need the double-negative.
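
For what it's worth, in standard C the outcome depends on the destination type rather than on the compiler: conversion to bool yields 1 for any nonzero value, while conversion to an 8-bit unsigned field keeps only the low byte. A minimal sketch, using a hypothetical flag deliberately placed above bit 7 (DEMO_FLAG_SLEEP is not the real FL_SLEEP value):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag placed above bit 7 on purpose; not the real FL_SLEEP. */
#define DEMO_FLAG_SLEEP 0x100u

/* Narrowing to a byte keeps only the low 8 bits, so the flag is lost. */
static uint8_t wait_truncated(unsigned int flags)
{
	return (uint8_t)(flags & DEMO_FLAG_SLEEP);
}

/* Conversion to bool yields true for any nonzero value. */
static bool wait_as_bool(unsigned int flags)
{
	return flags & DEMO_FLAG_SLEEP;
}

/* !! normalizes to 0 or 1 before the narrowing store. */
static uint8_t wait_normalized(unsigned int flags)
{
	return !!(flags & DEMO_FLAG_SLEEP);
}
```

So the !! is redundant when the field is a bool, and it does matter when the field is a plain byte, as the wait field is said to be here.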

> An alternative would be: ((fl->fl_flags & FL_SLEEP) == FL_SLEEP). I am
> not sure what the coding style says here. I think it's more important
> what the C standard says about !!(condition), but there are other
> users of this in the Linux kernel. :-/

I don't care too much either way, but my understanding was that you
don't need to do the !! trick in most cases with modern compilers.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v2 08/92] fs: new helper: simple_rename_timestamp

2023-08-29 Thread Jeff Layton
On Wed, 2023-08-30 at 01:19 +0100, Al Viro wrote:
> On Wed, Jul 05, 2023 at 02:58:11PM -0400, Jeff Layton wrote:
> 
> > + * POSIX mandates that the old and new parent directories have their ctime 
> > and
> > + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), 
> > have
> > + * their ctime updated.
> 
> APPLICATION USAGE
> Some implementations mark for update the last file status change timestamp
> of renamed files and some do not. Applications which make use of the
> last file status change timestamp may behave differently with respect
> to renamed files unless they are designed to allow for either behavior.
>
> So for children POSIX permits rather than mandates.  Doesn't really matter;
> Linux behaviour had been to touch ctime on children since way back, if
> not since the very beginning.

Mea culpa. You're quite correct. I'll plan to roll a small patch to
update the comment over this function.

Thanks!
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v6 1/7] fs: pass the request_mask to generic_fillattr

2023-08-29 Thread Jeff Layton
On Wed, 2023-08-30 at 01:02 +0100, Al Viro wrote:
> On Tue, Aug 29, 2023 at 06:58:47PM -0400, Jeff Layton wrote:
> > On Tue, 2023-08-29 at 23:44 +0100, Al Viro wrote:
> > > On Tue, Jul 25, 2023 at 10:58:14AM -0400, Jeff Layton wrote:
> > > > generic_fillattr just fills in the entire stat struct indiscriminately
> > > > today, copying data from the inode. There is at least one attribute
> > > > (STATX_CHANGE_COOKIE) that can have side effects when it is reported,
> > > > and we're looking at adding more with the addition of multigrain
> > > > timestamps.
> > > > 
> > > > Add a request_mask argument to generic_fillattr and have most callers
> > > > just pass in the value that is passed to getattr. Have other callers
> > > > (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
> > > > STATX_CHANGE_COOKIE into generic_fillattr.
> > > 
> > > Out of curiosity - how much PITA would it be to put request_mask into
> > > kstat?  Set it in vfs_getattr_nosec() (and those get_file_..._info()
> > > on smbd side) and don't bother with that kind of propagation boilerplate
> > > - just have generic_fillattr() pick it there...
> > > 
> > > Reduces the patchset size quite a bit...
> > 
> > It could be done. To do that right, I think we'd want to drop
> > request_mask from the ->getattr prototype as well and just have
> > everything use the mask in the kstat.
> > 
> > I don't think it'd reduce the size of the patchset in any meaningful
> > way, but it might make for a more sensible API over the long haul.
> 
> ->getattr() prototype change would be decoupled from that - for your
> patchset you'd only need the field addition + setting in vfs_getattr_nosec()
> (and possibly in ksmbd), with the remainders of both series being
> independent from each other.
> 
> What I suggest is
> 
> branchpoint -> field addition (trivial commit) -> argument removal
>   |
>   V
> your series, starting with "use stat->request_mask in generic_fillattr()"
> 
> Total size would be about the same, but it would be easier to follow
> the less trivial part of that.  Nothing in your branch downstream of
> that touches any ->getattr() instances, so it should have no
> conflicts with the argument removal side of things.

The only problem with this plan is that Linus has already merged this.
I've no issue with adding the request_mask to the kstat and removing it
as a separate parameter elsewhere, but I think we'll need to do it on
top of what's already been merged.
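
As a rough illustration of the shape being discussed, here is a userspace sketch in which the request mask lives in the stat structure and the filler reads it from there. It is not the kernel API: the demo_* types and functions are invented, and the DEMO_STATX_* values are only illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask bits modelled on STATX_*; values are illustrative. */
#define DEMO_STATX_BASIC_STATS   0x000007ffu
#define DEMO_STATX_CHANGE_COOKIE 0x40000000u

struct demo_inode {
	uint64_t size;
	uint64_t version;
};

struct demo_kstat {
	uint32_t request_mask; /* what the caller asked for (the proposed field) */
	uint32_t result_mask;  /* what was actually filled in */
	uint64_t size;
	uint64_t change_cookie;
};

/* The filler takes no mask argument; it reads the one stored in the kstat. */
static void demo_fillattr(const struct demo_inode *inode, struct demo_kstat *stat)
{
	stat->size = inode->size;
	stat->result_mask |= DEMO_STATX_BASIC_STATS;

	/* Attributes with side effects are reported only on request. */
	if (stat->request_mask & DEMO_STATX_CHANGE_COOKIE) {
		stat->change_cookie = inode->version;
		stat->result_mask |= DEMO_STATX_CHANGE_COOKIE;
	}
}

/* Entry point: record the mask once so nothing downstream has to pass it on. */
static void demo_getattr(const struct demo_inode *inode, struct demo_kstat *stat,
			 uint32_t request_mask)
{
	assert(inode && stat);
	*stat = (struct demo_kstat){ .request_mask = request_mask };
	demo_fillattr(inode, stat);
}
```

The trade-off is the one raised in the thread: the mask is set in exactly one place, at the price of every filler trusting that the entry point initialized it.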
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v6 1/7] fs: pass the request_mask to generic_fillattr

2023-08-29 Thread Jeff Layton
On Tue, 2023-08-29 at 23:44 +0100, Al Viro wrote:
> On Tue, Jul 25, 2023 at 10:58:14AM -0400, Jeff Layton wrote:
> > generic_fillattr just fills in the entire stat struct indiscriminately
> > today, copying data from the inode. There is at least one attribute
> > (STATX_CHANGE_COOKIE) that can have side effects when it is reported,
> > and we're looking at adding more with the addition of multigrain
> > timestamps.
> > 
> > Add a request_mask argument to generic_fillattr and have most callers
> > just pass in the value that is passed to getattr. Have other callers
> > (e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
> > STATX_CHANGE_COOKIE into generic_fillattr.
> 
> Out of curiosity - how much PITA would it be to put request_mask into
> kstat?  Set it in vfs_getattr_nosec() (and those get_file_..._info()
> on smbd side) and don't bother with that kind of propagation boilerplate
> - just have generic_fillattr() pick it there...
> 
> Reduces the patchset size quite a bit...

It could be done. To do that right, I think we'd want to drop
request_mask from the ->getattr prototype as well and just have
everything use the mask in the kstat.

I don't think it'd reduce the size of the patchset in any meaningful
way, but it might make for a more sensible API over the long haul.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH 6/7] dlm: use FL_SLEEP to determine blocking vs non-blocking

2023-08-25 Thread Jeff Layton
On Wed, 2023-08-23 at 17:33 -0400, Alexander Aring wrote:
> This patch uses the FL_SLEEP flag in struct file_lock to determine
> whether the lock request is blocking or non-blocking. Previously dlm
> used IS_SETLKW(), which is not usable for lock requests coming from
> lockd when EXPORT_OP_SAFE_ASYNC_LOCK is set in the export flags.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/dlm/plock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 0094fa4004cc..0c6ed5eeb840 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -140,7 +140,7 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   op->info.optype = DLM_PLOCK_OP_LOCK;
>   op->info.pid= fl->fl_pid;
>   op->info.ex = (fl->fl_type == F_WRLCK);
> - op->info.wait   = IS_SETLKW(cmd);
> + op->info.wait   = !!(fl->fl_flags & FL_SLEEP);
>   op->info.fsid   = ls->ls_global_id;
>   op->info.number = number;
>   op->info.start  = fl->fl_start;

Not sure you really need the !!, but ok...

Reviewed-by: Jeff Layton 
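
As a small illustration of what the !! changes: whether it matters depends on the width of the destination field and on the flag's bit position, neither of which is visible in the hunk. The sketch below assumes a one-byte destination and takes the flag bit as a parameter; the function names are made up for the example.

```c
#include <stdint.h>

/* Without !!: the masked value is stored as-is and truncated to 8 bits. */
static uint8_t wait_raw(unsigned int flags, unsigned int bit)
{
	return (uint8_t)(flags & bit);
}

/* With !!: the result is normalized to exactly 0 or 1 before the store. */
static uint8_t wait_normalized(unsigned int flags, unsigned int bit)
{
	return (uint8_t)!!(flags & bit);
}
```

For a bit that fits in the byte, wait_raw() yields a nonzero value that still tests as true, so the !! only changes the stored value from the bit value to 1. For a bit above bit 7, wait_raw() truncates to 0 while wait_normalized() still returns 1.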



Re: [Cluster-devel] [PATCH 4/7] lockd: add doc to enable EXPORT_OP_SAFE_ASYNC_LOCK

2023-08-25 Thread Jeff Layton
On Wed, 2023-08-23 at 17:33 -0400, Alexander Aring wrote:
> This patch adds a note to enable EXPORT_OP_SAFE_ASYNC_LOCK for
> asynchronous lock request handling.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/locks.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index df8b26a42524..edee02d1ca93 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2255,11 +2255,13 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, 
> struct flock *flock)
>   * To avoid blocking kernel daemons, such as lockd, that need to acquire 
> POSIX
>   * locks, the ->lock() interface may return asynchronously, before the lock 
> has
>   * been granted or denied by the underlying filesystem, if (and only if)
> - * lm_grant is set. Callers expecting ->lock() to return asynchronously
> - * will only use F_SETLK, not F_SETLKW; they will set FL_SLEEP if (and only 
> if)
> - * the request is for a blocking lock. When ->lock() does return 
> asynchronously,
> - * it must return FILE_LOCK_DEFERRED, and call ->lm_grant() when the lock
> - * request completes.
> + * lm_grant is set. Additionally EXPORT_OP_SAFE_ASYNC_LOCK in 
> export_operations
> + * flags need to be set.
> + *
> + * Callers expecting ->lock() to return asynchronously will only use F_SETLK,
> + * not F_SETLKW; they will set FL_SLEEP if (and only if) the request is for a
> + * blocking lock. When ->lock() does return asynchronously, it must return
> + * FILE_LOCK_DEFERRED, and call ->lm_grant() when the lock request completes.
>   * If the request is for non-blocking lock the file system should return
>   * FILE_LOCK_DEFERRED then try to get the lock and call the callback routine
>   * with the result. If the request timed out the callback routine will 
> return a

Reviewed-by: Jeff Layton 



Re: [Cluster-devel] [PATCH 3/7] lockd: fix race in async lock request handling

2023-08-25 Thread Jeff Layton
On Wed, 2023-08-23 at 17:33 -0400, Alexander Aring wrote:
> This patch fixes a race in async lock request handling. The relevant
> struct nlm_block is added to the nlm_blocked list only after the request
> has been sent by vfs_lock_file(), while nlmsvc_grant_deferred() looks up
> the nlm_block in that list. The async request may therefore complete
> before the nlm_block has been added to the list, which ends in -ENOENT
> and a kernel log message of "lockd: grant for unknown block".
> 
> To solve this, we add the nlm_block before the vfs_lock_file() call so
> that it is already on the list when nlmsvc_grant_deferred() may be
> called. If vfs_lock_file() returns a result for which the block would
> not have been added to the nlm_blocked list, the nlm_block is removed
> from the list again.
> 
> The new B_PENDING_CALLBACK nlm_block flag handles an async lock request
> on an already pending lock request as a retry at the caller level, so
> that it hits the -EAGAIN case.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/lockd/svclock.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index aa4174fbaf5b..3b158446203b 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -546,6 +546,9 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
>   ret = nlm_lck_blocked;
>   goto out;
>   }
> +
> + /* Append to list of blocked */
> + nlmsvc_insert_block_locked(block, NLM_NEVER);
>   spin_unlock(&nlm_blocked_lock);
>  
>   if (!wait)
> @@ -557,9 +560,12 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file 
> *file,
>   dprintk("lockd: vfs_lock_file returned %d\n", error);
>   switch (error) {
>   case 0:
> + nlmsvc_remove_block(block);
>   ret = nlm_granted;
>   goto out;
>   case -EAGAIN:
> + if (!wait)
> + nlmsvc_remove_block(block);
>   ret = async_block ? nlm_lck_blocked : nlm_lck_denied;
>   goto out;
>   case FILE_LOCK_DEFERRED:
> @@ -570,17 +576,16 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file 
> *file,
>   ret = nlmsvc_defer_lock_rqst(rqstp, block);
>   goto out;
>   case -EDEADLK:
> + nlmsvc_remove_block(block);
>   ret = nlm_deadlock;
>   goto out;
>   default:/* includes ENOLCK */
> + nlmsvc_remove_block(block);
>   ret = nlm_lck_denied_nolocks;
>   goto out;
>   }
>  
>   ret = nlm_lck_blocked;
> -
> - /* Append to list of blocked */
> - nlmsvc_insert_block(block, NLM_NEVER);
>  out:
>   mutex_unlock(&file->f_mutex);
>   nlmsvc_release_block(block);

Reviewed-by: Jeff Layton 



Re: [Cluster-devel] [PATCH 1/7] lockd: introduce safe async lock op

2023-08-25 Thread Jeff Layton
On Wed, 2023-08-23 at 17:33 -0400, Alexander Aring wrote:
> This patch mostly reverts commit 40595cdc93ed ("nfs: block notification
> on fs with its own ->lock") and introduces an EXPORT_OP_SAFE_ASYNC_LOCK
> export flag to signal that the "own ->lock" implementation supports
> async lock requests. The only main user is DLM, which is used by the
> GFS2 and OCFS2 filesystems. Those implement their own ->lock() and
> return FILE_LOCK_DEFERRED. The DLM implementation has not been updated
> since commit 40595cdc93ed ("nfs: block notification on fs with its own
> ->lock"). This patch prepares for DLM to set the
> EXPORT_OP_SAFE_ASYNC_LOCK export flag and to update the DLM plock
> implementation accordingly.
> 
> Acked-by: Jeff Layton 
> Signed-off-by: Alexander Aring 
> ---
>  fs/lockd/svclock.c   |  5 ++---
>  fs/nfsd/nfs4state.c  | 13 ++---
>  include/linux/exportfs.h |  8 
>  3 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index c43ccdf28ed9..6e3b230e8317 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -470,9 +470,7 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
>   struct nlm_host *host, struct nlm_lock *lock, int wait,
>   struct nlm_cookie *cookie, int reclaim)
>  {
> -#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>   struct inode*inode = nlmsvc_file_inode(file);
> -#endif
>   struct nlm_block*block = NULL;
>   int error;
>   int mode;
> @@ -486,7 +484,8 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
>   (long long)lock->fl.fl_end,
>   wait);
>  
> - if (nlmsvc_file_file(file)->f_op->lock) {
> + if (!export_op_support_safe_async_lock(inode->i_sb->s_export_op,
> +nlmsvc_file_file(file)->f_op)) {
>   async_block = wait;
>   wait = 0;
>   }
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 3aefbad4cc09..14ca06424ff1 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -7430,6 +7430,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   struct nfsd4_blocked_lock *nbl = NULL;
>   struct file_lock *file_lock = NULL;
>   struct file_lock *conflock = NULL;
> + struct super_block *sb;
>   __be32 status = 0;
>   int lkflg;
>   int err;
> @@ -7451,6 +7452,7 @@ nfsd4_lock(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   dprintk("NFSD: nfsd4_lock: permission denied!\n");
>   return status;
>   }
> + sb = cstate->current_fh.fh_dentry->d_sb;
>  
>   if (lock->lk_is_new) {
>   if (nfsd4_has_session(cstate))
> @@ -7502,7 +7504,9 @@ nfsd4_lock(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   fp = lock_stp->st_stid.sc_file;
>   switch (lock->lk_type) {
>   case NFS4_READW_LT:
> - if (nfsd4_has_session(cstate))
> + if (nfsd4_has_session(cstate) ||
> + export_op_support_safe_async_lock(sb->s_export_op,
> +   
> nf->nf_file->f_op))
>   fl_flags |= FL_SLEEP;
>   fallthrough;
>   case NFS4_READ_LT:
> @@ -7514,7 +7518,9 @@ nfsd4_lock(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>   fl_type = F_RDLCK;
>   break;
>   case NFS4_WRITEW_LT:
> - if (nfsd4_has_session(cstate))
> + if (nfsd4_has_session(cstate) ||
> + export_op_support_safe_async_lock(sb->s_export_op,
> +   
> nf->nf_file->f_op))
>   fl_flags |= FL_SLEEP;
>   fallthrough;
>   case NFS4_WRITE_LT:
> @@ -7542,7 +7548,8 @@ nfsd4_lock(struct svc_rqst *rqstp, struct 
> nfsd4_compound_state *cstate,
>* for file locks), so don't attempt blocking lock notifications
>* on those filesystems:
>*/
> - if (nf->nf_file->f_op->lock)
> + if (!export_op_support_safe_async_lock(sb->s_export_op,
> +nf->nf_file->f_op))
>   fl_flags &= ~FL_SLEEP;
>  
>   nbl = find_or_allocate_block(lock_so

Re: [Cluster-devel] [PATCH 2/7] lockd: don't call vfs_lock_file() for pending requests

2023-08-25 Thread Jeff Layton
On Wed, 2023-08-23 at 17:33 -0400, Alexander Aring wrote:
> This patch returns nlm_lck_blocked in nlmsvc_lock() when an asynchronous
> lock request is already pending. During testing I ran into a case where
> lockd waits for only one lm_grant() callback because the nlm_block is
> already part of the nlm_blocked list. If another asynchronous request
> for the same nlm_block is triggered, two lm_grant() callbacks will occur
> but lockd is only waiting for one.
> 
> To avoid changing behaviour for existing users, this handling is only
> applied when export_op_support_safe_async_lock() returns true.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/lockd/svclock.c | 24 +---
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index 6e3b230e8317..aa4174fbaf5b 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -531,6 +531,23 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file 
> *file,
>   goto out;
>   }
>  
> + spin_lock(&nlm_blocked_lock);
> + /*
> +  * If this is a lock request for an already pending
> +  * lock request we return nlm_lck_blocked without calling
> +  * vfs_lock_file() again. Otherwise we have two pending
> +  * requests on the underlaying ->lock() implementation but
> +  * only one nlm_block to being granted by lm_grant().
> +  */
> + if (export_op_support_safe_async_lock(inode->i_sb->s_export_op,
> +   nlmsvc_file_file(file)->f_op) &&
> + !list_empty(&block->b_list)) {
> + spin_unlock(&nlm_blocked_lock);
> + ret = nlm_lck_blocked;
> + goto out;
> + }

Looks reasonable. The block->b_list check is subtle, but the comment
helps.

> + spin_unlock(&nlm_blocked_lock);
> +
>   if (!wait)
>   lock->fl.fl_flags &= ~FL_SLEEP;
>   mode = lock_to_openmode(&lock->fl);
> @@ -543,13 +560,6 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file 
> *file,
>   ret = nlm_granted;
>   goto out;
>   case -EAGAIN:
> - /*
> -  * If this is a blocking request for an
> -  * already pending lock request then we need
> -  * to put it back on lockd's block list
> -  */
> - if (wait)
> - break;
>   ret = async_block ? nlm_lck_blocked : nlm_lck_denied;
>   goto out;
>   case FILE_LOCK_DEFERRED:


Reviewed-by: Jeff Layton 
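
To spell out why the b_list check is subtle: testing list_empty() on a node (rather than on a list head) asks "is this node currently linked anywhere?". That only gives a reliable answer if the node was initialised to point at itself and every removal re-initialises it. The sketch below is a simplified userspace model with hypothetical demo_* names, loosely patterned on the kernel's list_head; it is not lockd code.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list node. */
struct demo_list {
	struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *n)
{
	n->next = n;
	n->prev = n;
}

/* A node that points at itself is not linked into any list. */
static bool demo_list_empty(const struct demo_list *n)
{
	return n->next == n;
}

static void demo_list_add_tail(struct demo_list *n, struct demo_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink and re-initialise, so the node reads as "not queued" again. */
static void demo_list_del_init(struct demo_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	demo_list_init(n);
}

struct demo_block {
	struct demo_list b_list;
};

/* True while the block sits on some list, i.e. a request is pending. */
static bool demo_block_is_pending(const struct demo_block *b)
{
	return !demo_list_empty(&b->b_list);
}
```

If removal used a plain unlink that left stale pointers in the node, demo_block_is_pending() would keep returning true after the block had left the list, which is why the check depends on the init/del_init discipline.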



Re: [Cluster-devel] [RFCv2 6/7] dlm: use FL_SLEEP to check if blocking request

2023-08-17 Thread Jeff Layton
On Wed, 2023-08-16 at 21:19 -0400, Alexander Aring wrote:
> Hi,
> 
> On Wed, Aug 16, 2023 at 9:07 AM Jeff Layton  wrote:
> > 
> > On Mon, 2023-08-14 at 17:11 -0400, Alexander Aring wrote:
> > > This patch uses the FL_SLEEP flag in struct file_lock to check whether
> > > a request is blocking when it comes from the nfs lockd process, which
> > > is indicated by lm_grant() being set.
> > > 
> > > If FL_SLEEP is set, an asynchronous blocking request is being made and
> > > the caller waits for the lm_grant() callback to signal that the lock was
> > > granted. If it's not set, a synchronous non-blocking request is being made.
> > > 
> > > Signed-off-by: Alexander Aring 
> > > ---
> > >  fs/dlm/plock.c | 38 ++
> > >  1 file changed, 22 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> > > index 0094fa4004cc..524771002a2f 100644
> > > --- a/fs/dlm/plock.c
> > > +++ b/fs/dlm/plock.c
> > > @@ -140,7 +140,6 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> > > number, struct file *file,
> > >   op->info.optype = DLM_PLOCK_OP_LOCK;
> > >   op->info.pid= fl->fl_pid;
> > >   op->info.ex = (fl->fl_type == F_WRLCK);
> > > - op->info.wait   = IS_SETLKW(cmd);
> > >   op->info.fsid   = ls->ls_global_id;
> > >   op->info.number = number;
> > >   op->info.start  = fl->fl_start;
> > > @@ -148,24 +147,31 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> > > number, struct file *file,
> > >   op->info.owner = (__u64)(long)fl->fl_owner;
> > >   /* async handling */
> > >   if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
> > > - op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
> > > - if (!op_data) {
> > > - dlm_release_plock_op(op);
> > > - rv = -ENOMEM;
> > > - goto out;
> > > - }
> > > + if (fl->fl_flags & FL_SLEEP) {
> > > + op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
> > > + if (!op_data) {
> > > + dlm_release_plock_op(op);
> > > + rv = -ENOMEM;
> > > + goto out;
> > > + }
> > > 
> > > - op_data->callback = fl->fl_lmops->lm_grant;
> > > - locks_init_lock(&op_data->flc);
> > > - locks_copy_lock(&op_data->flc, fl);
> > > - op_data->fl = fl;
> > > - op_data->file   = file;
> > > + op->info.wait = 1;
> > > + op_data->callback = fl->fl_lmops->lm_grant;
> > > + locks_init_lock(&op_data->flc);
> > > + locks_copy_lock(&op_data->flc, fl);
> > > + op_data->fl = fl;
> > > + op_data->file   = file;
> > > 
> > > - op->data = op_data;
> > > + op->data = op_data;
> > > 
> > > - send_op(op);
> > > - rv = FILE_LOCK_DEFERRED;
> > > - goto out;
> > > + send_op(op);
> > > + rv = FILE_LOCK_DEFERRED;
> > > + goto out;
> > 
> > A question...we're returning FILE_LOCK_DEFERRED after the DLM request is
> > sent. If it ends up being blocked, what happens? Does it do a lm_grant
> > downcall with -EAGAIN or something as the result?
> > 
> 
> no, when info->wait is set then it is a blocked lock request, which
> means lm_grant() will be called when the lock request is granted.
> 

Ok, that's probably problematic with the current code too. lockd will
time out the block after 7s, so if the lock isn't granted in that time
it'll give up on it.

> > 
> > > + } else {
> > > + op->info.wait = 0;
> > > + }
> > > + } else {
> > > + op->info.wait = IS_SETLKW(cmd);
> > >   }
> > > 
> > >   send_op(op);
> > 
> > Looks reasonable overall.
> > 
> > Now that I look, we have quite a number of places in the kernel that
> > seem to check for F_SETLKW, when what they really want is to check
> > FL_SLEEP.
> 
> Yes, as far as I understand, FL_SLEEP stands in for F_SETLKW when you
> only ever get F_SETLK, which is the case when fl->fl_lmops &&
> fl->fl_lmops->lm_grant is true. It is confusing but this is how it
> works... if it's not set we will get F_SETLKW, and this should imply
> that FL_SLEEP is set.
> 
> 

Yeah. Might be good to consider how to make this more consistent across
the kernel.
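
A toy model of the inconsistency, based only on the behaviour described in this thread: callers with lm_grant set send F_SETLK and signal blocking via FL_SLEEP, while an ordinary F_SETLKW request is expected to carry FL_SLEEP as well. The DEMO_* values and demo_* names are hypothetical stand-ins, not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the fcntl command and the flag bit. */
enum demo_cmd { DEMO_F_SETLK, DEMO_F_SETLKW };
#define DEMO_FL_SLEEP 0x80u

struct demo_lock_req {
	enum demo_cmd cmd;
	unsigned int fl_flags;
};

/* Command-based check: misses blocking requests sent as F_SETLK. */
static bool demo_blocking_by_cmd(const struct demo_lock_req *r)
{
	return r->cmd == DEMO_F_SETLKW;
}

/* Flag-based check: covers both kinds of caller. */
static bool demo_blocking_by_flag(const struct demo_lock_req *r)
{
	return (r->fl_flags & DEMO_FL_SLEEP) != 0;
}
```

The two checks agree for an F_SETLKW request with FL_SLEEP set and for a plain F_SETLK trylock. They disagree only for the lockd-style combination of F_SETLK plus FL_SLEEP, where the command-based check reports a non-blocking request.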
-- 
Jeff Layton 



Re: [Cluster-devel] [RFCv2 6/7] dlm: use FL_SLEEP to check if blocking request

2023-08-16 Thread Jeff Layton
On Mon, 2023-08-14 at 17:11 -0400, Alexander Aring wrote:
> This patch uses the FL_SLEEP flag in struct file_lock to check whether
> a request is blocking when it comes from the nfs lockd process, which
> is indicated by lm_grant() being set.
> 
> If FL_SLEEP is set, an asynchronous blocking request is being made and
> the caller waits for the lm_grant() callback to signal that the lock was
> granted. If it's not set, a synchronous non-blocking request is being made.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/dlm/plock.c | 38 ++
>  1 file changed, 22 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 0094fa4004cc..524771002a2f 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -140,7 +140,6 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   op->info.optype = DLM_PLOCK_OP_LOCK;
>   op->info.pid= fl->fl_pid;
>   op->info.ex = (fl->fl_type == F_WRLCK);
> - op->info.wait   = IS_SETLKW(cmd);
>   op->info.fsid   = ls->ls_global_id;
>   op->info.number = number;
>   op->info.start  = fl->fl_start;
> @@ -148,24 +147,31 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   op->info.owner = (__u64)(long)fl->fl_owner;
>   /* async handling */
>   if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
> - op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
> - if (!op_data) {
> - dlm_release_plock_op(op);
> - rv = -ENOMEM;
> - goto out;
> - }
> + if (fl->fl_flags & FL_SLEEP) {
> + op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
> + if (!op_data) {
> + dlm_release_plock_op(op);
> + rv = -ENOMEM;
> + goto out;
> + }
>  
> - op_data->callback = fl->fl_lmops->lm_grant;
> - locks_init_lock(&op_data->flc);
> - locks_copy_lock(&op_data->flc, fl);
> - op_data->fl = fl;
> - op_data->file   = file;
> + op->info.wait = 1;
> + op_data->callback = fl->fl_lmops->lm_grant;
> + locks_init_lock(&op_data->flc);
> + locks_copy_lock(&op_data->flc, fl);
> + op_data->fl = fl;
> + op_data->file   = file;
>  
> - op->data = op_data;
> + op->data = op_data;
>  
> - send_op(op);
> - rv = FILE_LOCK_DEFERRED;
> - goto out;
> + send_op(op);
> + rv = FILE_LOCK_DEFERRED;
> + goto out;

A question...we're returning FILE_LOCK_DEFERRED after the DLM request is
sent. If it ends up being blocked, what happens? Does it do a lm_grant
downcall with -EAGAIN or something as the result?


> + } else {
> + op->info.wait = 0;
> + }
> + } else {
> + op->info.wait = IS_SETLKW(cmd);
>   }
>  
>   send_op(op);

Looks reasonable overall.

Now that I look, we have quite a number of places in the kernel that
seem to check for F_SETLKW, when what they really want is to check
FL_SLEEP.
-- 
Jeff Layton 



Re: [Cluster-devel] [RFCv2 5/7] dlm: use fl_owner from lockd

2023-08-16 Thread Jeff Layton
On Mon, 2023-08-14 at 17:11 -0400, Alexander Aring wrote:
> This patch changes the fl_owner value for an nfs lock request so that
> it is no longer the pid of lockd. Instead it is the owner value that
> nfs gives us.
> 
> The current behaviour has a demonstrated problem. One nfsd server was
> set up to export a gfs2 filesystem mount, and two nfs clients mounted
> this export. Those two clients should conflict with each other when
> operating on the same nfs file.
> 
> A small test program was written:
> 
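> #include <fcntl.h>
> #include <stdio.h>
> 
> int main(int argc, const char *argv[])
> {
> 	struct flock fl = {
> 		.l_type = F_WRLCK,
> 		.l_whence = SEEK_SET,
> 		.l_start = 1L,
> 		.l_len = 1L,
> 	};
> 	int fd;
> 
> 	/* open (or create) the file on the nfs mount */
> 	fd = open("filename", O_RDWR | O_CREAT, 0700);
> 	printf("try to lock...\n");
> 	/* blocking write lock on one byte at offset 1 */
> 	fcntl(fd, F_SETLKW, &fl);
> 	printf("locked!\n");
> 	/* hold the lock until a key is pressed */
> 	getc(stdin);
> 
> 	return 0;
> }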
> 
> Run it on both clients at the same time without interrupting it by
> pressing a key. Both clients are able to acquire the lock, which
> shouldn't be the case. The issue here is that the fl_owner value is the
> same, whereas the lock contexts of the two clients should be separate.
> 
> This patch lets lockd define how to deal with lock contexts and choose
> (hopefully) the right fl_owner value. The test was repeated with this
> patch applied and the locks conflict with each other, as they should.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/dlm/plock.c | 18 --
>  1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dlm/plock.c b/fs/dlm/plock.c
> index 00e1d802a81c..0094fa4004cc 100644
> --- a/fs/dlm/plock.c
> +++ b/fs/dlm/plock.c
> @@ -145,6 +145,7 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   op->info.number = number;
>   op->info.start  = fl->fl_start;
>   op->info.end= fl->fl_end;
> + op->info.owner = (__u64)(long)fl->fl_owner;
>   /* async handling */
>   if (fl->fl_lmops && fl->fl_lmops->lm_grant) {
>   op_data = kzalloc(sizeof(*op_data), GFP_NOFS);
> @@ -154,9 +155,6 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   goto out;
>   }
>  
> - /* fl_owner is lockd which doesn't distinguish
> -processes on the nfs client */
> - op->info.owner  = (__u64) fl->fl_pid;
>   op_data->callback = fl->fl_lmops->lm_grant;
>   locks_init_lock(&op_data->flc);
>   locks_copy_lock(&op_data->flc, fl);
> @@ -168,8 +166,6 @@ int dlm_posix_lock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   send_op(op);
>   rv = FILE_LOCK_DEFERRED;
>   goto out;
> - } else {
> - op->info.owner  = (__u64)(long) fl->fl_owner;
>   }
>  
>   send_op(op);
> @@ -326,10 +322,7 @@ int dlm_posix_unlock(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   op->info.number = number;
>   op->info.start  = fl->fl_start;
>   op->info.end= fl->fl_end;
> - if (fl->fl_lmops && fl->fl_lmops->lm_grant)
> - op->info.owner  = (__u64) fl->fl_pid;
> - else
> - op->info.owner  = (__u64)(long) fl->fl_owner;
> + op->info.owner = (__u64)(long)fl->fl_owner;
>  
>   if (fl->fl_flags & FL_CLOSE) {
>   op->info.flags |= DLM_PLOCK_FL_CLOSE;
> @@ -389,7 +382,7 @@ int dlm_posix_cancel(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   info.number = number;
>   info.start = fl->fl_start;
>   info.end = fl->fl_end;
> - info.owner = (__u64)fl->fl_pid;
> + info.owner = (__u64)(long)fl->fl_owner;
>  
>   rv = do_lock_cancel(&info);
>   switch (rv) {
> @@ -450,10 +443,7 @@ int dlm_posix_get(dlm_lockspace_t *lockspace, u64 
> number, struct file *file,
>   op->info.number = number;
>   op->info.start  = fl->fl_start;
>   op->info.end= fl->fl_end;
> - if (fl->fl_lmops && fl->fl_lmops->lm_grant)
> - op->info.owner  = (__u64) fl->fl_pid;
> - else
> - op->info.owner  = (__u64)(long) fl->fl_owner;
> + op->info.owner = (__u64)(long)fl->fl_owner;
>  
>   send_op(op);
>   wait_event(recv_wq, (op->done != 0));

This is the way.

Acked-by: Jeff Layton 
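
For readers following along, a simplified model of why the owner value matters: POSIX-style locks never conflict with other locks held by the same owner, so if requests from two different clients end up carrying the same owner value, the second exclusive lock is granted. The struct and function names below are hypothetical and the conflict rule is a sketch, not the DLM plock implementation.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_plock {
	uint64_t owner;		/* opaque lock-owner identity */
	uint64_t start, end;	/* inclusive byte range */
	bool exclusive;		/* write lock if true, read lock otherwise */
};

static bool demo_ranges_overlap(const struct demo_plock *a,
				const struct demo_plock *b)
{
	return a->start <= b->end && b->start <= a->end;
}

/*
 * Two locks conflict only if they come from different owners, overlap,
 * and at least one of them is exclusive.
 */
static bool demo_locks_conflict(const struct demo_plock *held,
				const struct demo_plock *req)
{
	if (held->owner == req->owner)
		return false;
	if (!demo_ranges_overlap(held, req))
		return false;
	return held->exclusive || req->exclusive;
}
```

With both clients mapped to one owner value, demo_locks_conflict() returns false and both write locks succeed, which matches the symptom in the commit message. With distinct owner values the same two requests conflict.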



Re: [Cluster-devel] [RFCv2 4/7] locks: update lock callback documentation

2023-08-16 Thread Jeff Layton
On Mon, 2023-08-14 at 17:11 -0400, Alexander Aring wrote:
> This patch updates the existing documentation to reflect recent changes
> to vfs_lock_file() when lm_grant() is set. When lm_grant() is set, we
> only handle FILE_LOCK_DEFERRED if FL_SLEEP in fl_flags is set. This is
> the case of a blocking lock request. Non-blocking lock requests, when
> FL_SLEEP is not set, are handled synchronously.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/locks.c | 28 ++--
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index df8b26a42524..a8e51f462b43 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2255,21 +2255,21 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, 
> struct flock *flock)
>   * To avoid blocking kernel daemons, such as lockd, that need to acquire 
> POSIX
>   * locks, the ->lock() interface may return asynchronously, before the lock 
> has
>   * been granted or denied by the underlying filesystem, if (and only if)
> - * lm_grant is set. Callers expecting ->lock() to return asynchronously
> - * will only use F_SETLK, not F_SETLKW; they will set FL_SLEEP if (and only 
> if)
> - * the request is for a blocking lock. When ->lock() does return 
> asynchronously,
> - * it must return FILE_LOCK_DEFERRED, and call ->lm_grant() when the lock
> - * request completes.
> - * If the request is for non-blocking lock the file system should return
> - * FILE_LOCK_DEFERRED then try to get the lock and call the callback routine
> - * with the result. If the request timed out the callback routine will 
> return a
> + * lm_grant and FL_SLEEP in fl_flags is set. Callers expecting ->lock() to 
> return
> + * asynchronously will only use F_SETLK, not F_SETLKW; When ->lock() does 
> return

Isn't the above backward? Shouldn't it say "Callers expecting ->lock()
to return asynchronously will only use F_SETLKW, not F_SETLK" ?

> + * asynchronously, it must return FILE_LOCK_DEFERRED, and call ->lm_grant() 
> when
> + * the lock request completes. The lm_grant() callback must be called in a
> + * sleepable context.
> + *
> + * If the request timed out the ->lm_grant() callback routine will return a
>   * nonzero return code and the file system should release the lock. The file
> - * system is also responsible to keep a corresponding posix lock when it
> - * grants a lock so the VFS can find out which locks are locally held and do
> - * the correct lock cleanup when required.
> - * The underlying filesystem must not drop the kernel lock or call
> - * ->lm_grant() before returning to the caller with a FILE_LOCK_DEFERRED
> - * return code.
> + * system is also responsible to keep a corresponding posix lock when it 
> grants
> + * a lock so the VFS can find out which locks are locally held and do the 
> correct
> + * lock cleanup when required.
> + *
> + * If the request is for non-blocking lock (when F_SETLK and FL_SLEEP in 
> fl_flags is not set)
> + * the file system should return -EAGAIN if failed to acquire or zero if 
> acquiring was
> + * successfully without calling the ->lm_grant() callback routine.
>   */
>  int vfs_lock_file(struct file *filp, unsigned int cmd, struct file_lock *fl, 
> struct file_lock *conf)
>  {

-- 
Jeff Layton 



Re: [Cluster-devel] [RFCv2 3/7] lockd: introduce safe async lock op

2023-08-16 Thread Jeff Layton
>fi_fhandle, nn);
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 11fbd0ee1370..10358a93cdc1 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -3,6 +3,7 @@
>  #define LINUX_EXPORTFS_H 1
>  
>  #include 
> +#include 
>  
>  struct dentry;
>  struct iattr;
> @@ -224,9 +225,16 @@ struct export_operations {
> atomic attribute updates
>   */
>  #define EXPORT_OP_FLUSH_ON_CLOSE (0x20) /* fs flushes file data on close 
> */
> +#define EXPORT_OP_SAFE_ASYNC_LOCK(0x40) /* fs can do async lock request 
> */
>   unsigned long   flags;
>  };
>  
> +static inline bool export_op_support_safe_async_lock(const struct 
> export_operations *export_ops,
> +  const struct 
> file_operations *f_op)
> +{
> + return (export_ops->flags & EXPORT_OP_SAFE_ASYNC_LOCK) || !f_op->lock;
> +}
> +
>  extern int exportfs_encode_inode_fh(struct inode *inode, struct fid *fid,
>   int *max_len, struct inode *parent,
>   int flags);

This seems like a reasonable approach, in principle.

Acked-by: Jeff Layton 
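
The helper's predicate has three interesting cases, shown here as a userspace sketch. The demo_* types are cut-down stand-ins for export_operations and file_operations; only the flag value 0x40 and the boolean expression are taken from the hunk above.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_EXPORT_OP_SAFE_ASYNC_LOCK 0x40ul	/* value from the hunk above */

/* Cut-down stand-ins for the real operation tables. */
struct demo_file_operations {
	int (*lock)(void);
};

struct demo_export_operations {
	unsigned long flags;
};

/* Same expression as export_op_support_safe_async_lock() in the patch. */
static bool demo_support_safe_async_lock(
		const struct demo_export_operations *export_ops,
		const struct demo_file_operations *f_op)
{
	return (export_ops->flags & DEMO_EXPORT_OP_SAFE_ASYNC_LOCK) ||
	       !f_op->lock;
}

/* Placeholder ->lock implementation for the example. */
static int demo_lock(void)
{
	return 0;
}
```

A filesystem without its own ->lock passes regardless of the flag, one with its own ->lock and no flag is excluded, and one with its own ->lock opts in by setting the flag.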



Re: [Cluster-devel] [RFCv2 2/7] lockd: FILE_LOCK_DEFERRED only on FL_SLEEP

2023-08-16 Thread Jeff Layton
On Mon, 2023-08-14 at 17:11 -0400, Alexander Aring wrote:
> This patch removes to handle non-blocking lock requests as asynchronous
> lock request returning FILE_LOCK_DEFERRED. When fl_lmops and lm_grant()
> is set and a non-blocking lock request returns FILE_LOCK_DEFERRED will
> end in an WARNING to signal the user the misusage of the API.
> 

Probably need to rephrase the word salad in the first sentence of the
commit log. I had to go over it a few times to understand what was going
on here.

In any case, I'm guessing that the idea here is that GFS2/DLM shouldn't
ever return FILE_LOCK_DEFERRED if this is a non-wait request (i.e.
someone called F_SETLK instead of F_SETLKW)?

That may be ok, but again, lockd goes to great lengths to avoid blocking
and I think it's generally a good idea. If an upcall to DLM can take a
long time, it might be a good idea to continue to allow a !wait request
to return FILE_LOCK_DEFERRED.

I guess this really depends on the current behavior today though. Does
DLM ever return FILE_LOCK_DEFERRED on a non-blocking lock request?


> The reason for making non-blocking lock requests synchronous calls is
> that we already do this for unlock and cancellation. Those are POSIX
> lock operations which are handled synchronously, waiting for an answer.
> For non-blocking lock requests the answer will probably arrive in about
> the same time as for unlock or cancellation operations, as those are
> trylock operations only.
> 
> A blocking lock request has to be handled asynchronously because the
> time at which the lock request will be granted is unknown.
> 
> Signed-off-by: Alexander Aring 
> ---
>  fs/lockd/svclock.c | 39 +++
>  1 file changed, 7 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index 7d63524bdb81..1e74a578d7de 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -440,31 +440,6 @@ static void nlmsvc_freegrantargs(struct nlm_rqst *call)
>   locks_release_private(&call->a_args.lock.fl);
>  }
>  
> -/*
> - * Deferred lock request handling for non-blocking lock
> - */
> -static __be32
> -nlmsvc_defer_lock_rqst(struct svc_rqst *rqstp, struct nlm_block *block)
> -{
> - __be32 status = nlm_lck_denied_nolocks;
> -
> - block->b_flags |= B_QUEUED;
> -
> - nlmsvc_insert_block(block, NLM_TIMEOUT);
> -
> - block->b_cache_req = &rqstp->rq_chandle;
> - if (rqstp->rq_chandle.defer) {
> - block->b_deferred_req =
> - rqstp->rq_chandle.defer(block->b_cache_req);
> - if (block->b_deferred_req != NULL)
> - status = nlm_drop_reply;
> - }
> - dprintk("lockd: nlmsvc_defer_lock_rqst block %p flags %d status %d\n",
> - block, block->b_flags, ntohl(status));
> -
> - return status;
> -}
> -
>  /*
>   * Attempt to establish a lock, and if it can't be granted, block it
>   * if required.
> @@ -569,14 +544,14 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file 
> *file,
>   ret = async_block ? nlm_lck_blocked : nlm_lck_denied;
>   goto out_cb_mutex;
>   case FILE_LOCK_DEFERRED:
> - block->b_flags |= B_PENDING_CALLBACK;
> + /* lock requests without waiters are handled in
> +  * a non async way. Let assert this to inform
> +  * the user about a API violation.
> +  */
> + WARN_ON_ONCE(!wait);
>  
> - if (wait)
> - break;
> - /* Filesystem lock operation is in progress
> -Add it to the queue waiting for callback */
> - ret = nlmsvc_defer_lock_rqst(rqstp, block);
> - goto out_cb_mutex;
> + block->b_flags |= B_PENDING_CALLBACK;
> + break;
>   case -EDEADLK:
>   nlmsvc_remove_block(block);
>   ret = nlm_deadlock;

-- 
Jeff Layton 



Re: [Cluster-devel] [RFCv2 1/7] lockd: fix race in async lock request handling

2023-08-15 Thread Jeff Layton
On Tue, 2023-08-15 at 13:49 -0400, Jeff Layton wrote:
> On Mon, 2023-08-14 at 17:11 -0400, Alexander Aring wrote:
> > This patch fixes a race in async lock request handling between adding
> > the relevant struct nlm_block to nlm_blocked list after the request was
> > sent by vfs_lock_file() and nlmsvc_grant_deferred() does a lookup of the
> > nlm_block in the nlm_blocked list. It could be that the async request is
> > completed before the nlm_block was added to the list. This would end
> > in a -ENOENT and a kernel log message of "lockd: grant for unknown
> > block".
> > 
> > To solve this issue we add the nlm_block before the vfs_lock_file() call
> > to be sure it has been added when a possible nlmsvc_grant_deferred() is
> > called. If the vfs_lock_file() results in an case when it wouldn't be
> > added to nlm_blocked list, the nlm_block struct will be removed from
> > this list again.
> > 
> > The introducing of the new B_PENDING_CALLBACK nlm_block flag will handle
> > async lock requests on a pending lock requests as a retry on the caller
> > level to hit the -EAGAIN case.
> > 
> > Signed-off-by: Alexander Aring 
> > ---
> >  fs/lockd/svclock.c  | 100 ++--
> >  include/linux/lockd/lockd.h |   2 +
> >  2 files changed, 74 insertions(+), 28 deletions(-)
> > 
> > 

[...]

> > diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> > index f42594a9efe0..91f55458f5fc 100644
> > --- a/include/linux/lockd/lockd.h
> > +++ b/include/linux/lockd/lockd.h
> > @@ -185,10 +185,12 @@ struct nlm_block {
> > struct nlm_file *   b_file; /* file in question */
> > struct cache_req *  b_cache_req;/* deferred request handling */
> > struct cache_deferred_req * b_deferred_req
> > > +   struct mutex b_cb_mutex; /* callback mutex */
> 
> There is no mention at all of this new mutex in the changelog or
> comments. It's not at all clear to me what this is intended to protect.
> In general, with lockd being a single-threaded service, we want to avoid
> sleeping locks. This will need some clear justification.
> 
> At a glance, it looks like you're trying to use this to hold
> B_PENDING_CALLBACK steady while a lock request is being handled. That
> suggests that you're using this mutex to serialize access to a section
> of code and not one or more specific data structures. We usually like to
> avoid that sort of thing, since locks that protect arbitrary sections of
> code become difficult to work with over time.
> 
> I'm going to go out on a limb here though and suggest that there is
> probably a way to solve this problem that doesn't involve adding new
> locks.
> 
> > > unsigned int b_flags; /* block flags */
> > >  #define B_QUEUED   1   /* lock queued */
> > >  #define B_GOT_CALLBACK 2   /* got lock or conflicting lock */
> > >  #define B_TIMED_OUT 4   /* filesystem too slow to respond */
> > +#define B_PENDING_CALLBACK 8   /* pending callback for lock request */
> >  };
> >  
> >  /*
> 
> Do we need this new flag at all? It seems redundant. If we have a block
> on the list, then it is sort of by definition "pending callback". If
> it's not on the list anymore, then it's not. No?
> 

Do we need anything more than a patch along these lines? Note that this
is untested, so RFC:

-8<---

[RFC PATCH] lockd: alternate fix for race between deferred lock and grant

Signed-off-by: Jeff Layton 
---
 fs/lockd/svclock.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index c43ccdf28ed9..e9a84363c26e 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -446,6 +446,8 @@ nlmsvc_defer_lock_rqst(struct svc_rqst *rqstp, struct nlm_block *block)
 
block->b_flags |= B_QUEUED;
 
+   /* FIXME: remove and reinsert w/o dropping spinlock */
+   nlmsvc_remove_block(block);
nlmsvc_insert_block(block, NLM_TIMEOUT);
 
block->b_cache_req = &rqstp->rq_chandle;
@@ -535,6 +537,9 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
if (!wait)
lock->fl.fl_flags &= ~FL_SLEEP;
mode = lock_to_openmode(&lock->fl);
+
+   /* Append to list of blocked */
+   nlmsvc_insert_block(block, NLM_NEVER);
error = vfs_lock_file(file->f_file[mode], F_SETLK, &lock->fl, NULL);
lock->fl.fl_flags &= ~FL_SLEEP;
 
@@ -542,6 +547,7 @@ nlmsvc_lock(struct svc_rqst *rqst

Re: [Cluster-devel] [RFCv2 1/7] lockd: fix race in async lock request handling

2023-08-15 Thread Jeff Layton
fault:/* includes ENOLCK */
> + nlmsvc_remove_block(block);
>   ret = nlm_lck_denied_nolocks;
> - goto out;
> + goto out_cb_mutex;
>   }
>  
>   ret = nlm_lck_blocked;
> -
> - /* Append to list of blocked */
> - nlmsvc_insert_block(block, NLM_NEVER);
> +out_cb_mutex:
> + mutex_unlock(&block->b_cb_mutex);
>  out:
>   mutex_unlock(&file->f_mutex);
>   nlmsvc_release_block(block);
> @@ -728,34 +746,60 @@ nlmsvc_update_deferred_block(struct nlm_block *block, int result)
>   block->b_flags |= B_TIMED_OUT;
>  }
>  
> +static int __nlmsvc_grant_deferred(struct nlm_block *block,
> +struct file_lock *fl,
> +int result)
> +{
> + int rc = 0;
> +
> + dprintk("lockd: nlmsvc_notify_blocked block %p flags %d\n",
> + block, block->b_flags);
> + if (block->b_flags & B_QUEUED) {
> + if (block->b_flags & B_TIMED_OUT) {
> + rc = -ENOLCK;
> + goto out;
> + }
> + nlmsvc_update_deferred_block(block, result);
> + } else if (result == 0)
> + block->b_granted = 1;
> +
> + nlmsvc_insert_block_locked(block, 0);
> + svc_wake_up(block->b_daemon);
> +out:
> + return rc;
> +}
> +
>  static int nlmsvc_grant_deferred(struct file_lock *fl, int result)
>  {
> - struct nlm_block *block;
> - int rc = -ENOENT;
> + struct nlm_block *iter, *block = NULL;
> + int rc;
>  
>   spin_lock(&nlm_blocked_lock);
> - list_for_each_entry(block, &nlm_blocked, b_list) {
> - if (nlm_compare_locks(&block->b_call->a_args.lock.fl, fl)) {
> - dprintk("lockd: nlmsvc_notify_blocked block %p flags %d\n",
> - block, block->b_flags);
> - if (block->b_flags & B_QUEUED) {
> - if (block->b_flags & B_TIMED_OUT) {
> - rc = -ENOLCK;
> - break;
> - }
> - nlmsvc_update_deferred_block(block, result);
> - } else if (result == 0)
> - block->b_granted = 1;
> -
> - nlmsvc_insert_block_locked(block, 0);
> - svc_wake_up(block->b_daemon);
> - rc = 0;
> + list_for_each_entry(iter, &nlm_blocked, b_list) {
> + if (nlm_compare_locks(&iter->b_call->a_args.lock.fl, fl)) {
> + kref_get(&iter->b_count);
> + block = iter;
>   break;
>   }
>   }
>   spin_unlock(&nlm_blocked_lock);
> - if (rc == -ENOENT)
> - printk(KERN_WARNING "lockd: grant for unknown block\n");
> +
> + if (!block) {
> + pr_warn("lockd: grant for unknown pending block\n");
> + return -ENOENT;
> + }
> +
> + /* don't interfere with nlmsvc_lock() */
> + mutex_lock(&block->b_cb_mutex);
> + block->b_flags &= ~B_PENDING_CALLBACK;
> +
> + spin_lock(&nlm_blocked_lock);
> + WARN_ON_ONCE(list_empty(&block->b_list));
> + rc = __nlmsvc_grant_deferred(block, fl, result);
> + spin_unlock(&nlm_blocked_lock);
> + mutex_unlock(&block->b_cb_mutex);
> +
> + nlmsvc_release_block(block);
>   return rc;
>  }
>  
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index f42594a9efe0..91f55458f5fc 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -185,10 +185,12 @@ struct nlm_block {
>   struct nlm_file *   b_file; /* file in question */
>   struct cache_req *  b_cache_req;/* deferred request handling */
>   struct cache_deferred_req * b_deferred_req
> + struct mutex b_cb_mutex; /* callback mutex */

There is no mention at all of this new mutex in the changelog or
comments. It's not at all clear to me what this is intended to protect.
In general, with lockd being a single-threaded service, we want to avoid
sleeping locks. This will need some clear justification.

At a glance, it looks like you're trying to use this to hold
B_PENDING_CALLBACK steady while a lock request is being handled. That
suggests that you're using this mutex to serialize access to a section
of code and not one or more specific data structures. We usually like to
avoid that sort of thing, since locks that protect arbitrary sections of
code become difficult to work with over time.

I'm going to go out on a limb here though and suggest that there is
probably a way to solve this problem that doesn't involve adding new
locks.

>   unsigned int b_flags; /* block flags */
>  #define B_QUEUED 1   /* lock queued */
>  #define B_GOT_CALLBACK   2   /* got lock or conflicting lock */
>  #define B_TIMED_OUT  4   /* filesystem too slow to respond */
> +#define B_PENDING_CALLBACK   8   /* pending callback for lock request */
>  };
>  
>  /*

Do we need this new flag at all? It seems redundant. If we have a block
on the list, then it is sort of by definition "pending callback". If
it's not on the list anymore, then it's not. No?
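
To make the list-membership point concrete, here is a small user-space
model, not lockd code. All of the model_* names are made up for the
example. It assumes the block is always unlinked with a "del_init" style
helper, so that an unlinked node points back at itself and "is it on the
list?" can stand in for a separate pending flag:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly-linked list node (hypothetical, user-space). */
struct model_list {
	struct model_list *prev, *next;
};

/* Stand-in for struct nlm_block; only the list linkage is modelled. */
struct model_block {
	struct model_list b_list;
};

/* An initialised or unlinked node points at itself. */
static void model_list_init(struct model_list *n)
{
	n->prev = n;
	n->next = n;
}

/* Link n in front of head, i.e. at the tail of the list. */
static void model_list_add_tail(struct model_list *n, struct model_list *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Unlink n and reinitialise it so membership stays testable. */
static void model_list_del_init(struct model_list *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	model_list_init(n);
}

/* "Pending callback" is simply "still linked on the blocked list". */
static bool model_block_pending(const struct model_block *b)
{
	return b->b_list.next != &b->b_list;
}
```

In this model there is no extra state to keep in sync: inserting the
block before issuing the request makes it pending, and removing it on
any error path makes it not pending again. Real code would of course
need to do the test under the list's spinlock.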

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-09 Thread Jeff Layton
On Thu, 2023-08-10 at 05:14 +0900, OGAWA Hirofumi wrote:
> Jeff Layton  writes:
> 
> > When you say it "doesn't work the same", what do you mean, specifically?
> > I had to make some allowances for the fact that FAT is substantially
> > different in its timestamp handling, and I tried to preserve existing
> > behavior as best I could.
> 
> Ah, ok. I was misreading some.
> 
> inode_update_timestamps() checks IS_I_VERSION() now, not S_VERSION.  So,
> if adding the check of IS_I_VERSION() and (S_MTIME|S_CTIME|S_VERSION) to
> FAT?
> 
> With it, IS_I_VERSION() would be false on FAT, and I'm fine.
> 
> I.e. something like
> 
>   if ((flags & (S_VERSION|S_CTIME|S_MTIME)) && IS_I_VERSION(inode)
>   && inode_maybe_inc_iversion(inode, false))
>   dirty_flags |= I_DIRTY_SYNC;
> 
> Thanks.

If you do that then the i_version counter would never be incremented.
But...I think I see what you're getting at.

Most filesystems that support the i_version counter have an on-disk
field for it. FAT obviously has no such thing. I suspect the i_version
bits in fat_update_time were added by mistake. FAT doesn't set
SB_I_VERSION so there's no need to do anything to the i_version field at
all.

Also, given that the mtime and ctime are always kept in sync on FAT,
we're probably fine to have it look something like this:

8<--
int fat_update_time(struct inode *inode, int flags) 
{ 
int dirty_flags = 0;

if (inode->i_ino == MSDOS_ROOT_INO) 
return 0;

fat_truncate_time(inode, NULL, flags);
if (inode->i_sb->s_flags & SB_LAZYTIME)
dirty_flags |= I_DIRTY_TIME;
else
dirty_flags |= I_DIRTY_SYNC;

__mark_inode_dirty(inode, dirty_flags);
return 0;
} 
8<--

...and we should probably do that in a separate patch in advance of the
update_time rework, since it's really a different change.

If you're in agreement, then I'll plan to respin the series with this
fixed and resend.

Thanks for being patient!
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-09 Thread Jeff Layton
On Thu, 2023-08-10 at 03:31 +0900, OGAWA Hirofumi wrote:
> Jeff Layton  writes:
> 
> > On Thu, 2023-08-10 at 02:44 +0900, OGAWA Hirofumi wrote:
> > > Jeff Layton  writes:
> > > 
> > That would be wrong. The problem is that we're changing how update_time
> > works:
> > 
> > Previously, update_time was given a timestamp and a set of S_* flags to
> > indicate which fields should be updated. Now, update_time is not given a
> > timestamp. It needs to fetch it itself, but that subtly changes the
> > meaning of the flags field.
> > 
> > It now means "these fields needed to be updated when I last checked".
> > The timestamp and i_version may now be different from when the flags
> > field was set. This means that if any of S_CTIME/S_MTIME/S_VERSION were
> > set that we need to attempt to update all 3 of them. They may now be
> > different from the timestamp or version that we ultimately end up with.
> > 
> > The above may look to you like it would always cause I_DIRTY_SYNC to be
> > set on any ctime or mtime update, but inode_maybe_inc_iversion only
> > returns true if it actually updated i_version, and it only does that if
> > someone issued a ->getattr against the file since the last time it was
> > updated.
> > 
> > So, this shouldn't generate any more DIRTY_SYNC updates than it did
> > before.
> 
> Again, if you claim so, why generic_update_time() doesn't work same? Why
> only FAT does?
> 
> Or I'm misreading generic_update_time() patch?
> 

When you say it "doesn't work the same", what do you mean, specifically?
I had to make some allowances for the fact that FAT is substantially
different in its timestamp handling, and I tried to preserve existing
behavior as best I could.

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 08/13] fs: drop the timespec64 argument from update_time

2023-08-09 Thread Jeff Layton
Yes. It's in Christian's vfs.ctime branch:

https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=vfs.ctime

On Wed, 2023-08-09 at 14:38 -0400, Mike Marshall wrote:
> I've been following this patch on fsdevel... is there a
> remote I could fetch with a branch that has this in it?
> 
> -Mike
> 
> On Wed, Aug 9, 2023 at 8:32 AM Christian Brauner  wrote:
> > 
> > On Mon, Aug 07, 2023 at 03:38:39PM -0400, Jeff Layton wrote:
> > > Now that all of the update_time operations are prepared for it, we can
> > > drop the timespec64 argument from the update_time operation. Do that and
> > > remove it from some associated functions like inode_update_time and
> > > inode_needs_update_time.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/bad_inode.c   |  3 +--
> > >  fs/btrfs/inode.c |  3 +--
> > >  fs/btrfs/volumes.c   |  4 +---
> > >  fs/fat/fat.h |  3 +--
> > >  fs/fat/misc.c|  2 +-
> > >  fs/gfs2/inode.c  |  3 +--
> > >  fs/inode.c   | 30 +-
> > >  fs/overlayfs/inode.c |  2 +-
> > >  fs/overlayfs/overlayfs.h |  2 +-
> > >  fs/ubifs/file.c  |  3 +--
> > >  fs/ubifs/ubifs.h |  2 +-
> > >  fs/xfs/xfs_iops.c|  1 -
> > >  include/linux/fs.h   |  4 ++--
> > 
> > This was missing the conversion of fs/orangefs orangefs_update_time()
> > causing the build to fail. So at some point kbuild will yell here.
> > Fwiw, I've fixed that up in-tree.

Cheers,
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-09 Thread Jeff Layton
On Thu, 2023-08-10 at 02:44 +0900, OGAWA Hirofumi wrote:
> Jeff Layton  writes:
> 
> > On Thu, 2023-08-10 at 00:17 +0900, OGAWA Hirofumi wrote:
> > > Jan Kara  writes:
> 
> [...]
> 
> > My mistake re: lazytime vs. relatime, but Jan is correct that this
> > shouldn't break anything there.
> 
> Actually breaks ("break" means not corrupt fs, means it breaks lazytime
> optimization). It is just not always, but it should be always for some
> userspaces.
> 
> > The logic in the revised generic_update_time is different because FAT is
> > a bit strange. fat_update_time does extra truncation on the timestamp
> > that it is handed beyond what timestamp_truncate() does.
> > fat_truncate_time is called in many different places too, so I don't
> > feel comfortable making big changes to how that works.
> > 
> > In the case of generic_update_time, it calls inode_update_timestamps
> > which returns a mask that shows which timestamps got updated. It then
> > marks the dirty_flags appropriately for what was actually changed.
> > 
> > generic_update_time is used across many filesystems so we need to ensure
> > that it's OK to use even when multigrain timestamps are enabled. Those
> > haven't been enabled in FAT though, so I didn't bother, and left it to
> > dirtying the inode in the same way it was before, even though it now
> > fetches its own timestamps from the clock. Given the way that the mtime
> > and ctime are smooshed together in FAT, that seemed reasonable.
> > 
> > Is there a particular case or flag combination you're concerned about
> > here?
> 
> Yes. Because FAT has strange timestamps that different granularity on
> disk . This is why generic time truncation doesn't work for FAT.
> 
> Well anyway, my concern is the only following part. In
> generic_update_time(), S_[CM]TIME are not the cause of I_DIRTY_SYNC if
> lazytime mode.
> 
> - if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> + if ((flags & (S_VERSION|S_CTIME|S_MTIME)) && inode_maybe_inc_iversion(inode, false))
>   dirty_flags |= I_DIRTY_SYNC;
> 

That would be wrong. The problem is that we're changing how update_time
works:

Previously, update_time was given a timestamp and a set of S_* flags to
indicate which fields should be updated. Now, update_time is not given a
timestamp. It needs to fetch it itself, but that subtly changes the
meaning of the flags field.

It now means "these fields needed to be updated when I last checked".
The timestamp and i_version may now be different from when the flags
field was set. This means that if any of S_CTIME/S_MTIME/S_VERSION were
set, we need to attempt to update all 3 of them. They may now be
different from the timestamp or version that we ultimately end up with.

The above may look to you like it would always cause I_DIRTY_SYNC to be
set on any ctime or mtime update, but inode_maybe_inc_iversion only
returns true if it actually updated i_version, and it only does that if
someone issued a ->getattr against the file since the last time it was
updated.

So, this shouldn't generate any more DIRTY_SYNC updates than it did
before.
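
A simplified user-space model of that "only bump if someone looked"
behaviour, in case it helps. This is a sketch and not the kernel's
implementation: the model_* names are invented for the example, it is
not atomic, and it assumes the low bit of the counter is used as the
"queried" flag with the value itself stored in the upper bits:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_QUERIED   1ULL	/* low bit: value has been observed */
#define MODEL_INCREMENT 2ULL	/* the counter lives above the flag bit */

/* Stand-in for struct inode; only the version counter is modelled. */
struct model_inode {
	uint64_t i_version;
};

/* Reader side (think ->getattr): report the value and mark it as seen. */
static uint64_t model_query_iversion(struct model_inode *inode)
{
	inode->i_version |= MODEL_QUERIED;
	return inode->i_version >> 1;
}

/*
 * Writer side: bump the counter only if forced or if it was queried
 * since the last bump. Returns true when the counter changed, which is
 * the case where the caller would add I_DIRTY_SYNC.
 */
static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	if (!force && !(inode->i_version & MODEL_QUERIED))
		return false;

	/* Clear the flag and advance the counter in one step. */
	inode->i_version = (inode->i_version & ~MODEL_QUERIED) + MODEL_INCREMENT;
	return true;
}
```

With this model, back-to-back updates with no query in between return
false after the first one, so widening the flags test only changes how
often the check runs, not how often the inode is dirtied.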
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-09 Thread Jeff Layton
On Thu, 2023-08-10 at 00:17 +0900, OGAWA Hirofumi wrote:
> Jan Kara  writes:
> 
> > Since you are talking past one another with Jeff let me chime in here :). I
> > think you are worried about this hunk:
> 
> Right.
>
> > -   if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> > +   if ((flags & (S_VERSION|S_CTIME|S_MTIME)) && inode_maybe_inc_iversion(inode, false))
> > dirty_flags |= I_DIRTY_SYNC;
> > 
> > which makes the 'flags' test pass even if we just modified ctime or mtime.
> > But do note the second part of the if - inode_maybe_inc_iversion() - so we
> > are going to mark the inode dirty with I_DIRTY_SYNC only if someone queried
> > iversion since the last time we have incremented it.
> > 
> > So this hunk is not really changing how inode is marked dirty, it only
> > changes how often we check whether iversion needs increment and that should
> > be fine (and desirable). Hence lazytime isn't really broken by this in any
> > way.
> 
> OK. However, then it doesn't explain what I asked. This is not same with
> generic_update_time(), only FAT does.
>
> If thinks it is right thing, why generic_update_time() doesn't? I said
> first reply, this was from generic_update_time(). (Or I'm misreading
> updated generic_update_time()?)
> 

My mistake re: lazytime vs. relatime, but Jan is correct that this
shouldn't break anything there.

The logic in the revised generic_update_time is different because FAT is
a bit strange. fat_update_time does extra truncation on the timestamp
that it is handed beyond what timestamp_truncate() does.
fat_truncate_time is called in many different places too, so I don't
feel comfortable making big changes to how that works.

In the case of generic_update_time, it calls inode_update_timestamps
which returns a mask that shows which timestamps got updated. It then
marks the dirty_flags appropriately for what was actually changed.

generic_update_time is used across many filesystems so we need to ensure
that it's OK to use even when multigrain timestamps are enabled. Those
haven't been enabled in FAT though, so I didn't bother, and left it to
dirtying the inode in the same way it was before, even though it now
fetches its own timestamps from the clock. Given the way that the mtime
and ctime are smooshed together in FAT, that seemed reasonable.

Is there a particular case or flag combination you're concerned about
here?
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-09 Thread Jeff Layton
On Wed, 2023-08-09 at 22:36 +0900, OGAWA Hirofumi wrote:
> Jeff Layton  writes:
> 
> > On Wed, 2023-08-09 at 17:37 +0900, OGAWA Hirofumi wrote:
> > > Jeff Layton  writes:
> > > 
> > > > Also, it may be that things have changed by the time we get to calling
> > > > fat_update_time after checking inode_needs_update_time. Ensure that we
> > > > attempt the i_version bump if any of the S_* flags besides S_ATIME are
> > > > set.
> > > 
> > > I'm not sure what it meaning though, this is from
> > > generic_update_time(). Are you going to change generic_update_time()
> > > too? If so, it doesn't break lazytime feature?
> > > 
> > 
> > Yes. generic_update_time is also being changed in a similar fashion.
> > This shouldn't break the lazytime feature: lazytime is all about how and
> > when timestamps get written to disk. This work is all about which
> > clocksource the timestamps originally come from.
> 
> I can only find the following update in this series, another series
> updates generic_update_time()? The patch updates only if S_VERSION is
> set.
> 
> Your fat patch sets I_DIRTY_SYNC always instead of I_DIRTY_TIME. When I
> last time checked lazytime, and it was depending on I_DIRTY_TIME.
> 
> Are you sure it doesn't break lazytime? I'm totally confusing, and
> really similar with generic_update_time()?
> 

I'm a little confused too. Why do you believe this will break
-o relatime handling? This patch changes two things:

1/ it has fat_update_time fetch its own timestamp (and ignore the "now"
parameter). This is in line with the changes in patch #3 of this series,
which explains the rationale for this in more detail.

2/ it changes fat_update_time to also update the i_version if any of
S_CTIME|S_MTIME|S_VERSION are set. relatime is all about the S_ATIME,
and it is specifically excluded from that set.

The rationale for the second change is also in patch #3, but
basically, we can't guarantee that current_time hasn't changed since we
last checked for inode_needs_update_time, so if any of
S_CTIME/S_MTIME/S_VERSION have changed, then we need to assume that any
of them may need to be changed and attempt to update all 3.

That said, I think the logic in fat_update_time isn't quite right. I
think we want something like this on top of this patch to ensure that the
S_CTIME and S_MTIME get updated, even if the flags only have S_VERSION
set.

Thoughts?

-8<---

diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index 080a5035483f..313eef02f45c 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -346,15 +346,21 @@ int fat_update_time(struct inode *inode, int flags)
if (inode->i_ino == MSDOS_ROOT_INO)
return 0;
 
-   if (flags & (S_ATIME | S_CTIME | S_MTIME)) {
-   fat_truncate_time(inode, NULL, flags);
-   if (inode->i_sb->s_flags & SB_LAZYTIME)
-   dirty_flags |= I_DIRTY_TIME;
-   else
-   dirty_flags |= I_DIRTY_SYNC;
-   }
+   /*
+* If any of the flags indicate an explicit change to the file, then we
+* need to ensure that we attempt to update all 3 of them. We do not do
+* this in the case of an S_ATIME-only update.
+*/
+   if (flags & (S_CTIME | S_MTIME | S_VERSION))
+   flags |= S_CTIME | S_MTIME | S_VERSION;
+
+   fat_truncate_time(inode, NULL, flags);
+   if (inode->i_sb->s_flags & SB_LAZYTIME)
+   dirty_flags |= I_DIRTY_TIME;
+   else
+   dirty_flags |= I_DIRTY_SYNC;
 
-   if ((flags & (S_VERSION|S_CTIME|S_MTIME)) && inode_maybe_inc_iversion(inode, false))
+   if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
dirty_flags |= I_DIRTY_SYNC;



Re: [Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-09 Thread Jeff Layton
On Wed, 2023-08-09 at 17:37 +0900, OGAWA Hirofumi wrote:
> Jeff Layton  writes:
> 
> > Also, it may be that things have changed by the time we get to calling
> > fat_update_time after checking inode_needs_update_time. Ensure that we
> > attempt the i_version bump if any of the S_* flags besides S_ATIME are
> > set.
> 
> I'm not sure what it meaning though, this is from
> generic_update_time(). Are you going to change generic_update_time()
> too? If so, it doesn't break lazytime feature?
> 

Yes. generic_update_time is also being changed in a similar fashion.
This shouldn't break the lazytime feature: lazytime is all about how and
when timestamps get written to disk. This work is all about which
clocksource the timestamps originally come from.

> Thanks.
> 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/fat/misc.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/fat/misc.c b/fs/fat/misc.c
> > index 67006ea08db6..8cab87145d63 100644
> > --- a/fs/fat/misc.c
> > +++ b/fs/fat/misc.c
> > @@ -347,14 +347,14 @@ int fat_update_time(struct inode *inode, struct timespec64 *now, int flags)
> > return 0;
> >  
> > if (flags & (S_ATIME | S_CTIME | S_MTIME)) {
> > -   fat_truncate_time(inode, now, flags);
> > +   fat_truncate_time(inode, NULL, flags);
> > if (inode->i_sb->s_flags & SB_LAZYTIME)
> > dirty_flags |= I_DIRTY_TIME;
> > else
> > dirty_flags |= I_DIRTY_SYNC;
> > }
> >  
> > -   if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
> > +   if ((flags & (S_VERSION|S_CTIME|S_MTIME)) && inode_maybe_inc_iversion(inode, false))
> > dirty_flags |= I_DIRTY_SYNC;
> >  
> > __mark_inode_dirty(inode, dirty_flags);
> 

-- 
Jeff Layton 



[Cluster-devel] [PATCH v7 09/13] fs: add infrastructure for multigrain timestamps

2023-08-07 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried.

POSIX generally mandates that when the mtime changes, the ctime must
also change. The kernel always stores normalized ctime values, so only
the first 30 bits of the tv_nsec field are ever used.

Use the 31st bit of the ctime tv_nsec field to indicate that something
has queried the inode for the mtime or ctime. When this flag is set,
on the next mtime or ctime update, the kernel will fetch a fine-grained
timestamp instead of the usual coarse-grained one.
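
As a rough user-space illustration of the flag-bit idea (a sketch only,
not the code in this patch): normalized nanoseconds are always below
1,000,000,000, which is less than 2^30, so bit 30 is free to carry the
"queried" state. The MODEL_* and model_* names are invented for the
example, the two clock readings are passed in by the caller, and the
real code would need atomic accesses and the "never go backwards" check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NSEC_PER_SEC  1000000000L
#define MODEL_CTIME_QUERIED (1L << 30)	/* first bit above any valid nsec */

/* Stand-in for the inode's ctime; the flag is folded into the nsec field. */
struct model_inode {
	int64_t ctime_sec;
	long    ctime_nsec;
};

/* Reader side: report the nanoseconds and remember that someone looked. */
static long model_getattr_nsec(struct model_inode *inode)
{
	long nsec = inode->ctime_nsec & ~MODEL_CTIME_QUERIED;

	inode->ctime_nsec |= MODEL_CTIME_QUERIED;
	return nsec;
}

/*
 * Writer side: use the fine-grained reading only if the previous value
 * was queried, otherwise the coarse one. Storing the new value clears
 * the flag. Returns true if the fine-grained reading was used.
 */
static bool model_update_ctime(struct model_inode *inode, int64_t sec,
			       long coarse_nsec, long fine_nsec)
{
	bool fine = (inode->ctime_nsec & MODEL_CTIME_QUERIED) != 0;
	long nsec = fine ? fine_nsec : coarse_nsec;

	assert(nsec >= 0 && nsec < MODEL_NSEC_PER_SEC);
	inode->ctime_sec = sec;
	inode->ctime_nsec = nsec;
	return fine;
}
```

The point of the model is that a file nobody is watching keeps paying
only for coarse-grained updates, while a file that has been stat'ed
gets exactly one fine-grained update per observation.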

Filesystems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.

Later patches will convert individual filesystems to use the new
infrastructure.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 82 --
 fs/stat.c  | 41 +--
 include/linux/fs.h | 46 --
 3 files changed, 162 insertions(+), 7 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e50d94a136fe..f55957ac80e6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2118,10 +2118,52 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * current_mgtime - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+struct timespec64 current_mgtime(struct inode *inode)
+{
+   struct timespec64 now, ctime;
+   atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec;
+   long nsec = atomic_long_read(pnsec);
+
+   if (nsec & I_CTIME_QUERIED) {
+   ktime_get_real_ts64(&now);
+   return timestamp_truncate(now, inode);
+   }
+
+   ktime_get_coarse_real_ts64(&now);
+   now = timestamp_truncate(now, inode);
+
+   /*
+* If we've recently fetched a fine-grained timestamp
+* then the coarse-grained one may still be earlier than the
+* existing ctime. Just keep the existing value if so.
+*/
+   ctime = inode_get_ctime(inode);
+   if (timespec64_compare(&ctime, &now) > 0)
+   now = ctime;
+
+   return now;
+}
+EXPORT_SYMBOL(current_mgtime);
+
+static struct timespec64 current_ctime(struct inode *inode)
+{
+   if (is_mgtime(inode))
+   return current_mgtime(inode);
+   return current_time(inode);
+}
+
 static int inode_needs_update_time(struct inode *inode)
 {
int sync_it = 0;
-   struct timespec64 now = current_time(inode);
+   struct timespec64 now = current_ctime(inode);
struct timespec64 ctime;
 
/* First try to exhaust all avenues to not sync */
@@ -2552,9 +2594,43 @@ EXPORT_SYMBOL(current_time);
  */
 struct timespec64 inode_set_ctime_current(struct inode *inode)
 {
-   struct timespec64 now = current_time(inode);
+   struct timespec64 now;
+   struct timespec64 ctime;
+
+   ctime.tv_nsec = READ_ONCE(inode->__i_ctime.tv_nsec);
+   if (!(ctime.tv_nsec & I_CTIME_QUERIED)) {
+   now = current_time(inode);
 
-   inode_set_ctime(inode, now.tv_sec, now.tv_nsec);
+   /* Just copy it into place if it's not multigrain */
+   if (!is_mgtime(inode)) {
+   inode_set_ctime_to_ts(inode, now);
+   return now;
+   }
+
+   /*
+* If we've recently updated with a fine-grained timestamp,
+* then the coarse-grained one may still be earlier than the
+* existing ctime. Just keep the existing value if so.
+*/
+   ctime.tv_sec = inode->__i_ctime.tv_sec;
+   if (timespec64_compare(&ctime, &now) > 

[Cluster-devel] [PATCH v7 12/13] ext4: switch to multigrain timestamps

2023-08-07 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Acked-by: Theodore Ts'o 
Reviewed-by: Jan Kara 
Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b54c70e1a74e..cb1ff47af156 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7279,7 +7279,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.41.0



[Cluster-devel] [PATCH v7 13/13] btrfs: convert to multigrain timestamps

2023-08-07 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Signed-off-by: Jeff Layton 
Acked-by: David Sterba 
---
 fs/btrfs/file.c  | 24 
 fs/btrfs/super.c |  5 +++--
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d7a9ece7a40b..b9e75c9f95ac 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1106,25 +1106,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ctime;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   if (!timespec64_equal(&inode->i_mtime, &now))
-   inode->i_mtime = now;
-
-   ctime = inode_get_ctime(inode);
-   if (!timespec64_equal(&ctime, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1156,7 +1137,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode->i_mtime = inode_set_ctime_current(inode);
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f1dd172d8d5b..8eda51b095c9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2144,7 +2144,7 @@ static struct file_system_type btrfs_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_MGTIME,
 };
 
 static struct file_system_type btrfs_root_fs_type = {
@@ -2152,7 +2152,8 @@ static struct file_system_type btrfs_root_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount_root,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.41.0



[Cluster-devel] [PATCH v7 02/13] fs: pass the request_mask to generic_fillattr

2023-08-07 Thread Jeff Layton
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
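
As an illustration only, the effect of passing the request mask down can
be modelled in userspace like this. It is a sketch, not the kernel code:
toy_inode, toy_kstat, toy_fillattr and the TOY_STATX_* values are
hypothetical, and the cookie_queries counter merely stands in for
whatever side effect reporting the attribute has.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative mask bits; these are not the kernel's STATX_* values. */
#define TOY_STATX_BASIC_STATS   0x1u
#define TOY_STATX_CHANGE_COOKIE 0x2u

struct toy_inode {
	uint64_t size;
	uint64_t change_cookie;
	unsigned int cookie_queries;	/* models the side effect of reporting */
};

struct toy_kstat {
	uint32_t result_mask;
	uint64_t size;
	uint64_t change_cookie;
};

static void toy_fillattr(uint32_t request_mask, struct toy_inode *inode,
			 struct toy_kstat *stat)
{
	/* Cheap attributes are still copied unconditionally. */
	stat->size = inode->size;
	stat->result_mask = TOY_STATX_BASIC_STATS;

	/* An attribute with side effects is reported only when asked for. */
	if (request_mask & TOY_STATX_CHANGE_COOKIE) {
		stat->change_cookie = inode->change_cookie;
		stat->result_mask |= TOY_STATX_CHANGE_COOKIE;
		inode->cookie_queries++;
	}
}
```

A caller that only wants the basic stats passes TOY_STATX_BASIC_STATS and
never triggers the side effect.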

Acked-by: Joseph Qi 
Reviewed-by: Xiubo Li 
Reviewed-by: "Paulo Alcantara (SUSE)" 
Reviewed-by: Jan Kara 
Signed-off-by: Jeff Layton 
---
 fs/9p/vfs_inode.c   |  4 ++--
 fs/9p/vfs_inode_dotl.c  |  4 ++--
 fs/afs/inode.c  |  2 +-
 fs/btrfs/inode.c|  2 +-
 fs/ceph/inode.c |  2 +-
 fs/coda/inode.c |  3 ++-
 fs/ecryptfs/inode.c |  5 +++--
 fs/erofs/inode.c|  2 +-
 fs/exfat/file.c |  2 +-
 fs/ext2/inode.c |  2 +-
 fs/ext4/inode.c |  2 +-
 fs/f2fs/file.c  |  2 +-
 fs/fat/file.c   |  2 +-
 fs/fuse/dir.c   |  2 +-
 fs/gfs2/inode.c |  2 +-
 fs/hfsplus/inode.c  |  2 +-
 fs/kernfs/inode.c   |  2 +-
 fs/libfs.c  |  4 ++--
 fs/minix/inode.c|  2 +-
 fs/nfs/inode.c  |  2 +-
 fs/nfs/namespace.c  |  3 ++-
 fs/ntfs3/file.c |  2 +-
 fs/ocfs2/file.c |  2 +-
 fs/orangefs/inode.c |  2 +-
 fs/proc/base.c  |  4 ++--
 fs/proc/fd.c|  2 +-
 fs/proc/generic.c   |  2 +-
 fs/proc/proc_net.c  |  2 +-
 fs/proc/proc_sysctl.c   |  2 +-
 fs/proc/root.c  |  3 ++-
 fs/smb/client/inode.c   |  2 +-
 fs/smb/server/smb2pdu.c | 22 +++---
 fs/smb/server/vfs.c |  3 ++-
 fs/stat.c   | 24 +---
 fs/sysv/itree.c |  3 ++-
 fs/ubifs/dir.c  |  2 +-
 fs/udf/symlink.c|  2 +-
 fs/vboxsf/utils.c   |  2 +-
 include/linux/fs.h  |  2 +-
 mm/shmem.c  |  2 +-
 40 files changed, 73 insertions(+), 65 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 16d85e6033a3..d24d1f20e922 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1016,7 +1016,7 @@ v9fs_vfs_getattr(struct mnt_idmap *idmap, const struct path *path,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) {
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
return 0;
} else if (v9ses->cache & CACHE_WRITEBACK) {
if (S_ISREG(inode->i_mode)) {
@@ -1037,7 +1037,7 @@ v9fs_vfs_getattr(struct mnt_idmap *idmap, const struct path *path,
return PTR_ERR(st);
 
v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb, 0);
-   generic_fillattr(&nop_mnt_idmap, d_inode(dentry), stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(dentry), stat);
 
p9stat_free(st);
kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 464ea73d1bf8..8e8d5d2a13d8 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -451,7 +451,7 @@ v9fs_vfs_getattr_dotl(struct mnt_idmap *idmap,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) {
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
return 0;
} else if (v9ses->cache) {
if (S_ISREG(inode->i_mode)) {
@@ -476,7 +476,7 @@ v9fs_vfs_getattr_dotl(struct mnt_idmap *idmap,
return PTR_ERR(st);
 
v9fs_stat2inode_dotl(st, d_inode(dentry), 0);
-   generic_fillattr(&nop_mnt_idmap, d_inode(dentry), stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(dentry), stat);
/* Change block size to what the server returned */
stat->blksize = st->st_blksize;
 
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 6b636f43f548..1c794a1896aa 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -773,7 +773,7 @@ int afs_getattr(struct mnt_idmap *idmap, const struct path *path,
 
do {
read_seqbegin_or_lock(&vnode->cb_lock, &seq);
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
if (test_bit(AFS_VNODE_SILLY_DELETED, &vnode->flags) &&
stat->nlink > 0)
stat->nlink -= 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c

[Cluster-devel] [PATCH v7 08/13] fs: drop the timespec64 argument from update_time

2023-08-07 Thread Jeff Layton
Now that all of the update_time operations are prepared for it, we can
drop the timespec64 argument from the update_time operation. Do that and
remove it from some associated functions like inode_update_time and
inode_needs_update_time.
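
For readers who want the shape of the change without the diff noise, here
is a small userspace sketch of the dispatch after the argument is dropped.
All toy_* names are hypothetical, the NULL check on i_op exists only to
keep the sketch short, and -5 is just a stand-in for -EIO.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;

struct toy_inode_operations {
	/* After the change: no timestamp argument, only the flags. */
	int (*update_time)(struct toy_inode *inode, int flags);
};

struct toy_inode {
	const struct toy_inode_operations *i_op;
	int generic_updates;	/* how often the generic fallback ran */
};

static void toy_generic_update_time(struct toy_inode *inode, int flags)
{
	(void)flags;
	inode->generic_updates++;
}

/* Use the filesystem's op if it has one, else take the generic path. */
static int toy_inode_update_time(struct toy_inode *inode, int flags)
{
	if (inode->i_op && inode->i_op->update_time)
		return inode->i_op->update_time(inode, flags);
	toy_generic_update_time(inode, flags);
	return 0;
}

/* A filesystem-specific op that always fails, in the spirit of bad_inode. */
static int toy_bad_update_time(struct toy_inode *inode, int flags)
{
	(void)inode;
	(void)flags;
	return -5;	/* stand-in for -EIO */
}

static const struct toy_inode_operations toy_bad_ops = {
	.update_time = toy_bad_update_time,
};
```

Neither path receives a timestamp from its caller any more, so each one
has to fetch its own.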

Signed-off-by: Jeff Layton 
---
 fs/bad_inode.c   |  3 +--
 fs/btrfs/inode.c |  3 +--
 fs/btrfs/volumes.c   |  4 +---
 fs/fat/fat.h |  3 +--
 fs/fat/misc.c|  2 +-
 fs/gfs2/inode.c  |  3 +--
 fs/inode.c   | 30 +-
 fs/overlayfs/inode.c |  2 +-
 fs/overlayfs/overlayfs.h |  2 +-
 fs/ubifs/file.c  |  3 +--
 fs/ubifs/ubifs.h |  2 +-
 fs/xfs/xfs_iops.c|  1 -
 include/linux/fs.h   |  4 ++--
 13 files changed, 25 insertions(+), 37 deletions(-)

diff --git a/fs/bad_inode.c b/fs/bad_inode.c
index 6e21f7412a85..83f9566c973b 100644
--- a/fs/bad_inode.c
+++ b/fs/bad_inode.c
@@ -133,8 +133,7 @@ static int bad_inode_fiemap(struct inode *inode,
return -EIO;
 }
 
-static int bad_inode_update_time(struct inode *inode, struct timespec64 *time,
-int flags)
+static int bad_inode_update_time(struct inode *inode, int flags)
 {
return -EIO;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d52e7d64570a..0964c66411a1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6059,8 +6059,7 @@ static int btrfs_dirty_inode(struct btrfs_inode *inode)
  * This is a copy of file_update_time.  We need this so we can return error on
  * ENOSPC for updating the inode in the case of file write and mmap writes.
  */
-static int btrfs_update_time(struct inode *inode, struct timespec64 *now,
-int flags)
+static int btrfs_update_time(struct inode *inode, int flags)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
bool dirty = flags & ~S_VERSION;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73f9ea7672db..264c71590370 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1917,15 +1917,13 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
 static void update_dev_time(const char *device_path)
 {
struct path path;
-   struct timespec64 now;
int ret;
 
ret = kern_path(device_path, LOOKUP_FOLLOW, &path);
if (ret)
return;
 
-   now = current_time(d_inode(path.dentry));
-   inode_update_time(d_inode(path.dentry), &now, S_MTIME | S_CTIME | S_VERSION);
+   inode_update_time(d_inode(path.dentry), S_MTIME | S_CTIME | S_VERSION);
path_put(&path);
 }
 
diff --git a/fs/fat/fat.h b/fs/fat/fat.h
index e3b690b48e3e..66cf4778cf3b 100644
--- a/fs/fat/fat.h
+++ b/fs/fat/fat.h
@@ -460,8 +460,7 @@ extern struct timespec64 fat_truncate_mtime(const struct msdos_sb_info *sbi,
const struct timespec64 *ts);
 extern int fat_truncate_time(struct inode *inode, struct timespec64 *now,
 int flags);
-extern int fat_update_time(struct inode *inode, struct timespec64 *now,
-  int flags);
+extern int fat_update_time(struct inode *inode, int flags);
 extern int fat_sync_bhs(struct buffer_head **bhs, int nr_bhs);
 
 int fat_cache_init(void);
diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index 8cab87145d63..080a5035483f 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -339,7 +339,7 @@ int fat_truncate_time(struct inode *inode, struct timespec64 *now, int flags)
 }
 EXPORT_SYMBOL_GPL(fat_truncate_time);
 
-int fat_update_time(struct inode *inode, struct timespec64 *now, int flags)
+int fat_update_time(struct inode *inode, int flags)
 {
int dirty_flags = 0;
 
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index f1f04557aa21..a21ac41d6669 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -2139,8 +2139,7 @@ loff_t gfs2_seek_hole(struct file *file, loff_t offset)
return vfs_setpos(file, ret, inode->i_sb->s_maxbytes);
 }
 
-static int gfs2_update_time(struct inode *inode, struct timespec64 *time,
-   int flags)
+static int gfs2_update_time(struct inode *inode, int flags)
 {
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_glock *gl = ip->i_gl;
diff --git a/fs/inode.c b/fs/inode.c
index e07e45f6cd01..e50d94a136fe 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1958,10 +1958,10 @@ EXPORT_SYMBOL(generic_update_time);
  * This does the actual work of updating an inodes time or version.  Must have
  * had called mnt_want_write() before calling this.
  */
-int inode_update_time(struct inode *inode, struct timespec64 *time, int flags)
+int inode_update_time(struct inode *inode, int flags)
 {
if (inode->i_op->update_time)
-   return inode->i_op->update_time(inode, time, flags);
+   return inode->i_op->update_time(inode, flags);
generic_update_time(inode, flags);
return 0;
 }
@@ -2015,7 +2015

[Cluster-devel] [PATCH v7 03/13] fs: drop the timespec64 arg from generic_update_time

2023-08-07 Thread Jeff Layton
In future patches we're going to change how the ctime is updated
to keep track of when it has been queried. The way that the update_time
operation works (and a lot of its callers) makes this difficult, since
they grab a timestamp early and then pass it down to eventually be
copied into the inode.

All of the existing update_time callers pass in the result of
current_time() in some fashion. Drop the "time" parameter from
generic_update_time, and rework it to fetch its own timestamp.

This change means that an update_time could fetch a different timestamp
than was seen in inode_needs_update_time. update_time is only ever
called with one of two flag combinations: Either S_ATIME is set, or
S_MTIME|S_CTIME|S_VERSION are set.

With this change we now treat the flags argument as an indicator that
some value needed to be updated when last checked, rather than an
indication to update specific timestamps.

Rework the logic for updating the timestamps and put it in a new
inode_update_timestamps helper that other update_time routines can use.
S_ATIME is treated as it always has been, but if any of the other three
are set, then we attempt to update all three.

Also, some callers of generic_update_time need to know what timestamps
were actually updated. Change it to return an S_* flag mask to indicate
that and rework the callers to expect it.
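
A rough userspace model of the semantics described above may help. It is
only a sketch: the toy_* names and TOY_S_* values are hypothetical,
timestamps are plain tick counts, "now" is passed in so that the example
stays deterministic (the point of the patch is that the helper fetches it
itself), and the version bump is simplified compared to what
inode_maybe_inc_iversion does.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; not the kernel's definitions. */
#define TOY_S_ATIME   1
#define TOY_S_MTIME   2
#define TOY_S_CTIME   4
#define TOY_S_VERSION 8

struct toy_inode {
	int64_t  atime, mtime, ctime;	/* timestamps as plain tick counts */
	uint64_t version;
};

/*
 * "flags" says what looked stale when the caller last checked. Returns the
 * set of TOY_S_* values that actually changed.
 */
static int toy_update_timestamps(struct toy_inode *inode, int flags, int64_t now)
{
	int updated = 0;

	if (flags & (TOY_S_MTIME | TOY_S_CTIME | TOY_S_VERSION)) {
		/* Any one of the three means: try to update all three. */
		if (inode->ctime != now) {
			inode->ctime = now;
			updated |= TOY_S_CTIME;
		}
		if (inode->mtime != now) {
			inode->mtime = now;
			updated |= TOY_S_MTIME;
		}
		/* Simplification: bump the version only if a timestamp moved. */
		if (updated) {
			inode->version++;
			updated |= TOY_S_VERSION;
		}
	}

	/* The atime is handled independently of the rest. */
	if ((flags & TOY_S_ATIME) && inode->atime != now) {
		inode->atime = now;
		updated |= TOY_S_ATIME;
	}
	return updated;
}
```

A caller can then decide whether the inode needs to be dirtied by looking
at the returned mask instead of the flags it passed in.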

Signed-off-by: Jeff Layton 
---
 fs/gfs2/inode.c |  3 +-
 fs/inode.c  | 84 +
 fs/orangefs/inode.c |  3 +-
 fs/ubifs/file.c |  6 ++--
 fs/xfs/xfs_iops.c   |  6 ++--
 include/linux/fs.h  |  3 +-
 6 files changed, 80 insertions(+), 25 deletions(-)

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 200cabf3b393..f1f04557aa21 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -2155,7 +2155,8 @@ static int gfs2_update_time(struct inode *inode, struct timespec64 *time,
if (error)
return error;
}
-   return generic_update_time(inode, time, flags);
+   generic_update_time(inode, flags);
+   return 0;
 }
 
 static const struct inode_operations gfs2_file_iops = {
diff --git a/fs/inode.c b/fs/inode.c
index 3fc251bfaf73..e07e45f6cd01 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1881,29 +1881,76 @@ static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
return 0;
 }
 
-int generic_update_time(struct inode *inode, struct timespec64 *time, int flags)
+/**
+ * inode_update_timestamps - update the timestamps on the inode
+ * @inode: inode to be updated
+ * @flags: S_* flags that needed to be updated
+ *
+ * The update_time function is called when an inode's timestamps need to be
+ * updated for a read or write operation. This function handles updating the
+ * actual timestamps. It's up to the caller to ensure that the inode is marked
+ * dirty appropriately.
+ *
+ * In the case where any of S_MTIME, S_CTIME, or S_VERSION need to be updated,
+ * attempt to update all three of them. S_ATIME updates can be handled
+ * independently of the rest.
+ *
+ * Returns a set of S_* flags indicating which values changed.
+ */
+int inode_update_timestamps(struct inode *inode, int flags)
 {
-   int dirty_flags = 0;
+   int updated = 0;
+   struct timespec64 now;
+
+   if (flags & (S_MTIME|S_CTIME|S_VERSION)) {
+   struct timespec64 ctime = inode_get_ctime(inode);
 
-   if (flags & (S_ATIME | S_CTIME | S_MTIME)) {
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_CTIME)
-   inode_set_ctime_to_ts(inode, *time);
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
-
-   if (inode->i_sb->s_flags & SB_LAZYTIME)
-   dirty_flags |= I_DIRTY_TIME;
-   else
-   dirty_flags |= I_DIRTY_SYNC;
+   now = inode_set_ctime_current(inode);
+   if (!timespec64_equal(&now, &ctime))
+   updated |= S_CTIME;
+   if (!timespec64_equal(&now, &inode->i_mtime)) {
+   inode->i_mtime = now;
+   updated |= S_MTIME;
+   }
+   if (IS_I_VERSION(inode) && inode_maybe_inc_iversion(inode, updated))
+   updated |= S_VERSION;
+   } else {
+   now = current_time(inode);
}
 
-   if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
-   dirty_flags |= I_DIRTY_SYNC;
+   if (flags & S_ATIME) {
+   if (!timespec64_equal(&now, &inode->i_atime)) {
+   inode->i_atime = now;
+   updated |= S_ATIME;
+   }
+   }
+   return updated;
+}
+EXPORT_SYMBOL(inode_update_timestamps);
+
+/**
+ * generic_update_

[Cluster-devel] [PATCH v7 07/13] xfs: have xfs_vn_update_time gets its own timestamp

2023-08-07 Thread Jeff Layton
In later patches we're going to drop the "now" parameter from the
update_time operation. Prepare XFS for this by reworking how it fetches
timestamps and sets them in the inode. Ensure that we update the ctime
even if only S_MTIME is set.

Signed-off-by: Jeff Layton 
---
 fs/xfs/xfs_iops.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 731f45391baa..72d18e7840f5 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1037,6 +1037,7 @@ xfs_vn_update_time(
int log_flags = XFS_ILOG_TIMESTAMP;
struct xfs_trans*tp;
int error;
+   struct timespec64   now = current_time(inode);
 
trace_xfs_update_time(ip);
 
@@ -1056,12 +1057,15 @@ xfs_vn_update_time(
return error;
 
xfs_ilock(ip, XFS_ILOCK_EXCL);
-   if (flags & S_CTIME)
-   inode_set_ctime_to_ts(inode, *now);
+   if (flags & (S_CTIME|S_MTIME))
+   now = inode_set_ctime_current(inode);
+   else
+   now = current_time(inode);
+
if (flags & S_MTIME)
-   inode->i_mtime = *now;
+   inode->i_mtime = now;
if (flags & S_ATIME)
-   inode->i_atime = *now;
+   inode->i_atime = now;
 
xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
xfs_trans_log_inode(tp, ip, log_flags);

-- 
2.41.0



[Cluster-devel] [PATCH v7 10/13] tmpfs: add support for multigrain timestamps

2023-08-07 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Reviewed-by: Jan Kara 
Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 142ead70e8c1..98cc4be7a8a8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4220,7 +4220,7 @@ static struct file_system_type shmem_fs_type = {
 #endif
.kill_sb= kill_litter_super,
 #ifdef CONFIG_SHMEM
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 #else
.fs_flags   = FS_USERNS_MOUNT,
 #endif

-- 
2.41.0



[Cluster-devel] [PATCH v7 04/13] btrfs: have it use inode_update_timestamps

2023-08-07 Thread Jeff Layton
In later patches, we're going to drop the "now" argument from the
update_time operation. Have btrfs_update_time use the new
inode_update_timestamps helper to fetch a new timestamp and update it
properly.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/inode.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 29a20f828dda..d52e7d64570a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6068,14 +6068,7 @@ static int btrfs_update_time(struct inode *inode, struct timespec64 *now,
if (btrfs_root_readonly(root))
return -EROFS;
 
-   if (flags & S_VERSION)
-   dirty |= inode_maybe_inc_iversion(inode, dirty);
-   if (flags & S_CTIME)
-   inode_set_ctime_to_ts(inode, *now);
-   if (flags & S_MTIME)
-   inode->i_mtime = *now;
-   if (flags & S_ATIME)
-   inode->i_atime = *now;
+   dirty = inode_update_timestamps(inode, flags);
return dirty ? btrfs_dirty_inode(BTRFS_I(inode)) : 0;
 }
 

-- 
2.41.0



[Cluster-devel] [PATCH v7 05/13] fat: make fat_update_time get its own timestamp

2023-08-07 Thread Jeff Layton
In later patches, we're going to drop the "now" parameter from the
update_time operation. Fix fat_update_time to fetch its own timestamp.
It turns out that this is easily done by just passing a NULL timestamp
pointer to fat_truncate_time.

Also, it may be that things have changed by the time we get to calling
fat_update_time after checking inode_needs_update_time. Ensure that we
attempt the i_version bump if any of the S_* flags besides S_ATIME are
set.
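
The NULL-pointer convention that the diff below relies on (a NULL "now"
passed to fat_truncate_time) can be sketched in userspace as follows. The
toy_* names are hypothetical, the clock field stands in for the real time
source, and rounding to an even second is only meant to model FAT's
2-second mtime granularity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
	int64_t clock;	/* stand-in for the current time, in seconds */
	int64_t mtime;
};

static int64_t toy_current_time(const struct toy_inode *inode)
{
	return inode->clock;
}

/* A NULL "now" means: fetch the timestamp here instead of in the caller. */
static void toy_truncate_time(struct toy_inode *inode, const int64_t *now)
{
	int64_t ts = now ? *now : toy_current_time(inode);

	/* Round down to an even second (assumes ts >= 0). */
	inode->mtime = ts - (ts % 2);
}
```

With this convention the update_time path never needs a timestamp from
its caller, while callers that do have one can still pass it in.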

Signed-off-by: Jeff Layton 
---
 fs/fat/misc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index 67006ea08db6..8cab87145d63 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -347,14 +347,14 @@ int fat_update_time(struct inode *inode, struct timespec64 *now, int flags)
return 0;
 
if (flags & (S_ATIME | S_CTIME | S_MTIME)) {
-   fat_truncate_time(inode, now, flags);
+   fat_truncate_time(inode, NULL, flags);
if (inode->i_sb->s_flags & SB_LAZYTIME)
dirty_flags |= I_DIRTY_TIME;
else
dirty_flags |= I_DIRTY_SYNC;
}
 
-   if ((flags & S_VERSION) && inode_maybe_inc_iversion(inode, false))
+   if ((flags & (S_VERSION|S_CTIME|S_MTIME)) && inode_maybe_inc_iversion(inode, false))
dirty_flags |= I_DIRTY_SYNC;
 
__mark_inode_dirty(inode, dirty_flags);

-- 
2.41.0



[Cluster-devel] [PATCH v7 11/13] xfs: switch to multigrain timestamps

2023-08-07 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.
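
The rule can be modelled in a few lines of userspace C. This is a sketch
under assumptions, not the XFS code: toy_inode, toy_ichgtime and the
TOY_ICHGTIME_* values are hypothetical, and timestamps are plain tick
counts supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; not the XFS definitions. */
#define TOY_ICHGTIME_MOD    0x1
#define TOY_ICHGTIME_CHG    0x2
#define TOY_ICHGTIME_CREATE 0x4

struct toy_inode {
	int64_t mtime, ctime, crtime;
};

static void toy_ichgtime(struct toy_inode *inode, int flags, int64_t now)
{
	/* If the mtime changes, the ctime must too, so CHG is mandatory. */
	assert(flags & TOY_ICHGTIME_CHG);

	inode->ctime = now;		/* unconditional */
	if (flags & TOY_ICHGTIME_MOD)
		inode->mtime = now;
	if (flags & TOY_ICHGTIME_CREATE)
		inode->crtime = now;
}
```

Because the ctime store is unconditional, a single timestamp source
covers every caller of the function.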

Acked-by: "Darrick J. Wong" 
Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
 fs/xfs/xfs_iops.c   | 8 
 fs/xfs/xfs_super.c  | 2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 6b2296ff248a..ad22656376d3 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode->i_mtime = tv;
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c73529f77bac..2ededd3f6b8c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -573,10 +573,10 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode->i_atime;
-   stat->mtime = inode->i_mtime;
-   stat->ctime = inode_get_ctime(inode);
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
+   fill_mg_cmtime(stat, request_mask, inode);
+
if (xfs_has_v3inodes(mp)) {
if (request_mask & STATX_BTIME) {
stat->result_mask |= STATX_BTIME;
@@ -917,7 +917,7 @@ xfs_setattr_size(
if (newsize != oldsize &&
!(iattr->ia_valid & (ATTR_CTIME | ATTR_MTIME))) {
iattr->ia_ctime = iattr->ia_mtime =
-   current_time(inode);
+   current_mgtime(inode);
iattr->ia_valid |= ATTR_CTIME | ATTR_MTIME;
}
 
@@ -1036,7 +1036,7 @@ xfs_vn_update_time(
int log_flags = XFS_ILOG_TIMESTAMP;
struct xfs_trans*tp;
int error;
-   struct timespec64   now = current_time(inode);
+   struct timespec64   now;
 
trace_xfs_update_time(ip);
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 818510243130..4b10edb2c972 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2009,7 +2009,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.41.0



[Cluster-devel] [PATCH v7 06/13] ubifs: have ubifs_update_time use inode_update_timestamps

2023-08-07 Thread Jeff Layton
In later patches, we're going to drop the "now" parameter from the
update_time operation. Prepare ubifs for this, by having it use the new
inode_update_timestamps helper.

Signed-off-by: Jeff Layton 
---
 fs/ubifs/file.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index df9086b19cd0..2d0178922e19 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1397,15 +1397,9 @@ int ubifs_update_time(struct inode *inode, struct timespec64 *time,
return err;
 
mutex_lock(&ui->ui_mutex);
-   if (flags & S_ATIME)
-   inode->i_atime = *time;
-   if (flags & S_CTIME)
-   inode_set_ctime_to_ts(inode, *time);
-   if (flags & S_MTIME)
-   inode->i_mtime = *time;
-
-   release = ui->dirty;
+   inode_update_timestamps(inode, flags);
__mark_inode_dirty(inode, I_DIRTY_SYNC);
+   release = ui->dirty;
mutex_unlock(&ui->ui_mutex);
if (release)
ubifs_release_budget(c, &req);

-- 
2.41.0



[Cluster-devel] [PATCH v7 00/13] fs: implement multigrain timestamps

2023-08-07 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this coarseness has always been an issue when we're
exporting via NFSv3, which relies on timestamps to validate caches. A
lot of changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide to invalidate the cache.

Even with NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps (e.g.
backup applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. The idea is to use an unused bit in the ctime's
tv_nsec field to mark when the mtime or ctime has been queried via
getattr. Once that has been marked, the next m/ctime update will use a
fine-grained timestamp.
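
To make the idea concrete, here is a minimal userspace sketch of the
flag-bit scheme. It is not the kernel implementation: toy_inode,
TOY_CTIME_QUERIED, toy_getattr_nsec and toy_update_ctime are hypothetical
names, the choice of bit 30 is an assumption (normalized tv_nsec values
stay below 10^9, which fits in 30 bits), the caller supplies both clock
readings so the example is deterministic, and the atomicity the real code
needs is ignored entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: normalized nanoseconds are < 1e9 < 2^30, so bit 30 is free. */
#define TOY_CTIME_QUERIED (UINT32_C(1) << 30)

struct toy_inode {
	int64_t  ctime_sec;
	uint32_t ctime_nsec;	/* low 30 bits: nanoseconds; bit 30: queried flag */
};

/* Reporting the ctime marks it as queried and returns the clean value. */
static uint32_t toy_getattr_nsec(struct toy_inode *inode)
{
	inode->ctime_nsec |= TOY_CTIME_QUERIED;
	return inode->ctime_nsec & ~TOY_CTIME_QUERIED;
}

/*
 * Update the ctime. The fine-grained reading is used only if the old value
 * had been queried. Returns true if the fine-grained timestamp was used.
 */
static bool toy_update_ctime(struct toy_inode *inode, int64_t sec,
			     uint32_t coarse_nsec, uint32_t fine_nsec)
{
	bool queried = inode->ctime_nsec & TOY_CTIME_QUERIED;

	inode->ctime_sec = sec;
	/* Storing a clean value also clears the queried flag. */
	inode->ctime_nsec = queried ? fine_nsec : coarse_nsec;
	return queried;
}
```

An inode that nobody stats keeps taking the cheap coarse-grained path;
only an update that follows a query pays for a fine-grained timestamp.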

Credit goes to Dave Chinner for the original idea, and to Ben Coddington
for the catchy name. This series should apply cleanly onto Christian's
vfs.ctime branch, once the v6 mgtime patches have been dropped. That
should be everything above this commit:

525deaeb2fbf gfs2: fix timestamp handling on quota inodes

base-commit: cf22d118b89a09a0160586412160d89098f7c4c7
Signed-off-by: Jeff Layton 
---
Changes in v7:
- change update_time operation to fetch the current time itself
- don't modify current_time operation. Leave it always returning a coarse
  timestamp
- rework inode_set_ctime_current for better atomicity and ensure that
  all mgtime filesystems use it
- reorder arguments to fill_mg_cmtime

Changes in v6:
- drop the patch that removed XFS_ICHGTIME_CHG
- change WARN_ON_ONCE to ASSERT in xfs conversion patch

---
Jeff Layton (13):
  fs: remove silly warning from current_time
  fs: pass the request_mask to generic_fillattr
  fs: drop the timespec64 arg from generic_update_time
  btrfs: have it use inode_update_timestamps
  fat: make fat_update_time get its own timestamp
  ubifs: have ubifs_update_time use inode_update_timestamps
  xfs: have xfs_vn_update_time gets its own timestamp
  fs: drop the timespec64 argument from update_time
  fs: add infrastructure for multigrain timestamps
  tmpfs: add support for multigrain timestamps
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps

 fs/9p/vfs_inode.c   |   4 +-
 fs/9p/vfs_inode_dotl.c  |   4 +-
 fs/afs/inode.c  |   2 +-
 fs/bad_inode.c  |   3 +-
 fs/btrfs/file.c |  24 +
 fs/btrfs/inode.c|  14 +--
 fs/btrfs/super.c|   5 +-
 fs/btrfs/volumes.c  |   4 +-
 fs/ceph/inode.c |   2 +-
 fs/coda/inode.c |   3 +-
 fs/ecryptfs/inode.c |   5 +-
 fs/erofs/inode.c|   2 +-
 fs/exfat/file.c |   2 +-
 fs/ext2/inode.c |   2 +-
 fs/ext4/inode.c |   2 +-
 fs/ext4/super.c |   2 +-
 fs/f2fs/file.c  |   2 +-
 fs/fat/fat.h|   3 +-
 fs/fat/file.c   |   2 +-
 fs/fat/misc.c   |   6 +-
 fs/fuse/dir.c   |   2 +-
 fs/gfs2/inode.c |   8 +-
 fs/hfsplus/inode.c  |   2 +-
 fs/inode.c  | 200 +++-
 fs/kernfs/inode.c   |   2 +-
 fs/libfs.c  |   4 +-
 fs/minix/inode.c|   2 +-
 fs/nfs/inode.c  |   2 +-
 fs/nfs/namespace.c  |   3 +-
 fs/ntfs3/file.c |   2 +-
 fs/ocfs2/file.c |   2 +-
 fs/orangefs/inode.c |   5 +-
 fs/overlayfs/inode.c|   2 +-
 fs/overlayfs/overlayfs.h|   2 +-
 fs/proc/base.c  |   4 +-
 fs/proc/fd.c|   2 +-
 fs/proc/generic.c   |   2 +-
 fs/proc/proc_net.c  |   2 +-
 fs/proc/proc_sysctl.c   |   2 +-
 fs/proc/root.c  |   3 +-
 fs/smb/client/inode.c   |   2 +-
 fs/smb/server/smb2pdu.c |  22 ++---
 fs/smb/server/vfs.c |   3 +-
 fs/stat.c   |  65 ++---
 fs/sysv/itree.c |   3 +-
 fs/ubifs/dir.c  |   2 +-
 fs/ubifs/file.c |  19 ++--
 fs/ubifs/ubifs.h|   2 +-
 fs/udf/symlink.c|   2 +-
 fs/vboxsf/utils.c   |   2 +-
 fs/xfs/libxfs/xfs_trans_inode

[Cluster-devel] [PATCH v7 01/13] fs: remove silly warning from current_time

2023-08-07 Thread Jeff Layton
An inode with no superblock? Unpossible!

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d4ab92233062..3fc251bfaf73 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2495,12 +2495,6 @@ struct timespec64 current_time(struct inode *inode)
struct timespec64 now;
 
ktime_get_coarse_real_ts64(&now);
-
-   if (unlikely(!inode->i_sb)) {
-   WARN(1, "current_time() called with uninitialized super_block in the inode");
-   return now;
-   }
-
return timestamp_truncate(now, inode);
 }
 EXPORT_SYMBOL(current_time);

-- 
2.41.0



Re: [Cluster-devel] [PATCH v6 2/7] fs: add infrastructure for multigrain timestamps

2023-08-02 Thread Jeff Layton
On Wed, 2023-08-02 at 21:35 +0200, Jan Kara wrote:
> On Tue 25-07-23 10:58:15, Jeff Layton wrote:
> > The VFS always uses coarse-grained timestamps when updating the ctime
> > and mtime after a change. This has the benefit of allowing filesystems
> > to optimize away a lot of metadata updates, down to around 1 per jiffy,
> > even when a file is under heavy writes.
> > 
> > Unfortunately, this has always been an issue when we're exporting via
> > NFSv3, which relies on timestamps to validate caches. A lot of changes
> > can happen in a jiffy, so timestamps aren't sufficient to help the
> > client decide to invalidate the cache. Even with NFSv4, a lot of
> > exported filesystems don't properly support a change attribute and are
> > subject to the same problems with timestamp granularity. Other
> > applications have similar issues with timestamps (e.g backup
> > applications).
> > 
> > If we were to always use fine-grained timestamps, that would improve the
> > situation, but that becomes rather expensive, as the underlying
> > filesystem would have to log a lot more metadata updates.
> > 
> > What we need is a way to only use fine-grained timestamps when they are
> > being actively queried.
> > 
> > POSIX generally mandates that when the mtime changes, the ctime must
> > also change. The kernel always stores normalized ctime values, so only
> > the first 30 bits of the tv_nsec field are ever used.
> > 
> > Use the 31st bit of the ctime tv_nsec field to indicate that something
> > has queried the inode for the mtime or ctime. When this flag is set,
> > on the next mtime or ctime update, the kernel will fetch a fine-grained
> > timestamp instead of the usual coarse-grained one.
> > 
> > Filesystems can opt into this behavior by setting the FS_MGTIME flag in
> > the fstype. Filesystems that don't set this flag will continue to use
> > coarse-grained timestamps.
> > 
> > Later patches will convert individual filesystems to use the new
> > infrastructure.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/inode.c | 98 
> > ++
> >  fs/stat.c  | 41 +--
> >  include/linux/fs.h | 45 +++--
> >  3 files changed, 151 insertions(+), 33 deletions(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index d4ab92233062..369621e7faf5 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1919,6 +1919,21 @@ int inode_update_time(struct inode *inode, struct timespec64 *time, int flags)
> >  }
> >  EXPORT_SYMBOL(inode_update_time);
> >  
> > +/**
> > + * current_coarse_time - Return FS time
> > + * @inode: inode.
> > + *
> > + * Return the current coarse-grained time truncated to the time
> > + * granularity supported by the fs.
> > + */
> > +static struct timespec64 current_coarse_time(struct inode *inode)
> > +{
> > +   struct timespec64 now;
> > +
> > +   ktime_get_coarse_real_ts64(&now);
> > +   return timestamp_truncate(now, inode);
> > +}
> > +
> >  /**
> >   * atime_needs_update  -   update the access time
> >   * @path: the &struct path to update
> > @@ -1952,7 +1967,7 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
> > if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
> > return false;
> >  
> > -   now = current_time(inode);
> > +   now = current_coarse_time(inode);
> >  
> > if (!relatime_need_update(mnt, inode, now))
> > return false;
> > @@ -1986,7 +2001,7 @@ void touch_atime(const struct path *path)
> >  * We may also fail on filesystems that have the ability to make parts
> >  * of the fs read only, e.g. subvolumes in Btrfs.
> >  */
> > -   now = current_time(inode);
> > +   now = current_coarse_time(inode);
> > inode_update_time(inode, &now, S_ATIME);
> > __mnt_drop_write(mnt);
> >  skip_update:
> 
> There are also calls in fs/smb/client/file.c:cifs_readpage_worker() and in
> fs/ocfs2/file.c:ocfs2_update_inode_atime() that should probably use
> current_coarse_time() to avoid needless querying of fine grained
> timestamps. But see below...
> 

Technically, they already devolve to current_coarse_time, so changing
them would only let them skip the fstype flag check. I like your idea
below better anyway.

> > @@ -2072,6 +2087,56 @@ int file_remove_privs(struct file *file)

Re: [Cluster-devel] [PATCH v6 5/7] xfs: switch to multigrain timestamps

2023-08-02 Thread Jeff Layton
On Wed, 2023-08-02 at 10:48 -0700, Darrick J. Wong wrote:
> On Tue, Jul 25, 2023 at 10:58:18AM -0400, Jeff Layton wrote:
> > Enable multigrain timestamps, which should ensure that there is an
> > apparent change to the timestamp whenever it has been written after
> > being actively observed via getattr.
> > 
> > Also, anytime the mtime changes, the ctime must also change, and those
> > are now the only two options for xfs_trans_ichgtime. Have that function
> > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > always set.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
> >  fs/xfs/xfs_iops.c   | 4 ++--
> >  fs/xfs/xfs_super.c  | 2 +-
> >  3 files changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > index 6b2296ff248a..ad22656376d3 100644
> > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
> > ASSERT(tp);
> > ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> >  
> > -   tv = current_time(inode);
> > +   /* If the mtime changes, then ctime must also change */
> > +   ASSERT(flags & XFS_ICHGTIME_CHG);
> >  
> > +   tv = inode_set_ctime_current(inode);
> > if (flags & XFS_ICHGTIME_MOD)
> > inode->i_mtime = tv;
> > -   if (flags & XFS_ICHGTIME_CHG)
> > -   inode_set_ctime_to_ts(inode, tv);
> > if (flags & XFS_ICHGTIME_CREATE)
> > ip->i_crtime = tv;
> >  }
> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index 3a9363953ef2..3f89ef5a2820 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -573,10 +573,10 @@ xfs_vn_getattr(
> > stat->gid = vfsgid_into_kgid(vfsgid);
> > stat->ino = ip->i_ino;
> > stat->atime = inode->i_atime;
> > -   stat->mtime = inode->i_mtime;
> > -   stat->ctime = inode_get_ctime(inode);
> > stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
> >  
> > +   fill_mg_cmtime(request_mask, inode, stat);
> 
> Huh.  I would've thought @stat would come first since that's what we're
> acting upon, but ... eh. :)
> 
> If everyone else is ok with the fill_mg_cmtime signature,
> Acked-by: Darrick J. Wong 
> 
> 

Good point. We can change the signature. I think xfs is the only caller
outside of the generic vfs right now, and it'd be best to do it now.

Christian, would you prefer that I send an updated series, or patches on
top of vfs.ctime that can be folded in?
 
-- 
Jeff Layton 



[Cluster-devel] [PATCH v6 5/7] xfs: switch to multigrain timestamps

2023-07-25 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
always set.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
 fs/xfs/xfs_iops.c   | 4 ++--
 fs/xfs/xfs_super.c  | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 6b2296ff248a..ad22656376d3 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   ASSERT(flags & XFS_ICHGTIME_CHG);
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode->i_mtime = tv;
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 3a9363953ef2..3f89ef5a2820 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -573,10 +573,10 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode->i_atime;
-   stat->mtime = inode->i_mtime;
-   stat->ctime = inode_get_ctime(inode);
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
+   fill_mg_cmtime(request_mask, inode, stat);
+
if (xfs_has_v3inodes(mp)) {
if (request_mask & STATX_BTIME) {
stat->result_mask |= STATX_BTIME;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 818510243130..4b10edb2c972 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2009,7 +2009,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.41.0



[Cluster-devel] [PATCH v6 6/7] ext4: switch to multigrain timestamps

2023-07-25 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b54c70e1a74e..cb1ff47af156 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7279,7 +7279,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.41.0



[Cluster-devel] [PATCH v6 7/7] btrfs: convert to multigrain timestamps

2023-07-25 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 24 
 fs/btrfs/super.c |  5 +++--
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d7a9ece7a40b..b9e75c9f95ac 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1106,25 +1106,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ctime;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   if (!timespec64_equal(&inode->i_mtime, &now))
-   inode->i_mtime = now;
-
-   ctime = inode_get_ctime(inode);
-   if (!timespec64_equal(&ctime, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1156,7 +1137,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode->i_mtime = inode_set_ctime_current(inode);
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f1dd172d8d5b..8eda51b095c9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2144,7 +2144,7 @@ static struct file_system_type btrfs_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_MGTIME,
 };
 
 static struct file_system_type btrfs_root_fs_type = {
@@ -2152,7 +2152,8 @@ static struct file_system_type btrfs_root_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount_root,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.41.0



[Cluster-devel] [PATCH v6 4/7] tmpfs: add support for multigrain timestamps

2023-07-25 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 654d9a585820..b6019c905058 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4264,7 +4264,7 @@ static struct file_system_type shmem_fs_type = {
 #endif
.kill_sb= kill_litter_super,
 #ifdef CONFIG_SHMEM
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 #else
.fs_flags   = FS_USERNS_MOUNT,
 #endif

-- 
2.41.0



[Cluster-devel] [PATCH v6 3/7] tmpfs: bump the mtime/ctime/iversion when page becomes writeable

2023-07-25 Thread Jeff Layton
Most filesystems that use the pagecache will update the mtime, ctime,
and change attribute when a page becomes writeable. Add a page_mkwrite
operation for tmpfs and just use it to bump the mtime, ctime and change
attribute.

This fixes xfstest generic/080 on tmpfs.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index b154af49d2df..654d9a585820 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2169,6 +2169,16 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
return ret;
 }
 
+static vm_fault_t shmem_page_mkwrite(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct inode *inode = file_inode(vma->vm_file);
+
+   file_update_time(vma->vm_file);
+   inode_inc_iversion(inode);
+   return 0;
+}
+
 unsigned long shmem_get_unmapped_area(struct file *file,
  unsigned long uaddr, unsigned long len,
  unsigned long pgoff, unsigned long flags)
@@ -4210,6 +4220,7 @@ static const struct super_operations shmem_ops = {
 
 static const struct vm_operations_struct shmem_vm_ops = {
.fault  = shmem_fault,
+   .page_mkwrite   = shmem_page_mkwrite,
.map_pages  = filemap_map_pages,
 #ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,
@@ -4219,6 +4230,7 @@ static const struct vm_operations_struct shmem_vm_ops = {
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
.fault  = shmem_fault,
+   .page_mkwrite   = shmem_page_mkwrite,
.map_pages  = filemap_map_pages,
 #ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,

-- 
2.41.0



[Cluster-devel] [PATCH v6 2/7] fs: add infrastructure for multigrain timestamps

2023-07-25 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried.

POSIX generally mandates that when the mtime changes, the ctime must
also change. The kernel always stores normalized ctime values, so only
the first 30 bits of the tv_nsec field are ever used.

Use the 31st bit of the ctime tv_nsec field to indicate that something
has queried the inode for the mtime or ctime. When this flag is set,
on the next mtime or ctime update, the kernel will fetch a fine-grained
timestamp instead of the usual coarse-grained one.
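
To make the bit arithmetic above concrete, here is a minimal userspace
sketch, not the kernel code. It assumes a 32-bit stand-in for tv_nsec, and
the DEMO_* names and helpers are hypothetical. Only the idea comes from the
description above: a normalized nanosecond count is below 10^9, which is
below 2^30, so bit 30 (the "31st bit") is free to carry the queried flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the flag the patch calls I_CTIME_QUERIED. */
#define DEMO_CTIME_QUERIED (UINT32_C(1) << 30)   /* the "31st bit" */
#define DEMO_NSEC_PER_SEC  UINT32_C(1000000000)

/* Every normalized nanosecond value fits below the flag bit. */
static_assert(DEMO_NSEC_PER_SEC <= DEMO_CTIME_QUERIED,
              "flag bit must lie above any normalized tv_nsec value");

/* True if nsec is a normalized nanosecond count (0 .. 999999999). */
static inline bool demo_nsec_is_normalized(uint32_t nsec)
{
    return nsec < DEMO_NSEC_PER_SEC;
}

/* Record that something looked at the timestamp. */
static inline uint32_t demo_mark_queried(uint32_t stored)
{
    return stored | DEMO_CTIME_QUERIED;
}

/* Was the timestamp looked at since it was last written? */
static inline bool demo_was_queried(uint32_t stored)
{
    return (stored & DEMO_CTIME_QUERIED) != 0;
}

/* Strip the flag to recover the plain nanosecond value. */
static inline uint32_t demo_nsec_value(uint32_t stored)
{
    return stored & ~DEMO_CTIME_QUERIED;
}
```

Anything that reports the timestamp to userspace would have to go through
something like demo_nsec_value() so that the flag never leaks into the
value a caller sees.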

Filesystems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.

Later patches will convert individual filesystems to use the new
infrastructure.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 98 ++
 fs/stat.c  | 41 +--
 include/linux/fs.h | 45 +++--
 3 files changed, 151 insertions(+), 33 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d4ab92233062..369621e7faf5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1919,6 +1919,21 @@ int inode_update_time(struct inode *inode, struct timespec64 *time, int flags)
 }
 EXPORT_SYMBOL(inode_update_time);
 
+/**
+ * current_coarse_time - Return FS time
+ * @inode: inode.
+ *
+ * Return the current coarse-grained time truncated to the time
+ * granularity supported by the fs.
+ */
+static struct timespec64 current_coarse_time(struct inode *inode)
+{
+   struct timespec64 now;
+
+   ktime_get_coarse_real_ts64(&now);
+   return timestamp_truncate(now, inode);
+}
+
 /**
  * atime_needs_update  -   update the access time
  * @path: the &struct path to update
@@ -1952,7 +1967,7 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
return false;
 
-   now = current_time(inode);
+   now = current_coarse_time(inode);
 
if (!relatime_need_update(mnt, inode, now))
return false;
@@ -1986,7 +2001,7 @@ void touch_atime(const struct path *path)
 * We may also fail on filesystems that have the ability to make parts
 * of the fs read only, e.g. subvolumes in Btrfs.
 */
-   now = current_time(inode);
+   now = current_coarse_time(inode);
inode_update_time(inode, &now, S_ATIME);
__mnt_drop_write(mnt);
 skip_update:
@@ -2072,6 +2087,56 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * current_mgtime - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+static struct timespec64 current_mgtime(struct inode *inode)
+{
+   struct timespec64 now;
+   atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec;
+   long nsec = atomic_long_read(pnsec);
+
+   if (nsec & I_CTIME_QUERIED) {
+   ktime_get_real_ts64(&now);
+   } else {
+   struct timespec64 ctime;
+
+   ktime_get_coarse_real_ts64(&now);
+
+   /*
+* If we've recently fetched a fine-grained timestamp
+* then the coarse-grained one may still be earlier than the
+* existing one. Just keep the existing ctime if so.
+*/
+   ctime = inode_get_ctime(inode);
+   if (timespec64_compare(&ctime, &now) > 0)
+   now = ctime;
+   }
+
+   return timestamp_truncate(now, inode);
+}
+
+/**
+ * current_time - Return timestamp suitable for ctime update
+ * @inode: inode to eventually be updated
+ *
+ * Return the current tim

[Cluster-devel] [PATCH v6 1/7] fs: pass the request_mask to generic_fillattr

2023-07-25 Thread Jeff Layton
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
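
As a rough illustration of why the request mask matters, the following
userspace sketch models a fill routine that only touches an attribute with
side effects when the caller asked for it. Everything here is hypothetical
(the DEMO_* bits, the struct layouts and demo_fillattr itself); it is not
the generic_fillattr implementation, just the shape of the idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask bits; the real STATX_* values live in kernel headers. */
#define DEMO_BASIC_STATS   UINT32_C(0x1)
#define DEMO_CHANGE_COOKIE UINT32_C(0x2)

struct demo_inode {
    uint64_t size;
    uint64_t version;          /* change counter */
    bool     version_queried;  /* side effect of reporting the counter */
};

struct demo_stat {
    uint32_t result_mask;
    uint64_t size;
    uint64_t change_cookie;
};

/*
 * Fill @stat from @inode. Basic fields are always copied; the change
 * cookie is reported (and the inode marked as queried) only on request.
 */
static void demo_fillattr(uint32_t request_mask, struct demo_inode *inode,
                          struct demo_stat *stat)
{
    assert(inode && stat);

    stat->result_mask = DEMO_BASIC_STATS;
    stat->size = inode->size;
    stat->change_cookie = 0;

    if (request_mask & DEMO_CHANGE_COOKIE) {
        inode->version_queried = true;   /* the side effect */
        stat->change_cookie = inode->version;
        stat->result_mask |= DEMO_CHANGE_COOKIE;
    }
}
```

A caller that passes only DEMO_BASIC_STATS leaves version_queried
untouched, which is the behaviour the request_mask argument is meant to
make possible.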

Signed-off-by: Jeff Layton 
---
 fs/9p/vfs_inode.c   |  4 ++--
 fs/9p/vfs_inode_dotl.c  |  4 ++--
 fs/afs/inode.c  |  2 +-
 fs/btrfs/inode.c|  2 +-
 fs/ceph/inode.c |  2 +-
 fs/coda/inode.c |  3 ++-
 fs/ecryptfs/inode.c |  5 +++--
 fs/erofs/inode.c|  2 +-
 fs/exfat/file.c |  2 +-
 fs/ext2/inode.c |  2 +-
 fs/ext4/inode.c |  2 +-
 fs/f2fs/file.c  |  2 +-
 fs/fat/file.c   |  2 +-
 fs/fuse/dir.c   |  2 +-
 fs/gfs2/inode.c |  2 +-
 fs/hfsplus/inode.c  |  2 +-
 fs/kernfs/inode.c   |  2 +-
 fs/libfs.c  |  4 ++--
 fs/minix/inode.c|  2 +-
 fs/nfs/inode.c  |  2 +-
 fs/nfs/namespace.c  |  3 ++-
 fs/ntfs3/file.c |  2 +-
 fs/ocfs2/file.c |  2 +-
 fs/orangefs/inode.c |  2 +-
 fs/proc/base.c  |  4 ++--
 fs/proc/fd.c|  2 +-
 fs/proc/generic.c   |  2 +-
 fs/proc/proc_net.c  |  2 +-
 fs/proc/proc_sysctl.c   |  2 +-
 fs/proc/root.c  |  3 ++-
 fs/smb/client/inode.c   |  2 +-
 fs/smb/server/smb2pdu.c | 22 +++---
 fs/smb/server/vfs.c |  3 ++-
 fs/stat.c   | 18 ++
 fs/sysv/itree.c |  3 ++-
 fs/ubifs/dir.c  |  2 +-
 fs/udf/symlink.c|  2 +-
 fs/vboxsf/utils.c   |  2 +-
 include/linux/fs.h  |  2 +-
 mm/shmem.c  |  2 +-
 40 files changed, 70 insertions(+), 62 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 16d85e6033a3..d24d1f20e922 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1016,7 +1016,7 @@ v9fs_vfs_getattr(struct mnt_idmap *idmap, const struct path *path,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) {
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
return 0;
} else if (v9ses->cache & CACHE_WRITEBACK) {
if (S_ISREG(inode->i_mode)) {
@@ -1037,7 +1037,7 @@ v9fs_vfs_getattr(struct mnt_idmap *idmap, const struct path *path,
return PTR_ERR(st);
 
v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb, 0);
-   generic_fillattr(&nop_mnt_idmap, d_inode(dentry), stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(dentry), stat);
 
p9stat_free(st);
kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 464ea73d1bf8..8e8d5d2a13d8 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -451,7 +451,7 @@ v9fs_vfs_getattr_dotl(struct mnt_idmap *idmap,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) {
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
return 0;
} else if (v9ses->cache) {
if (S_ISREG(inode->i_mode)) {
@@ -476,7 +476,7 @@ v9fs_vfs_getattr_dotl(struct mnt_idmap *idmap,
return PTR_ERR(st);
 
v9fs_stat2inode_dotl(st, d_inode(dentry), 0);
-   generic_fillattr(&nop_mnt_idmap, d_inode(dentry), stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(dentry), stat);
/* Change block size to what the server returned */
stat->blksize = st->st_blksize;
 
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 6b636f43f548..1c794a1896aa 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -773,7 +773,7 @@ int afs_getattr(struct mnt_idmap *idmap, const struct path *path,
 
do {
read_seqbegin_or_lock(&vnode->cb_lock, &seq);
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
if (test_bit(AFS_VNODE_SILLY_DELETED, &vnode->flags) &&
stat->nlink > 0)
stat->nlink -= 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bcccd551f547..7346059209aa 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8773,7 +8773,7 @@ static int btrfs_getattr(struct mnt_idmap *idmap,
   

[Cluster-devel] [PATCH v6 0/7] fs: implement multigrain timestamps

2023-07-25 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this coarseness has always been an issue when we're
exporting via NFSv3, which relies on timestamps to validate caches. A
lot of changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide to invalidate the cache.

Even with NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps (e.g.
backup applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. The idea is to use an unused bit in the ctime's
tv_nsec field to mark when the mtime or ctime has been queried via
getattr. Once that has been marked, the next m/ctime update will use a
fine-grained timestamp.
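
The behaviour described above can be modelled in a few lines of userspace
C. This is only a sketch under stated assumptions: the clock is passed in
as a plain nanosecond count, the coarse clock is modelled as that value
rounded down to a 4 ms tick (DEMO_TICK_NS is made up), the flag is a
separate bool rather than a bit in tv_nsec, and clearing the flag on update
is assumed. None of the demo_* names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TICK_NS INT64_C(4000000)   /* assumed 4 ms coarse clock tick */

struct demo_inode {
    int64_t ctime_ns;   /* last ctime, in ns since an arbitrary epoch */
    bool    queried;    /* set when getattr has reported the ctime */
};

/* Coarse clock: the fine clock rounded down to the most recent tick. */
static int64_t demo_coarse_ns(int64_t fine_now_ns)
{
    assert(fine_now_ns >= 0);
    return fine_now_ns - (fine_now_ns % DEMO_TICK_NS);
}

/* getattr path: report the ctime and remember that someone looked. */
static int64_t demo_getattr_ctime(struct demo_inode *inode)
{
    inode->queried = true;
    return inode->ctime_ns;
}

/* Update path: fine-grained only if the last value was observed. */
static int64_t demo_update_ctime(struct demo_inode *inode, int64_t fine_now_ns)
{
    int64_t now;

    if (inode->queried) {
        now = fine_now_ns;
    } else {
        now = demo_coarse_ns(fine_now_ns);
        /* The coarse clock may lag an earlier fine-grained stamp. */
        if (inode->ctime_ns > now)
            now = inode->ctime_ns;
    }

    inode->ctime_ns = now;
    inode->queried = false;   /* assumption: a new stamp clears the flag */
    return now;
}
```

In this model, two writes within one tick produce the same ctime unless a
getattr happens in between, in which case the second write gets a distinct
fine-grained value; unobserved files keep the cheap coarse-grained path.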

This patch series is based on top of Christian's vfs.all branch, which
has the recent conversion to the new ctime accessors. It should apply
cleanly on top of linux-next.

The first two patches should probably go in via Christian's vfs tree.
Should the fs-specific patches go in that way as well, or should they go
via maintainer trees? Either should be fine.

For now, I'd like to get these into linux-next. Christian, would you be
willing to pick these up for now? Alternately, I can feed them there via
the iversion branch that Stephen is already pulling in from my tree.

Signed-off-by: Jeff Layton 
base-commit: cf22d118b89a09a0160586412160d89098f7c4c7
---
Changes in v6:
- drop the patch that removed XFS_ICHGTIME_CHG
- change WARN_ON_ONCE to ASSERT in xfs conversion patch

---
Jeff Layton (7):
  fs: pass the request_mask to generic_fillattr
  fs: add infrastructure for multigrain timestamps
  tmpfs: bump the mtime/ctime/iversion when page becomes writeable
  tmpfs: add support for multigrain timestamps
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps

 fs/9p/vfs_inode.c   |  4 +-
 fs/9p/vfs_inode_dotl.c  |  4 +-
 fs/afs/inode.c  |  2 +-
 fs/btrfs/file.c | 24 ++
 fs/btrfs/inode.c|  2 +-
 fs/btrfs/super.c|  5 ++-
 fs/ceph/inode.c |  2 +-
 fs/coda/inode.c |  3 +-
 fs/ecryptfs/inode.c |  5 ++-
 fs/erofs/inode.c|  2 +-
 fs/exfat/file.c |  2 +-
 fs/ext2/inode.c |  2 +-
 fs/ext4/inode.c |  2 +-
 fs/ext4/super.c |  2 +-
 fs/f2fs/file.c  |  2 +-
 fs/fat/file.c   |  2 +-
 fs/fuse/dir.c   |  2 +-
 fs/gfs2/inode.c |  2 +-
 fs/hfsplus/inode.c  |  2 +-
 fs/inode.c  | 98 +
 fs/kernfs/inode.c   |  2 +-
 fs/libfs.c  |  4 +-
 fs/minix/inode.c|  2 +-
 fs/nfs/inode.c  |  2 +-
 fs/nfs/namespace.c  |  3 +-
 fs/ntfs3/file.c |  2 +-
 fs/ocfs2/file.c |  2 +-
 fs/orangefs/inode.c |  2 +-
 fs/proc/base.c  |  4 +-
 fs/proc/fd.c|  2 +-
 fs/proc/generic.c   |  2 +-
 fs/proc/proc_net.c  |  2 +-
 fs/proc/proc_sysctl.c   |  2 +-
 fs/proc/root.c  |  3 +-
 fs/smb/client/inode.c   |  2 +-
 fs/smb/server/smb2pdu.c | 22 -
 fs/smb/server/vfs.c |  3 +-
 fs/stat.c   | 59 -
 fs/sysv/itree.c |  3 +-
 fs/ubifs/dir.c  |  2 +-
 fs/udf/symlink.c|  2 +-
 fs/vboxsf/utils.c   |  2 +-
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +--
 fs/xfs/xfs_iops.c   |  4 +-
 fs/xfs/xfs_super.c  |  2 +-
 include/linux/fs.h  | 47 ++--
 mm/shmem.c  | 16 ++-
 47 files changed, 248 insertions(+), 125 deletions(-)
---
base-commit: 810b5fff7917119ea82ff96e312e2d4350d6b681
change-id: 20230713-mgctime-f2a9fc324918

Best regards,
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v2 35/47] nfsd: dynamically allocate the nfsd-reply shrinker

2023-07-24 Thread Jeff Layton
On Mon, 2023-07-24 at 17:43 +0800, Qi Zheng wrote:
> In preparation for implementing lockless slab shrink, use new APIs to
> dynamically allocate the nfsd-reply shrinker, so that it can be freed
> asynchronously using kfree_rcu(). Then it doesn't need to wait for RCU
> read-side critical section when releasing the struct nfsd_net.
> 
> Acked-by: Chuck Lever 
> Signed-off-by: Qi Zheng 
> ---
>  fs/nfsd/netns.h|  2 +-
>  fs/nfsd/nfscache.c | 31 ---
>  2 files changed, 17 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index f669444d5336..ab303a8b77d5 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -177,7 +177,7 @@ struct nfsd_net {
>   /* size of cache when we saw the longest hash chain */
>   unsigned int longest_chain_cachesize;
>  
> - struct shrinker nfsd_reply_cache_shrinker;
> + struct shrinker *nfsd_reply_cache_shrinker;
>  
>   /* tracking server-to-server copy mounts */
>   spinlock_t  nfsd_ssc_lock;
> diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c
> index 6eb3d7bdfaf3..9f0ab65e4125 100644
> --- a/fs/nfsd/nfscache.c
> +++ b/fs/nfsd/nfscache.c
> @@ -200,26 +200,29 @@ int nfsd_reply_cache_init(struct nfsd_net *nn)
>  {
>   unsigned int hashsize;
>   unsigned int i;
> - int status = 0;
>  
>   nn->max_drc_entries = nfsd_cache_size_limit();
>   atomic_set(&nn->num_drc_entries, 0);
>   hashsize = nfsd_hashsize(nn->max_drc_entries);
>   nn->maskbits = ilog2(hashsize);
>  
> - nn->nfsd_reply_cache_shrinker.scan_objects = nfsd_reply_cache_scan;
> - nn->nfsd_reply_cache_shrinker.count_objects = nfsd_reply_cache_count;
> - nn->nfsd_reply_cache_shrinker.seeks = 1;
> - status = register_shrinker(&nn->nfsd_reply_cache_shrinker,
> -"nfsd-reply:%s", nn->nfsd_name);
> - if (status)
> - return status;
> -
>   nn->drc_hashtbl = kvzalloc(array_size(hashsize,
>   sizeof(*nn->drc_hashtbl)), GFP_KERNEL);
>   if (!nn->drc_hashtbl)
> + return -ENOMEM;
> +
> + nn->nfsd_reply_cache_shrinker = shrinker_alloc(0, "nfsd-reply:%s",
> +nn->nfsd_name);
> + if (!nn->nfsd_reply_cache_shrinker)
>   goto out_shrinker;
>  
> + nn->nfsd_reply_cache_shrinker->scan_objects = nfsd_reply_cache_scan;
> + nn->nfsd_reply_cache_shrinker->count_objects = nfsd_reply_cache_count;
> + nn->nfsd_reply_cache_shrinker->seeks = 1;
> + nn->nfsd_reply_cache_shrinker->private_data = nn;
> +
> + shrinker_register(nn->nfsd_reply_cache_shrinker);
> +
>   for (i = 0; i < hashsize; i++) {
>   INIT_LIST_HEAD(&nn->drc_hashtbl[i].lru_head);
>   spin_lock_init(&nn->drc_hashtbl[i].cache_lock);
> @@ -228,7 +231,7 @@ int nfsd_reply_cache_init(struct nfsd_net *nn)
>  
>   return 0;
>  out_shrinker:
> - unregister_shrinker(&nn->nfsd_reply_cache_shrinker);
> + kvfree(nn->drc_hashtbl);
>   printk(KERN_ERR "nfsd: failed to allocate reply cache\n");
>   return -ENOMEM;
>  }
> @@ -238,7 +241,7 @@ void nfsd_reply_cache_shutdown(struct nfsd_net *nn)
>   struct nfsd_cacherep *rp;
>   unsigned int i;
>  
> - unregister_shrinker(&nn->nfsd_reply_cache_shrinker);
> + shrinker_unregister(nn->nfsd_reply_cache_shrinker);
>  
>   for (i = 0; i < nn->drc_hashsize; i++) {
>   struct list_head *head = &nn->drc_hashtbl[i].lru_head;
> @@ -322,8 +325,7 @@ nfsd_prune_bucket_locked(struct nfsd_net *nn, struct nfsd_drc_bucket *b,
>  static unsigned long
>  nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
> - struct nfsd_net *nn = container_of(shrink,
> - struct nfsd_net, nfsd_reply_cache_shrinker);
> + struct nfsd_net *nn = shrink->private_data;
>  
>   return atomic_read(&nn->num_drc_entries);
>  }
> @@ -342,8 +344,7 @@ nfsd_reply_cache_count(struct shrinker *shrink, struct shrink_control *sc)
>  static unsigned long
>  nfsd_reply_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
>  {
> - struct nfsd_net *nn = container_of(shrink,
> - struct nfsd_net, nfsd_reply_cache_shrinker);
> + struct nfsd_net *nn = shrink->private_data;
>   unsigned long freed = 0;
>   LIST_HEAD(dispose);
>   unsigned int i;

Acked-by: Jeff Layton 



Re: [Cluster-devel] [PATCH v2 34/47] nfsd: dynamically allocate the nfsd-client shrinker

2023-07-24 Thread Jeff Layton
On Mon, 2023-07-24 at 17:43 +0800, Qi Zheng wrote:
> In preparation for implementing lockless slab shrink, use new APIs to
> dynamically allocate the nfsd-client shrinker, so that it can be freed
> asynchronously using kfree_rcu(). Then it doesn't need to wait for RCU
> read-side critical section when releasing the struct nfsd_net.
> 
> Acked-by: Chuck Lever 
> Signed-off-by: Qi Zheng 
> ---
>  fs/nfsd/netns.h |  2 +-
>  fs/nfsd/nfs4state.c | 20 
>  2 files changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index ec49b200b797..f669444d5336 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -195,7 +195,7 @@ struct nfsd_net {
>   int nfs4_max_clients;
>  
>   atomic_t nfsd_courtesy_clients;
> - struct shrinker nfsd_client_shrinker;
> + struct shrinker *nfsd_client_shrinker;
>   struct work_struct  nfsd_shrinker_work;
>  };
>  
> diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
> index 3339177f8e2f..c7a4616cd866 100644
> --- a/fs/nfsd/nfs4state.c
> +++ b/fs/nfsd/nfs4state.c
> @@ -4388,8 +4388,7 @@ static unsigned long
>  nfsd4_state_shrinker_count(struct shrinker *shrink, struct shrink_control *sc)
>  {
>   int count;
> - struct nfsd_net *nn = container_of(shrink,
> - struct nfsd_net, nfsd_client_shrinker);
> + struct nfsd_net *nn = shrink->private_data;
>  
>   count = atomic_read(&nn->nfsd_courtesy_clients);
>   if (!count)
> @@ -8125,12 +8124,17 @@ static int nfs4_state_create_net(struct net *net)
>   INIT_WORK(&nn->nfsd_shrinker_work, nfsd4_state_shrinker_worker);
>   get_net(net);
>  
> - nn->nfsd_client_shrinker.scan_objects = nfsd4_state_shrinker_scan;
> - nn->nfsd_client_shrinker.count_objects = nfsd4_state_shrinker_count;
> - nn->nfsd_client_shrinker.seeks = DEFAULT_SEEKS;
> -
> - if (register_shrinker(&nn->nfsd_client_shrinker, "nfsd-client"))
> + nn->nfsd_client_shrinker = shrinker_alloc(0, "nfsd-client");
> + if (!nn->nfsd_client_shrinker)
>   goto err_shrinker;
> +
> + nn->nfsd_client_shrinker->scan_objects = nfsd4_state_shrinker_scan;
> + nn->nfsd_client_shrinker->count_objects = nfsd4_state_shrinker_count;
> + nn->nfsd_client_shrinker->seeks = DEFAULT_SEEKS;
> + nn->nfsd_client_shrinker->private_data = nn;
> +
> + shrinker_register(nn->nfsd_client_shrinker);
> +
>   return 0;
>  
>  err_shrinker:
> @@ -8228,7 +8232,7 @@ nfs4_state_shutdown_net(struct net *net)
>   struct list_head *pos, *next, reaplist;
>   struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>  
> - unregister_shrinker(&nn->nfsd_client_shrinker);
> + shrinker_unregister(nn->nfsd_client_shrinker);
>   cancel_work(&nn->nfsd_shrinker_work);
>   cancel_delayed_work_sync(&nn->laundromat_work);
>   locks_end_grace(&nn->nfsd4_manager);

Acked-by: Jeff Layton 



Re: [Cluster-devel] [RFC v6.5-rc2 3/3] fs: lockd: introduce safe async lock op

2023-07-21 Thread Jeff Layton
locate_block(lock_sop, &fp->fi_fhandle, nn);
> diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
> index 11fbd0ee1370..da742abbaf3e 100644
> --- a/include/linux/exportfs.h
> +++ b/include/linux/exportfs.h
> @@ -224,6 +224,7 @@ struct export_operations {
> atomic attribute updates
>   */
>  #define EXPORT_OP_FLUSH_ON_CLOSE (0x20) /* fs flushes file data on close */
> +#define EXPORT_OP_SAFE_ASYNC_LOCK (0x40) /* fs can do async lock request */
>   unsigned long   flags;
>  };
>  

-- 
Jeff Layton 



Re: [Cluster-devel] [RFC v6.5-rc2 2/3] fs: lockd: fix race in async lock request handling

2023-07-21 Thread Jeff Layton
n progress
> > @@ -572,17 +586,16 @@ nlmsvc_lock(struct svc_rqst *rqstp, struct nlm_file *file,
> > ret = nlmsvc_defer_lock_rqst(rqstp, block);
> > goto out;
> > case -EDEADLK:
> > +   nlmsvc_remove_block(block);
> > ret = nlm_deadlock;
> > goto out;
> > default:/* includes ENOLCK */
> > +   nlmsvc_remove_block(block);
> > ret = nlm_lck_denied_nolocks;
> > goto out;
> > }
> > 
> > ret = nlm_lck_blocked;
> > -
> > -   /* Append to list of blocked */
> > -   nlmsvc_insert_block(block, NLM_NEVER);
> >  out:
> > mutex_unlock(&file->f_mutex);
> > nlmsvc_release_block(block);
> > @@ -739,34 +752,59 @@ nlmsvc_update_deferred_block(struct nlm_block *block, int result)
> > block->b_flags |= B_TIMED_OUT;
> >  }
> 
> - Alex
> 

-- 
Jeff Layton 



Re: [Cluster-devel] [RFC v6.5-rc2 2/3] fs: lockd: fix race in async lock request handling

2023-07-21 Thread Jeff Layton
if (block->b_flags & B_TIMED_OUT) {
> + rc = -ENOLCK;
> + goto out;
> + }
> + nlmsvc_update_deferred_block(block, result);
> + } else if (result == 0)
> + block->b_granted = 1;
> +
> + nlmsvc_insert_block_locked(block, 0);
> + svc_wake_up(block->b_daemon);
> +out:
> + return rc;
> +}
> +
>  static int nlmsvc_grant_deferred(struct file_lock *fl, int result)
>  {
> - struct nlm_block *block;
> - int rc = -ENOENT;
> + struct nlm_block *block = NULL;
> + int rc;
>  
>   spin_lock(&nlm_blocked_lock);
>   list_for_each_entry(block, &nlm_blocked, b_list) {
>   if (nlm_compare_locks(&block->b_call->a_args.lock.fl, fl)) {
> - dprintk("lockd: nlmsvc_notify_blocked block %p flags %d\n",
> - block, block->b_flags);
> - if (block->b_flags & B_QUEUED) {
> - if (block->b_flags & B_TIMED_OUT) {
> - rc = -ENOLCK;
> - break;
> - }
> - nlmsvc_update_deferred_block(block, result);
> - } else if (result == 0)
> - block->b_granted = 1;
> -
> - nlmsvc_insert_block_locked(block, 0);
> - svc_wake_up(block->b_daemon);
> - rc = 0;
> + kref_get(&block->b_count);
>   break;
>   }
>   }
>   spin_unlock(&nlm_blocked_lock);
> - if (rc == -ENOENT)
> - printk(KERN_WARNING "lockd: grant for unknown block\n");
> +
> + if (!block) {
> + pr_warn("lockd: grant for unknown pending block\n");
> + return -ENOENT;
> + }
> +
> + /* don't interfere with nlmsvc_lock() */
> + mutex_lock(&block->b_file->f_mutex);


This is called from lm_grant, and Documentation/filesystems/locking.rst
says that lm_grant is not allowed to block. The only caller though is
dlm_plock_callback, and I don't see anything that would prevent
blocking.

Do we need to fix the documentation there?


> + block->b_flags &= ~B_PENDING_CALLBACK;
> +

You're adding this new flag when the lock is deferred and then clearing
it when the lock is granted. What about when the lock request is
cancelled (e.g. by signal)? It seems like you also need to clear it then
too, correct?

> + spin_lock(&nlm_blocked_lock);
> + WARN_ON_ONCE(list_empty(&block->b_list));
> + rc = __nlmsvc_grant_deferred(block, fl, result);
> + spin_unlock(&nlm_blocked_lock);
> + mutex_unlock(&block->b_file->f_mutex);
> +
> + nlmsvc_release_block(block);
>   return rc;
>  }
>  
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index f42594a9efe0..a977be8bcc2c 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -189,6 +189,7 @@ struct nlm_block {
>  #define B_QUEUED 1   /* lock queued */
>  #define B_GOT_CALLBACK   2   /* got lock or conflicting lock */
>  #define B_TIMED_OUT  4   /* filesystem too slow to respond */
> +#define B_PENDING_CALLBACK   8   /* pending callback for lock request */
>  };
>  
>  /*

-- 
Jeff Layton 



Re: [Cluster-devel] [RFC v6.5-rc2 1/3] fs: lockd: nlm_blocked list race fixes

2023-07-21 Thread Jeff Layton
On Thu, 2023-07-20 at 08:58 -0400, Alexander Aring wrote:
> This patch fixes races when lockd accesses the global nlm_blocked list.
> It was mostly safe to access the list because everything was accessed
> from the lockd kernel thread context but there exist cases like
> nlmsvc_grant_deferred() that could manipulate the nlm_blocked list and
> it can be called from any context.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Alexander Aring 
> ---
>  fs/lockd/svclock.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index c43ccdf28ed9..28abec5c451d 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -131,12 +131,14 @@ static void nlmsvc_insert_block(struct nlm_block *block, unsigned long when)
>  static inline void
>  nlmsvc_remove_block(struct nlm_block *block)
>  {
> + spin_lock(&nlm_blocked_lock);
>   if (!list_empty(&block->b_list)) {
> - spin_lock(&nlm_blocked_lock);
>   list_del_init(&block->b_list);
>   spin_unlock(&nlm_blocked_lock);
>   nlmsvc_release_block(block);
> + return;
>   }
> + spin_unlock(&nlm_blocked_lock);
>  }
>  
>  /*
> @@ -152,6 +154,7 @@ nlmsvc_lookup_block(struct nlm_file *file, struct nlm_lock *lock)
>   file, lock->fl.fl_pid,
>   (long long)lock->fl.fl_start,
>   (long long)lock->fl.fl_end, lock->fl.fl_type);
> + spin_lock(&nlm_blocked_lock);
>   list_for_each_entry(block, &nlm_blocked, b_list) {
>   fl = &block->b_call->a_args.lock.fl;
>   dprintk("lockd: check f=%p pd=%d %Ld-%Ld ty=%d cookie=%s\n",
> @@ -161,9 +164,11 @@ nlmsvc_lookup_block(struct nlm_file *file, struct nlm_lock *lock)
>   nlmdbg_cookie2a(&block->b_call->a_args.cookie));
>   if (block->b_file == file && nlm_compare_locks(fl, &lock->fl)) {
>   kref_get(&block->b_count);
> + spin_unlock(&nlm_blocked_lock);
>   return block;
>   }
>   }
> + spin_unlock(&nlm_blocked_lock);
>  
>   return NULL;
>  }
> @@ -185,16 +190,19 @@ nlmsvc_find_block(struct nlm_cookie *cookie)
>  {
>   struct nlm_block *block;
>  
> + spin_lock(&nlm_blocked_lock);
>   list_for_each_entry(block, &nlm_blocked, b_list) {
>   if (nlm_cookie_match(&block->b_call->a_args.cookie,cookie))
>   goto found;
>   }
> + spin_unlock(&nlm_blocked_lock);
>  
>   return NULL;
>  
>  found:
>   dprintk("nlmsvc_find_block(%s): block=%p\n", nlmdbg_cookie2a(cookie), block);
>   kref_get(&block->b_count);
> + spin_unlock(&nlm_blocked_lock);
>   return block;
>  }
>  
> @@ -317,6 +325,7 @@ void nlmsvc_traverse_blocks(struct nlm_host *host,
>  
>  restart:
>   mutex_lock(&file->f_mutex);
> + spin_lock(&nlm_blocked_lock);
>   list_for_each_entry_safe(block, next, &file->f_blocks, b_flist) {
>   if (!match(block->b_host, host))
>   continue;
> @@ -325,11 +334,13 @@ void nlmsvc_traverse_blocks(struct nlm_host *host,
>   if (list_empty(&block->b_list))
>   continue;
>   kref_get(&block->b_count);
> + spin_unlock(&nlm_blocked_lock);
>   mutex_unlock(&file->f_mutex);
>   nlmsvc_unlink_block(block);
>   nlmsvc_release_block(block);
>   goto restart;
>   }
> + spin_unlock(&nlm_blocked_lock);
>   mutex_unlock(&file->f_mutex);
>  }
>  

The patch itself looks correct. Walking these lists without holding the
lock is quite suspicious. Not sure about the stable designation here
though, unless you have a way to easily reproduce this. 
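
For illustration, the pattern being fixed in nlmsvc_remove_block() can be
reduced to a small userspace sketch. This is not the lockd code: the list
helpers, struct block, block_insert(), block_remove() and blocked_lock are
all hypothetical stand-ins, with a pthread mutex playing the role of
nlm_blocked_lock and a C11 atomic playing the role of the kref. What it
tries to show is that the emptiness check and the unlink sit under the same
lock, and that the list's reference is dropped only after the lock is
released.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *n)
{
	n->next = n;
	n->prev = n;
}

static bool list_is_empty(const struct list_node *n)
{
	return n->next == n;
}

static void list_append(struct list_node *n, struct list_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_unlink(struct list_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);	/* leave the node self-linked, like list_del_init() */
}

struct block {
	struct list_node b_list;
	atomic_int b_count;	/* plays the role of the kref */
};

static struct list_node blocked = { &blocked, &blocked };
static pthread_mutex_t blocked_lock = PTHREAD_MUTEX_INITIALIZER;

/* Put the block on the global list; the list holds one reference. */
static void block_insert(struct block *b)
{
	pthread_mutex_lock(&blocked_lock);
	if (list_is_empty(&b->b_list)) {
		atomic_fetch_add(&b->b_count, 1);
		list_append(&b->b_list, &blocked);
	}
	pthread_mutex_unlock(&blocked_lock);
}

/*
 * Check and unlink under the same lock, so two racing callers cannot
 * both see a non-empty node. Returns true if this caller unlinked it.
 */
static bool block_remove(struct block *b)
{
	bool removed = false;

	pthread_mutex_lock(&blocked_lock);
	if (!list_is_empty(&b->b_list)) {
		list_unlink(&b->b_list);
		removed = true;
	}
	pthread_mutex_unlock(&blocked_lock);

	if (removed) {
		/* Drop the list's reference outside the lock. */
		int old = atomic_fetch_sub(&b->b_count, 1);
		assert(old > 0);
	}
	return removed;
}
```

With the check outside the lock (as in the pre-patch code), two callers
could both pass the emptiness test before either unlinks, and the reference
would be dropped twice. In the sketch, a second block_remove() on the same
block simply returns false.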

Reviewed-by: Jeff Layton 



[Cluster-devel] [PATCH v5 8/8] btrfs: convert to multigrain timestamps

2023-07-13 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Beyond enabling the FS_MGTIME flag, this patch eliminates
update_time_for_write, which goes to great pains to avoid in-memory
stores. Just have it overwrite the timestamps unconditionally.

Signed-off-by: Jeff Layton 
---
 fs/btrfs/file.c  | 24 
 fs/btrfs/super.c |  5 +++--
 2 files changed, 7 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d7a9ece7a40b..b9e75c9f95ac 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1106,25 +1106,6 @@ void btrfs_check_nocow_unlock(struct btrfs_inode *inode)
btrfs_drew_write_unlock(&inode->root->snapshot_lock);
 }
 
-static void update_time_for_write(struct inode *inode)
-{
-   struct timespec64 now, ctime;
-
-   if (IS_NOCMTIME(inode))
-   return;
-
-   now = current_time(inode);
-   if (!timespec64_equal(&inode->i_mtime, &now))
-   inode->i_mtime = now;
-
-   ctime = inode_get_ctime(inode);
-   if (!timespec64_equal(&ctime, &now))
-   inode_set_ctime_to_ts(inode, now);
-
-   if (IS_I_VERSION(inode))
-   inode_inc_iversion(inode);
-}
-
 static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 size_t count)
 {
@@ -1156,7 +1137,10 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
 * need to start yet another transaction to update the inode as we will
 * update the inode when we finish writing whatever data we write.
 */
-   update_time_for_write(inode);
+   if (!IS_NOCMTIME(inode)) {
+   inode->i_mtime = inode_set_ctime_current(inode);
+   inode_inc_iversion(inode);
+   }
 
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f1dd172d8d5b..8eda51b095c9 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2144,7 +2144,7 @@ static struct file_system_type btrfs_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_MGTIME,
 };
 
 static struct file_system_type btrfs_root_fs_type = {
@@ -2152,7 +2152,8 @@ static struct file_system_type btrfs_root_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount_root,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA |
+ FS_ALLOW_IDMAP | FS_MGTIME,
 };
 
 MODULE_ALIAS_FS("btrfs");

-- 
2.41.0



[Cluster-devel] [PATCH v5 6/8] xfs: switch to multigrain timestamps

2023-07-13 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

Also, anytime the mtime changes, the ctime must also change, and those
are now the only two options for xfs_trans_ichgtime. Have that function
unconditionally bump the ctime, and warn if XFS_ICHGTIME_CHG is ever not
set.
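
As a rough userspace sketch of that rule (not the XFS code; struct
sk_times, sk_ichgtime() and the SK_ICHGTIME_* values are hypothetical
names made up for this example), the ctime is written on every call, the
mtime only when asked for, and a counter stands in for WARN_ON_ONCE:

```c
#include <assert.h>

#define SK_ICHGTIME_MOD 0x1	/* hypothetical: update mtime */
#define SK_ICHGTIME_CHG 0x2	/* hypothetical: update ctime */

struct sk_times {
	long mtime;
	long ctime;
	unsigned int warnings;	/* stands in for WARN_ON_ONCE firing */
};

static void sk_ichgtime(struct sk_times *t, int flags, long now)
{
	/* A caller that omits the CHG flag is a bug; note it, but carry on. */
	if (!(flags & SK_ICHGTIME_CHG))
		t->warnings++;

	/* The ctime is bumped unconditionally. */
	t->ctime = now;
	if (flags & SK_ICHGTIME_MOD)
		t->mtime = now;
}
```

The design choice is that an mtime update can no longer leave the ctime
behind, whatever flags the caller passes.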

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_trans_inode.c | 6 +++---
 fs/xfs/xfs_iops.c   | 4 ++--
 fs/xfs/xfs_super.c  | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 0c9df8df6d4a..86f5ffce2d89 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -62,12 +62,12 @@ xfs_trans_ichgtime(
ASSERT(tp);
ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 
-   tv = current_time(inode);
+   /* If the mtime changes, then ctime must also change */
+   WARN_ON_ONCE(!(flags & XFS_ICHGTIME_CHG));
 
+   tv = inode_set_ctime_current(inode);
if (flags & XFS_ICHGTIME_MOD)
inode->i_mtime = tv;
-   if (flags & XFS_ICHGTIME_CHG)
-   inode_set_ctime_to_ts(inode, tv);
 }
 
 /*
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 3a9363953ef2..3f89ef5a2820 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -573,10 +573,10 @@ xfs_vn_getattr(
stat->gid = vfsgid_into_kgid(vfsgid);
stat->ino = ip->i_ino;
stat->atime = inode->i_atime;
-   stat->mtime = inode->i_mtime;
-   stat->ctime = inode_get_ctime(inode);
stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
 
+   fill_mg_cmtime(request_mask, inode, stat);
+
if (xfs_has_v3inodes(mp)) {
if (request_mask & STATX_BTIME) {
stat->result_mask |= STATX_BTIME;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 818510243130..4b10edb2c972 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2009,7 +2009,7 @@ static struct file_system_type xfs_fs_type = {
.init_fs_context= xfs_init_fs_context,
.parameters = xfs_fs_parameters,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("xfs");
 

-- 
2.41.0



[Cluster-devel] [PATCH v5 4/8] tmpfs: add support for multigrain timestamps

2023-07-13 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

tmpfs only requires the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 654d9a585820..b6019c905058 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4264,7 +4264,7 @@ static struct file_system_type shmem_fs_type = {
 #endif
.kill_sb= kill_litter_super,
 #ifdef CONFIG_SHMEM
-   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_USERNS_MOUNT | FS_ALLOW_IDMAP | FS_MGTIME,
 #else
.fs_flags   = FS_USERNS_MOUNT,
 #endif

-- 
2.41.0



[Cluster-devel] [PATCH v5 2/8] fs: add infrastructure for multigrain timestamps

2023-07-13 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the ctime
and mtime after a change. This has the benefit of allowing filesystems
to optimize away a lot of metadata updates, down to around 1 per jiffy,
even when a file is under heavy writes.

Unfortunately, this has always been an issue when we're exporting via
NFSv3, which relies on timestamps to validate caches. A lot of changes
can happen in a jiffy, so timestamps aren't sufficient to help the
client decide to invalidate the cache. Even with NFSv4, a lot of
exported filesystems don't properly support a change attribute and are
subject to the same problems with timestamp granularity. Other
applications have similar issues with timestamps (e.g. backup
applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried.

POSIX generally mandates that when the mtime changes, the ctime must
also change. The kernel always stores normalized ctime values, so only
the first 30 bits of the tv_nsec field are ever used.

Use the 31st bit of the ctime tv_nsec field to indicate that something
has queried the inode for the mtime or ctime. When this flag is set,
on the next mtime or ctime update, the kernel will fetch a fine-grained
timestamp instead of the usual coarse-grained one.
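
To make the flag mechanics concrete, here is a minimal userspace sketch.
It is not the kernel implementation: struct mg_inode, struct mg_ts,
MG_QUERIED and the two helpers are hypothetical names, and the coarse and
fine clock readings are passed in by the caller rather than read from a
clock. Normalized nanoseconds are below 10^9, which is below 2^30, so the
bit with value 1 << 30 is free to carry the "queried" flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC	1000000000L
#define MG_QUERIED	(1L << 30)	/* hypothetical flag bit, above any valid nsec */

struct mg_ts {
	int64_t sec;
	long nsec;
};

struct mg_inode {
	int64_t ctime_sec;
	long ctime_nsec;	/* nanoseconds, plus MG_QUERIED when observed */
};

/* getattr side: report the ctime without the flag, then mark it queried. */
static struct mg_ts mg_query_ctime(struct mg_inode *inode)
{
	struct mg_ts ts = { inode->ctime_sec, inode->ctime_nsec & ~MG_QUERIED };

	inode->ctime_nsec |= MG_QUERIED;
	return ts;
}

/* update side: pick the fine reading only if someone looked since last time. */
static struct mg_ts mg_update_ctime(struct mg_inode *inode,
				    struct mg_ts coarse, struct mg_ts fine)
{
	bool queried = inode->ctime_nsec & MG_QUERIED;
	struct mg_ts now;

	if (queried) {
		now = fine;
	} else {
		struct mg_ts cur = { inode->ctime_sec,
				     inode->ctime_nsec & ~MG_QUERIED };

		now = coarse;
		/* The coarse clock may lag an earlier fine stamp; never go back. */
		if (cur.sec > now.sec ||
		    (cur.sec == now.sec && cur.nsec > now.nsec))
			now = cur;
	}

	assert(now.nsec >= 0 && now.nsec < NSEC_PER_SEC);
	inode->ctime_sec = now.sec;
	inode->ctime_nsec = now.nsec;	/* a normalized value clears the flag */
	return now;
}
```

In this sketch an inode that nobody stats keeps getting the cheap coarse
value, while an update that follows a query gets the fine-grained one and
clears the flag again.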

Filesystems can opt into this behavior by setting the FS_MGTIME flag in
the fstype. Filesystems that don't set this flag will continue to use
coarse-grained timestamps.

Later patches will convert individual filesystems to use the new
infrastructure.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 98 ++
 fs/stat.c  | 41 +--
 include/linux/fs.h | 45 +++--
 3 files changed, 151 insertions(+), 33 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index d4ab92233062..369621e7faf5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1919,6 +1919,21 @@ int inode_update_time(struct inode *inode, struct timespec64 *time, int flags)
 }
 EXPORT_SYMBOL(inode_update_time);
 
+/**
+ * current_coarse_time - Return FS time
+ * @inode: inode.
+ *
+ * Return the current coarse-grained time truncated to the time
+ * granularity supported by the fs.
+ */
+static struct timespec64 current_coarse_time(struct inode *inode)
+{
+   struct timespec64 now;
+
+   ktime_get_coarse_real_ts64(&now);
+   return timestamp_truncate(now, inode);
+}
+
 /**
  * atime_needs_update  -   update the access time
  * @path: the &struct path to update
@@ -1952,7 +1967,7 @@ bool atime_needs_update(const struct path *path, struct inode *inode)
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
return false;
 
-   now = current_time(inode);
+   now = current_coarse_time(inode);
 
if (!relatime_need_update(mnt, inode, now))
return false;
@@ -1986,7 +2001,7 @@ void touch_atime(const struct path *path)
 * We may also fail on filesystems that have the ability to make parts
 * of the fs read only, e.g. subvolumes in Btrfs.
 */
-   now = current_time(inode);
+   now = current_coarse_time(inode);
inode_update_time(inode, &now, S_ATIME);
__mnt_drop_write(mnt);
 skip_update:
@@ -2072,6 +2087,56 @@ int file_remove_privs(struct file *file)
 }
 EXPORT_SYMBOL(file_remove_privs);
 
+/**
+ * current_mgtime - Return FS time (possibly fine-grained)
+ * @inode: inode.
+ *
+ * Return the current time truncated to the time granularity supported by
+ * the fs, as suitable for a ctime/mtime change. If the ctime is flagged
+ * as having been QUERIED, get a fine-grained timestamp.
+ */
+static struct timespec64 current_mgtime(struct inode *inode)
+{
+   struct timespec64 now;
+   atomic_long_t *pnsec = (atomic_long_t *)&inode->__i_ctime.tv_nsec;
+   long nsec = atomic_long_read(pnsec);
+
+   if (nsec & I_CTIME_QUERIED) {
+   ktime_get_real_ts64(&now);
+   } else {
+   struct timespec64 ctime;
+
+   ktime_get_coarse_real_ts64(&now);
+
+   /*
+* If we've recently fetched a fine-grained timestamp
+* then the coarse-grained one may still be earlier than the
+* existing one. Just keep the existing ctime if so.
+*/
+   ctime = inode_get_ctime(inode);
+   if (timespec64_compare(&ctime, &now) > 0)
+   now = ctime;
+   }
+
+   return timestamp_truncate(now, inode);
+}
+
+/**
+ * current_time - Return timestamp suitable for ctime update
+ * @inode: inode to eventually be updated
+ *
+ * Return the current tim

[Cluster-devel] [PATCH v5 7/8] ext4: switch to multigrain timestamps

2023-07-13 Thread Jeff Layton
Enable multigrain timestamps, which should ensure that there is an
apparent change to the timestamp whenever it has been written after
being actively observed via getattr.

For ext4, we only need to enable the FS_MGTIME flag.

Signed-off-by: Jeff Layton 
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b54c70e1a74e..cb1ff47af156 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -7279,7 +7279,7 @@ static struct file_system_type ext4_fs_type = {
.init_fs_context= ext4_init_fs_context,
.parameters = ext4_param_specs,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
+   .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
 };
 MODULE_ALIAS_FS("ext4");
 

-- 
2.41.0



[Cluster-devel] [PATCH v5 1/8] fs: pass the request_mask to generic_fillattr

2023-07-13 Thread Jeff Layton
generic_fillattr just fills in the entire stat struct indiscriminately
today, copying data from the inode. There is at least one attribute
(STATX_CHANGE_COOKIE) that can have side effects when it is reported,
and we're looking at adding more with the addition of multigrain
timestamps.

Add a request_mask argument to generic_fillattr and have most callers
just pass in the value that is passed to getattr. Have other callers
(e.g. ksmbd) just pass in STATX_BASIC_STATS. Also move the setting of
STATX_CHANGE_COOKIE into generic_fillattr.
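
A small userspace sketch of why the mask matters (hypothetical names
throughout: sk_fillattr(), struct sk_inode, struct sk_stat and the SK_*
bits are invented for this example and do not match the real STATX
values). Basic fields are always copied, while an attribute whose
reporting has a side effect is only touched when the caller asked for it:

```c
#include <assert.h>
#include <stdint.h>

#define SK_BASIC_STATS		0x1u	/* hypothetical: plain, side-effect-free fields */
#define SK_CHANGE_COOKIE	0x2u	/* hypothetical: reporting this has a side effect */

struct sk_inode {
	uint64_t size;
	uint64_t change_cookie;
	unsigned int cookie_reads;	/* models the side effect of reporting */
};

struct sk_stat {
	uint32_t result_mask;
	uint64_t size;
	uint64_t change_cookie;
};

static void sk_fillattr(uint32_t request_mask, struct sk_inode *inode,
			struct sk_stat *stat)
{
	/* Basic fields are cheap and harmless, so they are always filled. */
	stat->result_mask = SK_BASIC_STATS;
	stat->size = inode->size;
	stat->change_cookie = 0;

	/* Only pay for (and trigger) the side effect when asked. */
	if (request_mask & SK_CHANGE_COOKIE) {
		stat->result_mask |= SK_CHANGE_COOKIE;
		stat->change_cookie = inode->change_cookie;
		inode->cookie_reads++;
	}
}
```

A caller that has no mask of its own would pass the basic set, which is
the role STATX_BASIC_STATS plays for callers like ksmbd above.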

Signed-off-by: Jeff Layton 
---
 fs/9p/vfs_inode.c   |  4 ++--
 fs/9p/vfs_inode_dotl.c  |  4 ++--
 fs/afs/inode.c  |  2 +-
 fs/btrfs/inode.c|  2 +-
 fs/ceph/inode.c |  2 +-
 fs/coda/inode.c |  3 ++-
 fs/ecryptfs/inode.c |  5 +++--
 fs/erofs/inode.c|  2 +-
 fs/exfat/file.c |  2 +-
 fs/ext2/inode.c |  2 +-
 fs/ext4/inode.c |  2 +-
 fs/f2fs/file.c  |  2 +-
 fs/fat/file.c   |  2 +-
 fs/fuse/dir.c   |  2 +-
 fs/gfs2/inode.c |  2 +-
 fs/hfsplus/inode.c  |  2 +-
 fs/kernfs/inode.c   |  2 +-
 fs/libfs.c  |  4 ++--
 fs/minix/inode.c|  2 +-
 fs/nfs/inode.c  |  2 +-
 fs/nfs/namespace.c  |  3 ++-
 fs/ntfs3/file.c |  2 +-
 fs/ocfs2/file.c |  2 +-
 fs/orangefs/inode.c |  2 +-
 fs/proc/base.c  |  4 ++--
 fs/proc/fd.c|  2 +-
 fs/proc/generic.c   |  2 +-
 fs/proc/proc_net.c  |  2 +-
 fs/proc/proc_sysctl.c   |  2 +-
 fs/proc/root.c  |  3 ++-
 fs/smb/client/inode.c   |  2 +-
 fs/smb/server/smb2pdu.c | 22 +++---
 fs/smb/server/vfs.c |  3 ++-
 fs/stat.c   | 18 ++
 fs/sysv/itree.c |  3 ++-
 fs/ubifs/dir.c  |  2 +-
 fs/udf/symlink.c|  2 +-
 fs/vboxsf/utils.c   |  2 +-
 include/linux/fs.h  |  2 +-
 mm/shmem.c  |  2 +-
 40 files changed, 70 insertions(+), 62 deletions(-)

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 16d85e6033a3..d24d1f20e922 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1016,7 +1016,7 @@ v9fs_vfs_getattr(struct mnt_idmap *idmap, const struct path *path,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) {
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
return 0;
} else if (v9ses->cache & CACHE_WRITEBACK) {
if (S_ISREG(inode->i_mode)) {
@@ -1037,7 +1037,7 @@ v9fs_vfs_getattr(struct mnt_idmap *idmap, const struct path *path,
return PTR_ERR(st);
 
v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb, 0);
-   generic_fillattr(&nop_mnt_idmap, d_inode(dentry), stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(dentry), stat);
 
p9stat_free(st);
kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 464ea73d1bf8..8e8d5d2a13d8 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -451,7 +451,7 @@ v9fs_vfs_getattr_dotl(struct mnt_idmap *idmap,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
if (v9ses->cache & (CACHE_META|CACHE_LOOSE)) {
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
return 0;
} else if (v9ses->cache) {
if (S_ISREG(inode->i_mode)) {
@@ -476,7 +476,7 @@ v9fs_vfs_getattr_dotl(struct mnt_idmap *idmap,
return PTR_ERR(st);
 
v9fs_stat2inode_dotl(st, d_inode(dentry), 0);
-   generic_fillattr(&nop_mnt_idmap, d_inode(dentry), stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, d_inode(dentry), stat);
/* Change block size to what the server returned */
stat->blksize = st->st_blksize;
 
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index 6b636f43f548..1c794a1896aa 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -773,7 +773,7 @@ int afs_getattr(struct mnt_idmap *idmap, const struct path *path,
 
do {
read_seqbegin_or_lock(&vnode->cb_lock, &seq);
-   generic_fillattr(&nop_mnt_idmap, inode, stat);
+   generic_fillattr(&nop_mnt_idmap, request_mask, inode, stat);
if (test_bit(AFS_VNODE_SILLY_DELETED, &vnode->flags) &&
stat->nlink > 0)
stat->nlink -= 1;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ceac62c1cbfc..29a20f828dda 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8746,7 +8746,7 @@ static int btrfs_getattr(struct mnt_idmap *idmap,
   

[Cluster-devel] [PATCH v5 3/8] tmpfs: bump the mtime/ctime/iversion when page becomes writeable

2023-07-13 Thread Jeff Layton
Most filesystems that use the pagecache will update the mtime, ctime,
and change attribute when a page becomes writeable. Add a page_mkwrite
operation for tmpfs and just use it to bump the mtime, ctime and change
attribute.

This fixes xfstest generic/080 on tmpfs.

Signed-off-by: Jeff Layton 
---
 mm/shmem.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index b154af49d2df..654d9a585820 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2169,6 +2169,16 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
return ret;
 }
 
+static vm_fault_t shmem_page_mkwrite(struct vm_fault *vmf)
+{
+   struct vm_area_struct *vma = vmf->vma;
+   struct inode *inode = file_inode(vma->vm_file);
+
+   file_update_time(vma->vm_file);
+   inode_inc_iversion(inode);
+   return 0;
+}
+
 unsigned long shmem_get_unmapped_area(struct file *file,
  unsigned long uaddr, unsigned long len,
  unsigned long pgoff, unsigned long flags)
@@ -4210,6 +4220,7 @@ static const struct super_operations shmem_ops = {
 
 static const struct vm_operations_struct shmem_vm_ops = {
.fault  = shmem_fault,
+   .page_mkwrite   = shmem_page_mkwrite,
.map_pages  = filemap_map_pages,
 #ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,
@@ -4219,6 +4230,7 @@ static const struct vm_operations_struct shmem_vm_ops = {
 
 static const struct vm_operations_struct shmem_anon_vm_ops = {
.fault  = shmem_fault,
+   .page_mkwrite   = shmem_page_mkwrite,
.map_pages  = filemap_map_pages,
 #ifdef CONFIG_NUMA
.set_policy = shmem_set_policy,

-- 
2.41.0



[Cluster-devel] [PATCH v5 0/8] fs: implement multigrain timestamps

2023-07-13 Thread Jeff Layton
The VFS always uses coarse-grained timestamps when updating the
ctime and mtime after a change. This has the benefit of allowing
filesystems to optimize away a lot of metadata updates, down to around 1
per jiffy, even when a file is under heavy writes.

Unfortunately, this coarseness has always been an issue when we're
exporting via NFSv3, which relies on timestamps to validate caches. A
lot of changes can happen in a jiffy, so timestamps aren't sufficient to
help the client decide to invalidate the cache.

Even with NFSv4, a lot of exported filesystems don't properly support a
change attribute and are subject to the same problems with timestamp
granularity. Other applications have similar issues with timestamps (e.g.
backup applications).

If we were to always use fine-grained timestamps, that would improve the
situation, but that becomes rather expensive, as the underlying
filesystem would have to log a lot more metadata updates.

What we need is a way to only use fine-grained timestamps when they are
being actively queried. The idea is to use an unused bit in the ctime's
tv_nsec field to mark when the mtime or ctime has been queried via
getattr. Once that has been marked, the next m/ctime update will use a
fine-grained timestamp.

This patch series is based on top of Christian's vfs.all branch, which
has the recent conversion to the new ctime accessors. It should apply
cleanly on top of linux-next.

While the patchset does work, I'm mostly looking for feedback on the
core infrastructure API. Does this look reasonable? Am I missing any
races?

Signed-off-by: Jeff Layton 
base-commit: cf22d118b89a09a0160586412160d89098f7c4c7
---
Jeff Layton (8):
  fs: pass the request_mask to generic_fillattr
  fs: add infrastructure for multigrain timestamps
  tmpfs: bump the mtime/ctime/iversion when page becomes writeable
  tmpfs: add support for multigrain timestamps
  xfs: XFS_ICHGTIME_CREATE is unused
  xfs: switch to multigrain timestamps
  ext4: switch to multigrain timestamps
  btrfs: convert to multigrain timestamps

 fs/9p/vfs_inode.c   |  4 +-
 fs/9p/vfs_inode_dotl.c  |  4 +-
 fs/afs/inode.c  |  2 +-
 fs/btrfs/file.c | 24 ++
 fs/btrfs/inode.c|  2 +-
 fs/btrfs/super.c|  5 ++-
 fs/ceph/inode.c |  2 +-
 fs/coda/inode.c |  3 +-
 fs/ecryptfs/inode.c |  5 ++-
 fs/erofs/inode.c|  2 +-
 fs/exfat/file.c |  2 +-
 fs/ext2/inode.c |  2 +-
 fs/ext4/inode.c |  2 +-
 fs/ext4/super.c |  2 +-
 fs/f2fs/file.c  |  2 +-
 fs/fat/file.c   |  2 +-
 fs/fuse/dir.c   |  2 +-
 fs/gfs2/inode.c |  2 +-
 fs/hfsplus/inode.c  |  2 +-
 fs/inode.c  | 98 +
 fs/kernfs/inode.c   |  2 +-
 fs/libfs.c  |  4 +-
 fs/minix/inode.c|  2 +-
 fs/nfs/inode.c  |  2 +-
 fs/nfs/namespace.c  |  3 +-
 fs/ntfs3/file.c |  2 +-
 fs/ocfs2/file.c |  2 +-
 fs/orangefs/inode.c |  2 +-
 fs/proc/base.c  |  4 +-
 fs/proc/fd.c|  2 +-
 fs/proc/generic.c   |  2 +-
 fs/proc/proc_net.c  |  2 +-
 fs/proc/proc_sysctl.c   |  2 +-
 fs/proc/root.c  |  3 +-
 fs/smb/client/inode.c   |  2 +-
 fs/smb/server/smb2pdu.c | 22 -
 fs/smb/server/vfs.c |  3 +-
 fs/stat.c   | 59 -
 fs/sysv/itree.c |  3 +-
 fs/ubifs/dir.c  |  2 +-
 fs/udf/symlink.c|  2 +-
 fs/vboxsf/utils.c   |  2 +-
 fs/xfs/libxfs/xfs_shared.h  |  2 -
 fs/xfs/libxfs/xfs_trans_inode.c |  8 ++--
 fs/xfs/xfs_iops.c   |  4 +-
 fs/xfs/xfs_super.c  |  2 +-
 include/linux/fs.h  | 47 ++--
 mm/shmem.c  | 16 ++-
 48 files changed, 248 insertions(+), 129 deletions(-)
---
base-commit: cf22d118b89a09a0160586412160d89098f7c4c7
change-id: 20230713-mgctime-f2a9fc324918

Best regards,
-- 
Jeff Layton 



[Cluster-devel] [PATCH v5 5/8] xfs: XFS_ICHGTIME_CREATE is unused

2023-07-13 Thread Jeff Layton
Nothing ever sets this flag, which makes sense since the create time is
set at inode instantiation and is never changed. Remove it and the
handling of it in xfs_trans_ichgtime.

Signed-off-by: Jeff Layton 
---
 fs/xfs/libxfs/xfs_shared.h  | 2 --
 fs/xfs/libxfs/xfs_trans_inode.c | 2 --
 2 files changed, 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index c4381388c0c1..8989fff21723 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -126,8 +126,6 @@ void xfs_log_get_max_trans_res(struct xfs_mount *mp,
  */
 #define XFS_ICHGTIME_MOD 0x1 /* data fork modification timestamp */
 #define XFS_ICHGTIME_CHG 0x2 /* inode field change timestamp */
-#define XFS_ICHGTIME_CREATE 0x4 /* inode create timestamp */
-
 
 /*
  * Symlink decoding/encoding functions
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 6b2296ff248a..0c9df8df6d4a 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -68,8 +68,6 @@ xfs_trans_ichgtime(
inode->i_mtime = tv;
if (flags & XFS_ICHGTIME_CHG)
inode_set_ctime_to_ts(inode, tv);
-   if (flags & XFS_ICHGTIME_CREATE)
-   ip->i_crtime = tv;
 }
 
 /*

-- 
2.41.0



[Cluster-devel] [PATCH] gfs2: fix timestamp handling on quota inodes

2023-07-13 Thread Jeff Layton
While these aren't generally visible from userland, it's best to be
consistent with timestamp handling. When adjusting the quota, update the
mtime and ctime like we would with a write operation on any other inode,
and avoid updating the atime which should only be done for reads.

Signed-off-by: Jeff Layton 
---
 fs/gfs2/quota.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Christian,

Would you mind picking this into the vfs.ctime branch, assuming the GFS2
maintainers ack it? Andreas and I had discussed this privately, and I
think it makes sense as part of that series.

Thanks,
Jeff

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 704192b73605..aa5fd06d47bc 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -871,7 +871,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, loff_t loc,
size = loc + sizeof(struct gfs2_quota);
if (size > inode->i_size)
i_size_write(inode, size);
-   inode->i_mtime = inode->i_atime = current_time(inode);
+   inode->i_mtime = inode_set_ctime_current(inode);
mark_inode_dirty(inode);
set_bit(QDF_REFRESH, &qd->qd_flags);
}
-- 
2.41.0



Re: [Cluster-devel] [PATCH v2 00/89] fs: new accessors for inode->i_ctime

2023-07-10 Thread Jeff Layton
On Mon, 2023-07-10 at 14:35 +0200, Christian Brauner wrote:
> On Fri, Jul 07, 2023 at 08:42:31AM -0400, Jeff Layton wrote:
> > On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> > > v2:
> > > - prepend patches to add missing ctime updates
> > > - add simple_rename_timestamp helper function
> > > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> > > - drop individual inode_ctime_set_{sec,nsec} helpers
> > > 
> > 
> > After review by Jan and others, and Jan's ext4 rework, the diff on top
> > of the series I posted a couple of days ago is below. I don't really
> > want to spam everyone with another ~100 patch v3 series, but I can if
> > you think that's best.
> > 
> > Christian, what would you like me to do here?
> 
> I picked up the series from the list and folded the fixups you posted
> here into the respective fs conversion patches. I hope that helps you
> avoid a resend. You should have received a separate "thank you" mail for
> all of this.
> 
> To each patch that I folded one of the fixlets from below into I added a
> git note that records a link to your mail here and the respective patch
> hunk from this mail that I folded into the patch. git.kernel.org will
> show notes by default. For example,
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.ctime&id=8b0e3c2e99004609a16ba145bcbdfdddb78e220e
> should show you the note I added. You can also fetch them via
> git fetch $remote refs/notes/*:refs/notes/*
> (You probably know that ofc but jic.) if you're interested.
> 
> Based on v6.5-rc1 as of today.
> 

Many thanks!!! I'll get to work rebasing the multigrain timestamp series
on top of that.

> Btw, both b4 and patchwork somehow treat the series in weird ways.
> IOW, based on the message id of the cover letter I was able to pull most
> messages except for:
> 
> [07/92] fs: add ctime accessors infrastructure
> [08/92] fs: new helper: simple_rename_timestamp
> [92/92] fs: rename i_ctime field to __i_ctime
> 
> which I pulled in separately. Not sure what the cause of this is.

Good to know.

I ended up doing the send in two phases: one for the cover letter and
infrastructure patches that went to everyone, and one for the per-
subsystem patches that went to individual maintainers and lists.

I suspect that screwed up the message IDs somehow. Hopefully I won't
need to do a posting like that again soon, but I'll pay closer attention
to the message id handling next time.

Thanks again!
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v2 00/89] fs: new accessors for inode->i_ctime

2023-07-07 Thread Jeff Layton
On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> v2:
> - prepend patches to add missing ctime updates
> - add simple_rename_timestamp helper function
> - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> - drop individual inode_ctime_set_{sec,nsec} helpers
> 

After review by Jan and others, and Jan's ext4 rework, the diff on top
of the series I posted a couple of days ago is below. I don't really
want to spam everyone with another ~100 patch v3 series, but I can if
you think that's best.

Christian, what would you like me to do here?

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index bcdb1a0beccf..5f6e93714f5a 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -699,8 +699,7 @@ void ceph_fill_file_time(struct inode *inode, int issued,
if (ci->i_version == 0 ||
timespec64_compare(ctime, &ictime) > 0) {
dout("ctime %lld.%09ld -> %lld.%09ld inc w/ cap\n",
-inode_get_ctime(inode).tv_sec,
-inode_get_ctime(inode).tv_nsec,
+ictime.tv_sec, ictime.tv_nsec,
 ctime->tv_sec, ctime->tv_nsec);
inode_set_ctime_to_ts(inode, *ctime);
}
diff --git a/fs/erofs/inode.c b/fs/erofs/inode.c
index 806374d866d1..567c0d305ea4 100644
--- a/fs/erofs/inode.c
+++ b/fs/erofs/inode.c
@@ -175,10 +175,7 @@ static void *erofs_read_inode(struct erofs_buf *buf,
vi->chunkbits = sb->s_blocksize_bits +
(vi->chunkformat & EROFS_CHUNK_FORMAT_BLKBITS_MASK);
}
-   inode->i_mtime.tv_sec = inode_get_ctime(inode).tv_sec;
-   inode->i_atime.tv_sec = inode_get_ctime(inode).tv_sec;
-   inode->i_mtime.tv_nsec = inode_get_ctime(inode).tv_nsec;
-   inode->i_atime.tv_nsec = inode_get_ctime(inode).tv_nsec;
+   inode->i_mtime = inode->i_atime = inode_get_ctime(inode);
 
inode->i_flags &= ~S_DAX;
if (test_opt(&sbi->opt, DAX_ALWAYS) && S_ISREG(inode->i_mode) &&
diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index c007de6ac1c7..1b9f587f6cca 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -1351,7 +1351,7 @@ static int exfat_rename(struct mnt_idmap *idmap,
exfat_warn(sb, "abnormal access to an inode dropped");
WARN_ON(new_inode->i_nlink == 0);
}
-   EXFAT_I(new_inode)->i_crtime = inode_set_ctime_current(new_inode);
+   EXFAT_I(new_inode)->i_crtime = current_time(new_inode);
}
 
 unlock:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d502b930431b..d63543187359 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -868,64 +868,63 @@ struct ext4_inode {
  * affected filesystem before 2242.
  */
 
-static inline __le32 ext4_encode_extra_time(struct timespec64 *time)
+static inline __le32 ext4_encode_extra_time(struct timespec64 ts)
 {
-   u32 extra =((time->tv_sec - (s32)time->tv_sec) >> 32) & EXT4_EPOCH_MASK;
-   return cpu_to_le32(extra | (time->tv_nsec << EXT4_EPOCH_BITS));
+   u32 extra = ((ts.tv_sec - (s32)ts.tv_sec) >> 32) & EXT4_EPOCH_MASK;
+   return cpu_to_le32(extra | (ts.tv_nsec << EXT4_EPOCH_BITS));
 }
 
-static inline void ext4_decode_extra_time(struct timespec64 *time,
- __le32 extra)
+static inline struct timespec64 ext4_decode_extra_time(__le32 base,
+  __le32 extra)
 {
+   struct timespec64 ts = { .tv_sec = le32_to_cpu(base) };
+
if (unlikely(extra & cpu_to_le32(EXT4_EPOCH_MASK)))
-   time->tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
-   time->tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
+   ts.tv_sec += (u64)(le32_to_cpu(extra) & EXT4_EPOCH_MASK) << 32;
+   ts.tv_nsec = (le32_to_cpu(extra) & EXT4_NSEC_MASK) >> EXT4_EPOCH_BITS;
+   return ts;
 }
 
-#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode) \
+#define EXT4_INODE_SET_XTIME_VAL(xtime, inode, raw_inode, ts) \
 do { \
-   if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) {\
-   (raw_inode)->xtime = cpu_to_le32((inode)->xtime.tv_sec); \
-   (raw_inode)->xtime ## _extra = \
-   ext4_encode_extra_time(&(inode)->xtime); \
-   } \
-   else\
-   (raw_inode)->xtime = cpu_to_le32(clamp_t(int
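
The encode/decode pair in the ext4 hunk above can be tried out in
userspace. The sketch below is only a model under stated assumptions: it
assumes EXT4_EPOCH_BITS is 2 (epoch bits at the low end of the extra
field, nanoseconds above them), all model_* and MODEL_* names are made up
for illustration, and the on-disk little-endian conversion is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_EPOCH_BITS 2	/* assumed value of EXT4_EPOCH_BITS */
#define MODEL_EPOCH_MASK ((UINT32_C(1) << MODEL_EPOCH_BITS) - 1)
#define MODEL_NSEC_MASK  (~MODEL_EPOCH_MASK)

/* Userspace stand-in for struct timespec64 (hypothetical). */
struct model_ts {
	int64_t tv_sec;
	uint32_t tv_nsec;
};

/*
 * Low 32 bits of the seconds, reinterpreted as signed: this models the
 * base on-disk field. The uint32 -> int32 conversion is implementation
 * defined in C; gcc and clang wrap, which is what we rely on here.
 */
static int32_t model_encode_base(int64_t sec)
{
	return (int32_t)(uint32_t)sec;
}

/* Epoch bits in the low 2 bits, nanoseconds shifted above them. */
static uint32_t model_encode_extra(int64_t sec, uint32_t nsec)
{
	uint64_t high = (uint64_t)(sec - model_encode_base(sec)) >> 32;

	return ((uint32_t)high & MODEL_EPOCH_MASK) | (nsec << MODEL_EPOCH_BITS);
}

/* Same shape as the reworked decode helper: returns the value directly. */
static struct model_ts model_decode(int32_t base, uint32_t extra)
{
	struct model_ts ts = { .tv_sec = base, .tv_nsec = 0 };

	if (extra & MODEL_EPOCH_MASK)
		ts.tv_sec += (int64_t)(extra & MODEL_EPOCH_MASK) << 32;
	ts.tv_nsec = (extra & MODEL_NSEC_MASK) >> MODEL_EPOCH_BITS;
	return ts;
}

/* Round-trip a post-2038 timestamp (2100-01-01 00:00:00 UTC). */
static void model_selfcheck(void)
{
	int64_t sec = INT64_C(4102444800);
	uint32_t nsec = 999999999;
	struct model_ts ts = model_decode(model_encode_base(sec),
					  model_encode_extra(sec, nsec));

	assert(ts.tv_sec == sec);
	assert(ts.tv_nsec == nsec);
}
```

A date past 2038 ends up with epoch bits of 1 and round-trips cleanly.
Dates before December 1901 are outside what this model handles.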

Re: [Cluster-devel] [apparmor] [PATCH v2 08/92] fs: new helper: simple_rename_timestamp

2023-07-07 Thread Jeff Layton
On Thu, 2023-07-06 at 21:02 +, Seth Arnold wrote:
> On Wed, Jul 05, 2023 at 08:04:41PM -0400, Jeff Layton wrote:
> > 
> > I don't believe it's an issue. I've seen nothing in the POSIX spec that
> > mandates that timestamp updates to different inodes involved in an
> > operation be set to the _same_ value. It just says they must be updated.
> > 
> > It's also hard to believe that any software would depend on this either,
> > given that it's very inconsistent across filesystems today. AFAICT, this
> > was mostly done in the past just as a matter of convenience.
> 
> I've seen this assumption in several programs:
> 

Thanks for looking into this!

To be clear, POSIX doesn't require that _different_ inodes ever be set
to the same timestamp value. IOW, it certainly doesn't require that the
source and target directories on a rename() end up with the exact same
timestamp value.

Granted, POSIX is rather vague on timestamps in general, but most of the
examples below involve comparing different timestamps on the _same_
inode.


> mutt buffy.c
> https://sources.debian.org/src/mutt/2.2.9-1/buffy.c/?hl=625#L625
> 
>   if (mailbox->newly_created &&
>   (sb->st_ctime != sb->st_mtime || sb->st_ctime != sb->st_atime))
> mailbox->newly_created = 0;
> 

This should be fine with this patchset. Note that this is comparing
a/c/mtime on the same inode, and our usual pattern on inode
instantiation is:

inode->i_atime = inode->i_mtime = inode_set_ctime_current(inode);

...which should result in all of inode's timestamps being synchronized.
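
Seen from userspace, the mutt and neomutt checks boil down to a predicate
like the one below. This is a sketch, not code from either project:
stat_times_in_lockstep and lockstep_selfcheck are hypothetical names, and
the struct stat is filled in by hand so that nothing depends on a real
filesystem.

```c
#include <assert.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Returns 1 when atime, ctime and mtime of a single inode agree at
 * one-second granularity, which is what the "newly created" checks
 * quoted above rely on.
 */
static int stat_times_in_lockstep(const struct stat *st)
{
	return st->st_ctime == st->st_mtime &&
	       st->st_ctime == st->st_atime;
}

static void lockstep_selfcheck(void)
{
	struct stat st;

	memset(&st, 0, sizeof(st));

	/* Freshly instantiated inode: all three values come from one sample. */
	st.st_atime = st.st_mtime = st.st_ctime = 1688580000;
	assert(stat_times_in_lockstep(&st));

	/* A later ctime-only change (think chmod) breaks the equality. */
	st.st_ctime += 1;
	assert(!stat_times_in_lockstep(&st));
}
```

Since all three values are taken from the same inode, the predicate does
not care whether some other inode was stamped with a different time.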

> 
> neomutt mbox/mbox.c
> https://sources.debian.org/src/neomutt/20220429+dfsg1-4.1/mbox/mbox.c/?hl=1820#L1820
> 
>   if (m->newly_created && ((st.st_ctime != st.st_mtime) || (st.st_ctime != st.st_atime)))
> m->newly_created = false;
> 

Ditto here.

> 
> screen logfile.c
> https://sources.debian.org/src/screen/4.9.0-4/logfile.c/?hl=130#L130
> 
>   if ((!s->st_dev && !s->st_ino) || /* stat failed, that's new! */
>   !s->st_nlink ||   /* red alert: file unlinked */
>   (s->st_size < o.st_size) ||   /*   file truncated */
>   (s->st_mtime != o.st_mtime) ||/*file modified */
>   ((s->st_ctime != o.st_ctime) &&   /* file changed (moved) */
>!(s->st_mtime == s->st_ctime &&  /*  and it was not a change */
>  o.st_ctime < s->st_ctime)))/* due to delayed nfs write */
>   {
> 

This one is really weird. You have two different struct stat's, "o" and
"s". I assume though that these should be stat values from the same
inode, because otherwise this comparison would make no sense:

  ((s->st_ctime != o.st_ctime) &&   /* file changed (moved) */

In general, we can never contrive to ensure that the ctimes of two
different inodes are the same, since the ctime is always set by the
kernel to the current time, and you'd have to ensure that they were
created within the same jiffy (at least with today's code).

> nemo libnemo-private/nemo-vfs-file.c
> https://sources.debian.org/src/nemo/5.6.5-1/libnemo-private/nemo-vfs-file.c/?hl=344#L344
> 
>   /* mtime is when the contents changed; ctime is when the
>* contents or the permissions (inc. owner/group) changed.
>* So we can only know when the permissions changed if mtime
>* and ctime are different.
>*/
>   if (file->details->mtime == file->details->ctime) {
>   return FALSE;
>   }
> 

Ditto here with the first examples. This involves comparing timestamps
on the same inode, which should be fine.

> 
> While looking for more examples, I found a perl test that seems to suggest
> that at least Solaris, AFS, AmigaOS, DragonFly BSD do as you suggest:
> https://sources.debian.org/src/perl/5.36.0-7/t/op/stat.t/?hl=158#L140
> 

(I kinda miss Perl. I wrote a bunch of stuff in it in the 90's and early
naughties)

I think this test is supposed to be testing whether the mtime changes on
link() ?

-8<
my($nlink, $mtime, $ctime) = (stat($tmpfile))[$NLINK, $MTIME, $CTIME];

[...]


skip "Solaris tmpfs has different mtime/ctime link semantics", 2
 if $Is_Solaris and $cwd =~ m#^/tmp# and
$mtime && $mtime == $ctime;
-8<

...again, I think this would be ok too since it's just comparing the
mtime and ctime of the same inode. Granted this is a Solaris-specific
test, but Linux would be fine here too.

So in conclusion, I don't think this patchset will cause problems with
any of the above code.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v2 00/89] fs: new accessors for inode->i_ctime

2023-07-06 Thread Jeff Layton
On Thu, 2023-07-06 at 10:16 -0500, Eric W. Biederman wrote:
> Jeff Layton  writes:
> 
> > On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> > > v2:
> > > - prepend patches to add missing ctime updates
> > > - add simple_rename_timestamp helper function
> > > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> > > - drop individual inode_ctime_set_{sec,nsec} helpers
> > > 
> > > I've been working on a patchset to change how the inode->i_ctime is
> > > accessed in order to give us conditional, high-res timestamps for the
> > > ctime and mtime. struct timespec64 has unused bits in it that we can use
> > > to implement this. In order to do that however, we need to wrap all
> > > accesses of inode->i_ctime to ensure that bits used as flags are
> > > appropriately handled.
> > > 
> > > The patchset starts with reposts of some missing ctime updates that I
> > > spotted in the tree. It then adds a new helper function for updating the
> > > timestamp after a successful rename, and new ctime accessor
> > > infrastructure.
> > > 
> > > The bulk of the patchset is individual conversions of different
> > > > subsystems to use the new infrastructure. Finally, the patchset renames
> > > the i_ctime field to __i_ctime to help ensure that I didn't miss
> > > anything.
> > > 
> > > This should apply cleanly to linux-next as of this morning.
> > > 
> > > Most of this conversion was done via 5 different coccinelle scripts, run
> > > in succession, with a large swath of by-hand conversions to clean up the
> > > remainder.
> > > 
> > 
> > A couple of other things I should note:
> > 
> > If you sent me an Acked-by or Reviewed-by in the previous set, then I
> > tried to keep it on the patch here, since the respun patches are mostly
> > just renaming stuff from v1. Let me know if I've missed any.
> > 
> > I've also pushed the pile to my tree as this tag:
> > 
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705
> > 
> > In case that's easier to work with.
> 
> Are there any preliminary patches showing what you want your introduced
> accessors to turn into?  It is hard to judge the sanity of the
> introduction of wrappers without seeing what the wrappers are ultimately
> going to do.
> 
> Eric

I have a draft version of the multigrain patches on top of the wrapper
conversion I've already posted in my "mgctime-experimental" branch:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=mgctime-experimental

The rationale is best explained in this changelog:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=mgctime-experimental&id=face437a144d3375afb7f70c233b0644b4edccba

The idea will be to enable this on a per-fs basis.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH v2 08/92] fs: new helper: simple_rename_timestamp

2023-07-05 Thread Jeff Layton
On Thu, 2023-07-06 at 08:19 +0900, Damien Le Moal wrote:
> On 7/6/23 03:58, Jeff Layton wrote:
> > A rename potentially involves updating 4 different inode timestamps. Add
> > a function that handles the details sanely, and convert the libfs.c
> > callers to use it.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/libfs.c | 36 +++-
> >  include/linux/fs.h |  2 ++
> >  2 files changed, 29 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/libfs.c b/fs/libfs.c
> > index a7e56baf8bbd..9ee79668c909 100644
> > --- a/fs/libfs.c
> > +++ b/fs/libfs.c
> > @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry)
> >  }
> >  EXPORT_SYMBOL(simple_rmdir);
> >  
> > +/**
> > + * simple_rename_timestamp - update the various inode timestamps for rename
> > + * @old_dir: old parent directory
> > + * @old_dentry: dentry that is being renamed
> > + * @new_dir: new parent directory
> > + * @new_dentry: target for rename
> > + *
> > + * POSIX mandates that the old and new parent directories have their ctime and
> > + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
> > + * their ctime updated.
> > + */
> > +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> > +struct inode *new_dir, struct dentry *new_dentry)
> > +{
> > +   struct inode *newino = d_inode(new_dentry);
> > +
> > +   old_dir->i_mtime = inode_set_ctime_current(old_dir);
> > +   if (new_dir != old_dir)
> > +   new_dir->i_mtime = inode_set_ctime_current(new_dir);
> > +   inode_set_ctime_current(d_inode(old_dentry));
> > +   if (newino)
> > +   inode_set_ctime_current(newino);
> > +}
> > +EXPORT_SYMBOL_GPL(simple_rename_timestamp);
> > +
> >  int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
> >struct inode *new_dir, struct dentry *new_dentry)
> >  {
> > @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
> > inc_nlink(old_dir);
> > }
> > }
> > -   old_dir->i_ctime = old_dir->i_mtime =
> > -   new_dir->i_ctime = new_dir->i_mtime =
> > -   d_inode(old_dentry)->i_ctime =
> > -   d_inode(new_dentry)->i_ctime = current_time(old_dir);
> > -
> > +   simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
> 
> This is somewhat changing the current behavior: before the patch, the mtime and
> ctime of old_dir, new_dir and the inodes associated with the dentries are always
> equal. But given that simple_rename_timestamp() calls inode_set_ctime_current()
> 4 times, the times could potentially be different.
> 
> I am not sure if that is an issue, but it seems that calling
> inode_set_ctime_current() once, recording the "now" time it sets and using that
> value to set all times may be more efficient and preserve the existing behavior.
> 

I don't believe it's an issue. I've seen nothing in the POSIX spec that
mandates that timestamp updates to different inodes involved in an
operation be set to the _same_ value. It just says they must be updated.

It's also hard to believe that any software would depend on this either,
given that it's very inconsistent across filesystems today. AFAICT, this
was mostly done in the past just as a matter of convenience.

The other problem with doing it that way is that it assumes that
current_time(inode) should always return the same value when given
different inodes. Is it really correct to do this?

inode_set_ctime_to_ts(dir, inode_set_ctime_current(inode));

"dir" and "inode" are different inodes, after all, and you're setting
dir's timestamp to "inode"'s value. It's not a big deal today since
they're always on the same sb, but the ultimate goal of these changes is
to implement multigrain timestamps. That will mean fetching a fine-
grained timestamp for an update when the existing mtime or ctime value
has been queried via getattr.

With that change, I think it's best that we treat updates to different
inodes individually, as some of them may require updating with a fine-
grained timestamp and some may not.
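
As a rough illustration of "treat each inode individually", here is a
userspace model of the helper. It is a sketch only: struct model_inode
and the model_* functions are stand-ins invented for this example, and
the fake clock ticks on every read purely to exaggerate the effect.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for struct timespec64 and struct inode (hypothetical). */
struct model_ts {
	long long tv_sec;
	long tv_nsec;
};

struct model_inode {
	struct model_ts mtime, ctime;
};

/* Fake clock: advances by 1us on every sample. */
static struct model_ts model_now = { 1688580000LL, 0 };

static struct model_ts model_current_time(void)
{
	model_now.tv_nsec += 1000;
	return model_now;
}

/* Same shape as inode_set_ctime_current(): store "now", then return it. */
static struct model_ts model_set_ctime_current(struct model_inode *inode)
{
	inode->ctime = model_current_time();
	return inode->ctime;
}

/* Mirrors the structure of simple_rename_timestamp() on model inodes. */
static void model_rename_timestamp(struct model_inode *old_dir,
				   struct model_inode *old_inode,
				   struct model_inode *new_dir,
				   struct model_inode *new_inode)
{
	old_dir->mtime = model_set_ctime_current(old_dir);
	if (new_dir != old_dir)
		new_dir->mtime = model_set_ctime_current(new_dir);
	model_set_ctime_current(old_inode);
	if (new_inode)
		model_set_ctime_current(new_inode);
}

static int model_ts_equal(struct model_ts a, struct model_ts b)
{
	return a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec;
}

static void model_selfcheck(void)
{
	struct model_inode old_dir = {0}, new_dir = {0}, victim = {0};

	model_rename_timestamp(&old_dir, &victim, &new_dir, NULL);

	/* Each directory is internally consistent... */
	assert(model_ts_equal(old_dir.mtime, old_dir.ctime));
	assert(model_ts_equal(new_dir.mtime, new_dir.ctime));
	/* ...but the two directories need not match each other. */
	assert(!model_ts_equal(old_dir.ctime, new_dir.ctime));
}
```

The mtime and ctime of any one inode still come from a single sample, so
same-inode comparisons keep working; only cross-inode equality is given
up.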

> > return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(simple_rename_exchange);
> > @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
> >

Re: [Cluster-devel] [PATCH v2 00/89] fs: new accessors for inode->i_ctime

2023-07-05 Thread Jeff Layton
On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> v2:
> - prepend patches to add missing ctime updates
> - add simple_rename_timestamp helper function
> - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> - drop individual inode_ctime_set_{sec,nsec} helpers
> 
> I've been working on a patchset to change how the inode->i_ctime is
> accessed in order to give us conditional, high-res timestamps for the
> ctime and mtime. struct timespec64 has unused bits in it that we can use
> to implement this. In order to do that however, we need to wrap all
> accesses of inode->i_ctime to ensure that bits used as flags are
> appropriately handled.
> 
> The patchset starts with reposts of some missing ctime updates that I
> spotted in the tree. It then adds a new helper function for updating the
> timestamp after a successful rename, and new ctime accessor
> infrastructure.
> 
> The bulk of the patchset is individual conversions of different
> subsystems to use the new infrastructure. Finally, the patchset renames
> the i_ctime field to __i_ctime to help ensure that I didn't miss
> anything.
> 
> This should apply cleanly to linux-next as of this morning.
> 
> Most of this conversion was done via 5 different coccinelle scripts, run
> in succession, with a large swath of by-hand conversions to clean up the
> remainder.
> 

A couple of other things I should note:

If you sent me an Acked-by or Reviewed-by in the previous set, then I
tried to keep it on the patch here, since the respun patches are mostly
just renaming stuff from v1. Let me know if I've missed any.

I've also pushed the pile to my tree as this tag:


https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705

In case that's easier to work with.

Cheers,
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH 7/9] gfs2: update ctime when quota is updated

2023-07-05 Thread Jeff Layton
On Wed, 2023-07-05 at 22:25 +0200, Andreas Gruenbacher wrote:
> On Mon, Jun 12, 2023 at 12:36 PM Jeff Layton  wrote:
> > On Fri, 2023-06-09 at 18:44 +0200, Andreas Gruenbacher wrote:
> > > Jeff,
> > > 
> > > On Fri, Jun 9, 2023 at 2:50 PM Jeff Layton  wrote:
> > > > Signed-off-by: Jeff Layton 
> > > > ---
> > > >  fs/gfs2/quota.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> > > > index 1ed17226d9ed..6d283e071b90 100644
> > > > --- a/fs/gfs2/quota.c
> > > > +++ b/fs/gfs2/quota.c
> > > > @@ -869,7 +869,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, loff_t loc,
> > > > size = loc + sizeof(struct gfs2_quota);
> > > > if (size > inode->i_size)
> > > > i_size_write(inode, size);
> > > > -   inode->i_mtime = inode->i_atime = current_time(inode);
> > > > +   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> > > 
> > > I don't think we need to worry about the ctime of the quota inode as
> > > that inode is internal to the filesystem only.
> > > 
> > 
> > Thanks Andreas.  I'll plan to drop this patch from the series for now.
> > 
> > Does updating the mtime and atime here serve any purpose, or should
> > those also be removed? If you plan to keep the a/mtime updates then I'd
> > still suggest updating the ctime for consistency's sake. It shouldn't
> > cost anything extra to do so since you're dirtying the inode below
> > anyway.
> 
> Yes, good point actually, we should keep things consistent for simplicity.
> 
> Would you add this back in if you do another posting?
> 

I just re-posted the other patches in this as part of the ctime accessor
conversion. If I post again though, I can resurrect the gfs2 patch. If
not, we can do a follow-on fix later.

Since we're discussing it, it may be more correct to remove the atime
update there. gfs2_adjust_quota sounds like a "modify" operation, not a
"read", so I don't see a reason to update the atime.

In general, the only time you want to set the atime, ctime and
mtime in lockstep is when the inode is brand new.
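
To spell out the distinction, a minimal userspace sketch follows. The
names (struct model_inode, model_clock, model_new_inode_times,
model_modify_times) are hypothetical stand-ins and not gfs2 or VFS code;
the clock is advanced by hand to keep the example deterministic.

```c
#include <assert.h>

/* Stand-ins for struct timespec64 and struct inode (hypothetical). */
struct model_ts {
	long long tv_sec;
	long tv_nsec;
};

struct model_inode {
	struct model_ts atime, mtime, ctime;
};

/* Fake clock; the caller moves it forward explicitly. */
static struct model_ts model_clock = { 1688580000LL, 0 };

/* Same shape as inode_set_ctime_current(): store "now", then return it. */
static struct model_ts model_set_ctime_current(struct model_inode *inode)
{
	inode->ctime = model_clock;
	return inode->ctime;
}

/* Brand-new inode: all three timestamps start out in lockstep. */
static void model_new_inode_times(struct model_inode *inode)
{
	inode->atime = inode->mtime = model_set_ctime_current(inode);
}

/* "Modify" operation: bump mtime and ctime, leave atime alone. */
static void model_modify_times(struct model_inode *inode)
{
	inode->mtime = model_set_ctime_current(inode);
}

static void model_selfcheck(void)
{
	struct model_inode quota;

	model_new_inode_times(&quota);
	model_clock.tv_sec += 60;	/* time passes */
	model_modify_times(&quota);

	/* mtime and ctime moved together; atime still shows creation. */
	assert(quota.mtime.tv_sec == quota.ctime.tv_sec);
	assert(quota.atime.tv_sec + 60 == quota.mtime.tv_sec);
}
```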
-- 
Jeff Layton 



[Cluster-devel] [PATCH v2 47/92] gfs2: convert to ctime accessor functions

2023-07-05 Thread Jeff Layton
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton 
---
 fs/gfs2/acl.c   |  2 +-
 fs/gfs2/bmap.c  | 11 +--
 fs/gfs2/dir.c   | 15 ---
 fs/gfs2/file.c  |  2 +-
 fs/gfs2/glops.c |  4 ++--
 fs/gfs2/inode.c |  8 
 fs/gfs2/super.c |  4 ++--
 fs/gfs2/xattr.c |  8 
 8 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
index a392aa0f041d..443640e6fb9c 100644
--- a/fs/gfs2/acl.c
+++ b/fs/gfs2/acl.c
@@ -142,7 +142,7 @@ int gfs2_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 
ret = __gfs2_set_acl(inode, acl, type);
if (!ret && mode != inode->i_mode) {
-   inode->i_ctime = current_time(inode);
+   inode_set_ctime_current(inode);
inode->i_mode = mode;
mark_inode_dirty(inode);
}
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 8d611fbcf0bd..45ea63f7167d 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1386,7 +1386,7 @@ static int trunc_start(struct inode *inode, u64 newsize)
ip->i_diskflags |= GFS2_DIF_TRUNC_IN_PROG;
 
i_size_write(inode, newsize);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
gfs2_dinode_out(ip, dibh->b_data);
 
if (journaled)
@@ -1583,8 +1583,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 
/* Every transaction boundary, we rewrite the dinode
   to keep its di_blocks current in case of failure. */
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime =
-   current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
brelse(dibh);
@@ -1950,7 +1949,7 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
gfs2_statfs_change(sdp, 0, +btotal, 0);
gfs2_quota_change(ip, -(s64)btotal, ip->i_inode.i_uid,
  ip->i_inode.i_gid);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
up_write(&ip->i_rw_mutex);
@@ -1993,7 +1992,7 @@ static int trunc_end(struct gfs2_inode *ip)
gfs2_buffer_clear_tail(dibh, sizeof(struct gfs2_dinode));
gfs2_ordered_del_inode(ip);
}
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
ip->i_diskflags &= ~GFS2_DIF_TRUNC_IN_PROG;
 
gfs2_trans_add_meta(ip->i_gl, dibh);
@@ -2094,7 +2093,7 @@ static int do_grow(struct inode *inode, u64 size)
goto do_end_trans;
 
truncate_setsize(inode, size);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
brelse(dibh);
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 54a6d17b8c25..1a2afa88f8be 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -130,7 +130,7 @@ static int gfs2_dir_write_stuffed(struct gfs2_inode *ip, const char *buf,
memcpy(dibh->b_data + offset + sizeof(struct gfs2_dinode), buf, size);
if (ip->i_inode.i_size < offset + size)
i_size_write(&ip->i_inode, offset + size);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
gfs2_dinode_out(ip, dibh->b_data);
 
brelse(dibh);
@@ -227,7 +227,7 @@ static int gfs2_dir_write_data(struct gfs2_inode *ip, const char *buf,
 
if (ip->i_inode.i_size < offset + copied)
i_size_write(&ip->i_inode, offset + copied);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_set_ctime_current(&ip->i_inode);
 
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
@@ -1814,7 +1814,7 @@ int gfs2_dir_add(struct inode *inode, const struct q

[Cluster-devel] [PATCH v2 92/92] fs: rename i_ctime field to __i_ctime

2023-07-05 Thread Jeff Layton
Now that everything in-tree is converted to use the accessor functions,
rename the i_ctime field in the inode to discourage direct access.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 14e38bd900f1..b66442f91835 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -642,7 +642,7 @@ struct inode {
loff_t  i_size;
struct timespec64   i_atime;
struct timespec64   i_mtime;
-   struct timespec64   i_ctime;
+   struct timespec64   __i_ctime; /* use inode_*_ctime accessors! */
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
u8  i_blkbits;
@@ -1485,7 +1485,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode);
  */
 static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 {
-   return inode->i_ctime;
+   return inode->__i_ctime;
 }
 
 /**
@@ -1498,7 +1498,7 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode)
 static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
  struct timespec64 ts)
 {
-   inode->i_ctime = ts;
+   inode->__i_ctime = ts;
return ts;
 }
 
-- 
2.41.0



[Cluster-devel] [PATCH v2 08/92] fs: new helper: simple_rename_timestamp

2023-07-05 Thread Jeff Layton
A rename potentially involves updating 4 different inode timestamps. Add
a function that handles the details sanely, and convert the libfs.c
callers to use it.

Signed-off-by: Jeff Layton 
---
 fs/libfs.c | 36 +++-
 include/linux/fs.h |  2 ++
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index a7e56baf8bbd..9ee79668c909 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry)
 }
 EXPORT_SYMBOL(simple_rmdir);
 
+/**
+ * simple_rename_timestamp - update the various inode timestamps for rename
+ * @old_dir: old parent directory
+ * @old_dentry: dentry that is being renamed
+ * @new_dir: new parent directory
+ * @new_dentry: target for rename
+ *
+ * POSIX mandates that the old and new parent directories have their ctime and
+ * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
+ * their ctime updated.
+ */
+void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
+struct inode *new_dir, struct dentry *new_dentry)
+{
+   struct inode *newino = d_inode(new_dentry);
+
+   old_dir->i_mtime = inode_set_ctime_current(old_dir);
+   if (new_dir != old_dir)
+   new_dir->i_mtime = inode_set_ctime_current(new_dir);
+   inode_set_ctime_current(d_inode(old_dentry));
+   if (newino)
+   inode_set_ctime_current(newino);
+}
+EXPORT_SYMBOL_GPL(simple_rename_timestamp);
+
 int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
   struct inode *new_dir, struct dentry *new_dentry)
 {
@@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
inc_nlink(old_dir);
}
}
-   old_dir->i_ctime = old_dir->i_mtime =
-   new_dir->i_ctime = new_dir->i_mtime =
-   d_inode(old_dentry)->i_ctime =
-   d_inode(new_dentry)->i_ctime = current_time(old_dir);
-
+   simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
return 0;
 }
 EXPORT_SYMBOL_GPL(simple_rename_exchange);
@@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
  struct dentry *old_dentry, struct inode *new_dir,
  struct dentry *new_dentry, unsigned int flags)
 {
-   struct inode *inode = d_inode(old_dentry);
int they_are_dirs = d_is_dir(old_dentry);
 
if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
@@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
inc_nlink(new_dir);
}
 
-   old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
-   new_dir->i_mtime = inode->i_ctime = current_time(old_dir);
-
+   simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
return 0;
 }
 EXPORT_SYMBOL(simple_rename);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bdfbd11a5811..14e38bd900f1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file *file);
 extern int simple_link(struct dentry *, struct inode *, struct dentry *);
 extern int simple_unlink(struct inode *, struct dentry *);
 extern int simple_rmdir(struct inode *, struct dentry *);
+void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
+struct inode *new_dir, struct dentry *new_dentry);
 extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
   struct inode *new_dir, struct dentry *new_dentry);
 extern int simple_rename(struct mnt_idmap *, struct inode *,
-- 
2.41.0



[Cluster-devel] [PATCH v2 07/92] fs: add ctime accessors infrastructure

2023-07-05 Thread Jeff Layton
struct timespec64 has unused bits in the tv_nsec field that can be used
for other purposes. In future patches, we're going to change how the
inode->i_ctime is accessed in certain inodes in order to make use of
them. In order to do that safely though, we'll need to eradicate raw
accesses of the inode->i_ctime field from the kernel.
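
The "unused bits" point is plain arithmetic: a valid tv_nsec never
exceeds 999,999,999, which is below 2^30, so the bits above that are free
to carry a flag. The sketch below models this on a 32-bit value. The flag
name and its bit position are assumptions chosen for illustration, not
necessarily what the later patches use, and the model_* helpers are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NSEC_MAX     UINT32_C(999999999)	/* largest valid tv_nsec */
#define MODEL_FLAG_QUERIED (UINT32_C(1) << 30)	/* hypothetical flag bit */

/* Store a nanosecond count together with the flag in one 32-bit word. */
static uint32_t model_nsec_pack(uint32_t nsec, int queried)
{
	return nsec | (queried ? MODEL_FLAG_QUERIED : 0);
}

/* What a "get" accessor has to do: hide the flag from callers. */
static uint32_t model_nsec_value(uint32_t raw)
{
	return raw & ~MODEL_FLAG_QUERIED;
}

static int model_nsec_queried(uint32_t raw)
{
	return (raw & MODEL_FLAG_QUERIED) != 0;
}

static void model_selfcheck(void)
{
	uint32_t raw = model_nsec_pack(MODEL_NSEC_MAX, 1);

	/* The flag bit sits above every valid nanosecond value. */
	assert(MODEL_NSEC_MAX < MODEL_FLAG_QUERIED);
	assert(model_nsec_queried(raw));
	assert(model_nsec_value(raw) == MODEL_NSEC_MAX);
	assert(!model_nsec_queried(model_nsec_pack(123, 0)));
}
```

A raw read of the packed word would hand the flag bit to the caller as if
it were part of the time, which is why every access has to go through an
accessor first.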

Add new accessor functions for the ctime that we use to replace them.

Reviewed-by: Jan Kara 
Reviewed-by: Luis Chamberlain 
Signed-off-by: Jeff Layton 
---
 fs/inode.c | 16 
 include/linux/fs.h | 45 -
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d37fad91c8da..21b026d95b51 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode *inode)
 }
 EXPORT_SYMBOL(current_time);
 
+/**
+ * inode_set_ctime_current - set the ctime to current_time
+ * @inode: inode
+ *
+ * Set the inode->i_ctime to the current value for the inode. Returns
+ * the current value that was assigned to i_ctime.
+ */
+struct timespec64 inode_set_ctime_current(struct inode *inode)
+{
+   struct timespec64 now = current_time(inode);
+
+   inode_set_ctime(inode, now.tv_sec, now.tv_nsec);
+   return now;
+}
+EXPORT_SYMBOL(inode_set_ctime_current);
+
 /**
  * in_group_or_capable - check whether caller is CAP_FSETID privileged
  * @idmap: idmap of the mount @inode was found from
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 824accb89a91..bdfbd11a5811 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1474,7 +1474,50 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
   kgid_has_mapping(fs_userns, kgid);
 }
 
-extern struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_time(struct inode *inode);
+struct timespec64 inode_set_ctime_current(struct inode *inode);
+
+/**
+ * inode_get_ctime - fetch the current ctime from the inode
+ * @inode: inode from which to fetch ctime
+ *
+ * Grab the current ctime from the inode and return it.
+ */
+static inline struct timespec64 inode_get_ctime(const struct inode *inode)
+{
+   return inode->i_ctime;
+}
+
+/**
+ * inode_set_ctime_to_ts - set the ctime in the inode
+ * @inode: inode in which to set the ctime
+ * @ts: value to set in the ctime field
+ *
+ * Set the ctime in @inode to @ts
+ */
+static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
+ struct timespec64 ts)
+{
+   inode->i_ctime = ts;
+   return ts;
+}
+
+/**
+ * inode_set_ctime - set the ctime in the inode
+ * @inode: inode in which to set the ctime
+ * @sec: tv_sec value to set
+ * @nsec: tv_nsec value to set
+ *
+ * Set the ctime in @inode to { @sec, @nsec }
+ */
+static inline struct timespec64 inode_set_ctime(struct inode *inode,
+   time64_t sec, long nsec)
+{
+   struct timespec64 ts = { .tv_sec  = sec,
+.tv_nsec = nsec };
+
+   return inode_set_ctime_to_ts(inode, ts);
+}
 
 /*
  * Snapshotting support.
-- 
2.41.0



[Cluster-devel] [PATCH v2 00/89] fs: new accessors for inode->i_ctime

2023-07-05 Thread Jeff Layton
v2:
- prepend patches to add missing ctime updates
- add simple_rename_timestamp helper function
- rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
- drop individual inode_ctime_set_{sec,nsec} helpers

I've been working on a patchset to change how the inode->i_ctime is
accessed in order to give us conditional, high-res timestamps for the
ctime and mtime. struct timespec64 has unused bits in it that we can use
to implement this. In order to do that however, we need to wrap all
accesses of inode->i_ctime to ensure that bits used as flags are
appropriately handled.
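
As a rough illustration of why every access needs to go through a wrapper,
here is a small userspace sketch. It is not the multigrain implementation:
the flag name, the choice of bit 30, and all of the model_* helpers are
assumptions made up for this example. The only fact it relies on is that a
valid tv_nsec is below 1,000,000,000, which is less than 2^30, so the bits
above that are never needed for the nanoseconds value itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct timespec64. */
struct timespec64 {
	int64_t tv_sec;
	long    tv_nsec;
};

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical flag (an assumption for this sketch). A valid tv_nsec is
 * always < 2^30, so bit 30 is free to carry extra state.
 */
#define MODEL_CTIME_QUERIED (1L << 30)

/* Store a new ctime. A fresh value never has the flag set. */
static inline void model_set_ctime(struct timespec64 *ctime,
				   int64_t sec, long nsec)
{
	assert(nsec >= 0 && nsec < NSEC_PER_SEC);
	ctime->tv_sec = sec;
	ctime->tv_nsec = nsec;
}

/* Record that someone looked at the ctime, leaving the time bits alone. */
static inline void model_mark_queried(struct timespec64 *ctime)
{
	ctime->tv_nsec |= MODEL_CTIME_QUERIED;
}

/* Report whether the flag bit is currently set. */
static inline bool model_was_queried(const struct timespec64 *ctime)
{
	return (ctime->tv_nsec & MODEL_CTIME_QUERIED) != 0;
}

/* Accessor: return the timestamp with the flag bit masked off. */
static inline struct timespec64 model_get_ctime(const struct timespec64 *ctime)
{
	struct timespec64 ts = *ctime;

	ts.tv_nsec &= ~MODEL_CTIME_QUERIED;
	return ts;
}
```

A caller that reads the raw field after model_mark_queried() would see a
tv_nsec value that is out of range, which is the kind of breakage the
accessors are meant to prevent.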

The patchset starts with reposts of some missing ctime updates that I
spotted in the tree. It then adds a new helper function for updating the
timestamp after a successful rename, and new ctime accessor
infrastructure.

The bulk of the patchset is individual conversions of different
subsystems to use the new infrastructure. Finally, the patchset renames
the i_ctime field to __i_ctime to help ensure that I didn't miss
anything.

This should apply cleanly to linux-next as of this morning.

Most of this conversion was done via 5 different coccinelle scripts, run
in succession, with a large swath of by-hand conversions to clean up the
remainder.

The coccinelle scripts that were used are below:

::
cocci/ctime1.cocci
::
// convert as much to use inode_set_ctime_current as possible
@@
identifier timei;
struct inode *inode;
expression E1, E2;
@@
(
- inode->i_ctime = E1 = E2 = current_time(timei)
+ E1 = E2 = inode_set_ctime_current(inode)
|
- inode->i_ctime = E1 = current_time(timei)
+ E1 = inode_set_ctime_current(inode)
|
- E1 = inode->i_ctime = current_time(timei)
+ E1 = inode_set_ctime_current(inode)
|
- inode->i_ctime = current_time(timei)
+ inode_set_ctime_current(inode)
)

@@
struct inode *inode;
expression E1, E2, E3;
@@
(
- E1 = current_time(inode)
+ E1 = inode_set_ctime_current(inode)
|
- E1 = current_time(E3)
+ E1 = inode_set_ctime_current(inode)
)
...
(
- inode->i_ctime = E1;
|
- E2 = inode->i_ctime = E1;
+ E2 = E1;
)
::
cocci/ctime2.cocci
::
// get the places that set individual timespec64 fields
@@
struct inode *inode;
expression val, val2;
@@
- inode->i_ctime.tv_sec = val
+ inode_set_ctime(inode, val, val2)
...
- inode->i_ctime.tv_nsec = val2;

// get places that just set the tv_sec
@@
struct inode *inode;
expression sec, E1, E2, E3;
@@
(
- E3 = inode->i_ctime.tv_sec = sec
+ E3 = inode_set_ctime(inode, sec, 0).tv_sec
|
- inode->i_ctime.tv_sec = sec
+ inode_set_ctime(inode, sec, 0)
)
<...
(
- inode->i_ctime.tv_nsec = 0;
|
- E1 = inode->i_ctime.tv_nsec = 0
+ E1 = 0
|
- inode->i_ctime.tv_nsec = E1 = 0
+ E1 = 0
|
- inode->i_ctime.tv_nsec = E1 = E2 = 0
+ E1 = E2 = 0
)
...>

::
cocci/ctime3.cocci
::
// convert places that set i_ctime to a timespec64 directly
@@
struct inode *inode;
expression ts, E1, E2;
@@
(
- inode->i_ctime = E1 = E2 = ts
+ E1 = E2 = inode_set_ctime_to_ts(inode, ts)
|
- inode->i_ctime = E1 = ts
+ E1 = inode_set_ctime_to_ts(inode, ts)
|
- inode->i_ctime = ts
+ inode_set_ctime_to_ts(inode, ts)
)
::
cocci/ctime4.cocci
::
// catch places that set the i_ctime in an inode embedded in another structure
@@
expression E1, E2, E3;
@@
(
- E3.i_ctime = E1 = E2 = current_time(&E3)
+ E1 = E2 = inode_set_ctime_current(&E3)
|
- E3.i_ctime = E1 = current_time(&E3)
+ E1 = inode_set_ctime_current(&E3)
|
- E1 = E3.i_ctime = current_time(&E3)
+ E1 = inode_set_ctime_current(&E3)
|
- E3.i_ctime = current_time(&E3)
+ inode_set_ctime_current(&E3)
)
::
cocci/ctime5.cocci
::
// convert the remaining i_ctime accesses
@@
struct inode *inode;
@@
- inode->i_ctime
+ inode_get_ctime(inode)
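
To make the effect of the first script concrete, here is a userspace sketch
of one call site before and after conversion. The model_inode structure, the
fake clock, and the model_* function names are assumptions for illustration
only; the real helpers are the ones added by the infrastructure patch.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types. */
struct timespec64 {
	int64_t tv_sec;
	long    tv_nsec;
};

struct model_inode {
	struct timespec64 i_mtime;
	struct timespec64 i_ctime;
};

/* Fake clock standing in for current_time(): one second per call. */
static int64_t model_clock_sec;

static struct timespec64 model_current_time(void)
{
	struct timespec64 now = { .tv_sec = ++model_clock_sec, .tv_nsec = 0 };

	return now;
}

/* Stand-in for inode_set_ctime_current(): set ctime, return the value. */
static struct timespec64 model_set_ctime_current(struct model_inode *inode)
{
	struct timespec64 now = model_current_time();

	inode->i_ctime = now;
	return now;
}

/* Before conversion: raw chained assignment touching i_ctime directly. */
static void model_touch_before(struct model_inode *inode)
{
	inode->i_mtime = inode->i_ctime = model_current_time();
}

/* After conversion: only the helper writes i_ctime. */
static void model_touch_after(struct model_inode *inode)
{
	inode->i_mtime = model_set_ctime_current(inode);
}
```

Both forms leave mtime and ctime equal; the converted one just routes the
ctime store through a single function that can later change behavior.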


Jeff Layton (92):
  ibmvmc: update ctime in conjunction with mtime on write
  bfs: update ctime in addition to mtime when adding entries
  efivarfs: update ctime when mtime changes on a write
  exfat: ensure that ctime is updated whenever the mtime is
  apparmor: update ctime whenever the mtime changes on an inode
  cifs: update the ctime on a partial page write
  fs: add ctime accessors infrastructure
  fs: new helper: simple_rename_timestamp
  btrfs: convert to simple_rename_timestamp
  ubifs: convert to simple_rename_timestamp
  shmem: convert to simple_rename_timestamp
  exfat: convert to simple_rename_timestamp
  ntfs3: convert to simple_rename_timestamp
  reiserfs: convert to simple_rename_timestamp
  spufs: convert to ctime accessor functions
  s390: convert to ctime accessor functions
  binderfs: convert to ctime accessor functions
  infiniband: convert to ctime accessor functions
  ibm: convert to ctime accessor functions
  usb: convert to ctime accessor functions
  9p: convert to ctime accessor functions
  adfs: convert to ctime accessor functions
  affs: convert to ctime accessor functions
  afs: convert to ctim

Re: [Cluster-devel] [PATCH 01/79] fs: add ctime accessors infrastructure

2023-06-22 Thread Jeff Layton
On Thu, 2023-06-22 at 09:46 +0900, Damien Le Moal wrote:
> On 6/21/23 23:45, Jeff Layton wrote:
> > struct timespec64 has unused bits in the tv_nsec field that can be used
> > for other purposes. In future patches, we're going to change how the
> > inode->i_ctime is accessed in certain inodes in order to make use of
> > them. In order to do that safely though, we'll need to eradicate raw
> > accesses of the inode->i_ctime field from the kernel.
> > 
> > Add new accessor functions for the ctime that we can use to replace them.
> > 
> > Signed-off-by: Jeff Layton 
> 
> [...]
> 
> > +/**
> > + * inode_ctime_peek - fetch the current ctime from the inode
> > + * @inode: inode from which to fetch ctime
> > + *
> > + * Grab the current ctime from the inode and return it.
> > + */
> > +static inline struct timespec64 inode_ctime_peek(const struct inode *inode)
> 
> To be consistent with inode_ctime_set(), why not call this one
> inode_ctime_get()

In later patches fetching the ctime for presentation may have side
effects on certain filesystems. Using "peek" here is a hint that we want
to avoid those side effects in these calls.

> ? Also, inode_set_ctime() & inode_get_ctime() may be a little more natural.
> But no strong opinion about that though.
> 

I like the consistency of the inode_ctime_* prefix. It makes it simpler
to find these calls when grepping, etc.

That said, my opinions on naming are pretty loosely-held, so if the
consensus is that the names should be as you suggest, I'll go along with
it.
-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH 00/79] fs: new accessors for inode->i_ctime

2023-06-21 Thread Jeff Layton
On Wed, 2023-06-21 at 15:21 -0400, Steven Rostedt wrote:
> On Wed, 21 Jun 2023 10:45:05 -0400
> Jeff Layton  wrote:
> 
> > Most of this conversion was done via coccinelle, with a few of the more
> > non-standard accesses done by hand. There should be no behavioral
> > changes with this set. That will come later, as we convert individual
> > filesystems to use multigrain timestamps.
> 
> BTW, Linus has suggested to me that whenever a coccinelle script is used,
> it should be included in the change log.
> 

Ok, here's what I have. I note again that my usage of coccinelle is
pretty primitive, so I ended up doing a fair bit of by-hand fixing after
applying these.

Given the way that this change is broken up into 77 patches by
subsystem, to which changelogs should I add it? I could add it to the
"infrastructure" patch, but that's the one where I _didn't_ use it. 

Maybe to patch #79 (the one that renames i_ctime)?


8<--
@@
expression inode;
@@

- inode->i_ctime = current_time(inode)
+ inode_set_current_ctime(inode)

@@
expression inode;
@@

- inode->i_ctime = inode->i_mtime = current_time(inode)
+ inode->i_mtime = inode_set_current_ctime(inode)

@@
struct inode *inode;
expression value;
@@

- inode->i_ctime = value;
+ inode_set_ctime(inode, value);

@@
struct inode *inode;
expression val;
@@
- inode->i_ctime.tv_sec = val
+ inode_set_ctime_sec(inode, val)

@@
struct inode *inode;
expression val;
@@
- inode->i_ctime.tv_nsec = val
+ inode_set_ctime_nsec(inode, val)

@@
struct inode *inode;
@@
- inode->i_ctime
+ inode_ctime_peek(inode)



Re: [Cluster-devel] [PATCH 01/79] fs: add ctime accessors infrastructure

2023-06-21 Thread Jeff Layton
On Wed, 2023-06-21 at 14:19 -0400, Tom Talpey wrote:
> On 6/21/2023 2:01 PM, Jeff Layton wrote:
> > On Wed, 2023-06-21 at 13:29 -0400, Tom Talpey wrote:
> > > On 6/21/2023 10:45 AM, Jeff Layton wrote:
> > > > struct timespec64 has unused bits in the tv_nsec field that can be used
> > > > for other purposes. In future patches, we're going to change how the
> > > > inode->i_ctime is accessed in certain inodes in order to make use of
> > > > them. In order to do that safely though, we'll need to eradicate raw
> > > > accesses of the inode->i_ctime field from the kernel.
> > > > 
> > > > Add new accessor functions for the ctime that we can use to replace 
> > > > them.
> > > > 
> > > > Signed-off-by: Jeff Layton 
> > > > ---
> > > >fs/inode.c | 16 ++
> > > >include/linux/fs.h | 53 
> > > > +-
> > > >2 files changed, 68 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index d37fad91c8da..c005e7328fbb 100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode 
> > > > *inode)
> > > >}
> > > >EXPORT_SYMBOL(current_time);
> > > >
> > > > +/**
> > > > + * inode_ctime_set_current - set the ctime to current_time
> > > > + * @inode: inode
> > > > + *
> > > > + * Set the inode->i_ctime to the current value for the inode. Returns
> > > > + * the current value that was assigned to i_ctime.
> > > > + */
> > > > +struct timespec64 inode_ctime_set_current(struct inode *inode)
> > > > +{
> > > > +   struct timespec64 now = current_time(inode);
> > > > +
> > > > > +   inode_ctime_set(inode, now);
> > > > +   return now;
> > > > +}
> > > > +EXPORT_SYMBOL(inode_ctime_set_current);
> > > > +
> > > >/**
> > > > * in_group_or_capable - check whether caller is CAP_FSETID 
> > > > privileged
> > > > * @idmap:   idmap of the mount @inode was found from
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 6867512907d6..9afb30606373 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -1474,7 +1474,58 @@ static inline bool fsuidgid_has_mapping(struct 
> > > > super_block *sb,
> > > >kgid_has_mapping(fs_userns, kgid);
> > > >}
> > > >
> > > > -extern struct timespec64 current_time(struct inode *inode);
> > > > +struct timespec64 current_time(struct inode *inode);
> > > > +struct timespec64 inode_ctime_set_current(struct inode *inode);
> > > > +
> > > > +/**
> > > > + * inode_ctime_peek - fetch the current ctime from the inode
> > > > + * @inode: inode from which to fetch ctime
> > > > + *
> > > > + * Grab the current ctime from the inode and return it.
> > > > + */
> > > > +static inline struct timespec64 inode_ctime_peek(const struct inode 
> > > > *inode)
> > > > +{
> > > > +   return inode->i_ctime;
> > > > +}
> > > > +
> > > > +/**
> > > > + * inode_ctime_set - set the ctime in the inode to the given value
> > > > + * @inode: inode in which to set the ctime
> > > > + * @ts: timespec value to set the ctime
> > > > + *
> > > > + * Set the ctime in @inode to @ts.
> > > > + */
> > > > +static inline struct timespec64 inode_ctime_set(struct inode *inode, 
> > > > struct timespec64 ts)
> > > > +{
> > > > +   inode->i_ctime = ts;
> > > > +   return ts;
> > > > +}
> > > > +
> > > > +/**
> > > > + * inode_ctime_set_sec - set only the tv_sec field in the inode ctime
> > > 
> > > I'm curious about why you choose to split the tv_sec and tv_nsec
> > > set_ functions. Do any callers not set them both? Wouldn't a
> > > single call enable a more atomic behavior someday?
> > > 
> > > inode_ctime_set_sec_nsec(struct inode *, time64_t, time64_t)
> > > 
> > > (or simply initialize a timespec64 and use in

Re: [Cluster-devel] [PATCH 01/79] fs: add ctime accessors infrastructure

2023-06-21 Thread Jeff Layton
On Wed, 2023-06-21 at 13:29 -0400, Tom Talpey wrote:
> On 6/21/2023 10:45 AM, Jeff Layton wrote:
> > struct timespec64 has unused bits in the tv_nsec field that can be used
> > for other purposes. In future patches, we're going to change how the
> > inode->i_ctime is accessed in certain inodes in order to make use of
> > them. In order to do that safely though, we'll need to eradicate raw
> > accesses of the inode->i_ctime field from the kernel.
> > 
> > Add new accessor functions for the ctime that we can use to replace them.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >   fs/inode.c | 16 ++
> >   include/linux/fs.h | 53 +-
> >   2 files changed, 68 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index d37fad91c8da..c005e7328fbb 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode *inode)
> >   }
> >   EXPORT_SYMBOL(current_time);
> >   
> > +/**
> > + * inode_ctime_set_current - set the ctime to current_time
> > + * @inode: inode
> > + *
> > + * Set the inode->i_ctime to the current value for the inode. Returns
> > + * the current value that was assigned to i_ctime.
> > + */
> > +struct timespec64 inode_ctime_set_current(struct inode *inode)
> > +{
> > +   struct timespec64 now = current_time(inode);
> > +
> > +   inode_ctime_set(inode, now);
> > +   return now;
> > +}
> > +EXPORT_SYMBOL(inode_ctime_set_current);
> > +
> >   /**
> >* in_group_or_capable - check whether caller is CAP_FSETID privileged
> >* @idmap:idmap of the mount @inode was found from
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 6867512907d6..9afb30606373 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1474,7 +1474,58 @@ static inline bool fsuidgid_has_mapping(struct 
> > super_block *sb,
> >kgid_has_mapping(fs_userns, kgid);
> >   }
> >   
> > -extern struct timespec64 current_time(struct inode *inode);
> > +struct timespec64 current_time(struct inode *inode);
> > +struct timespec64 inode_ctime_set_current(struct inode *inode);
> > +
> > +/**
> > + * inode_ctime_peek - fetch the current ctime from the inode
> > + * @inode: inode from which to fetch ctime
> > + *
> > + * Grab the current ctime from the inode and return it.
> > + */
> > +static inline struct timespec64 inode_ctime_peek(const struct inode *inode)
> > +{
> > +   return inode->i_ctime;
> > +}
> > +
> > +/**
> > + * inode_ctime_set - set the ctime in the inode to the given value
> > + * @inode: inode in which to set the ctime
> > + * @ts: timespec value to set the ctime
> > + *
> > + * Set the ctime in @inode to @ts.
> > + */
> > +static inline struct timespec64 inode_ctime_set(struct inode *inode, 
> > struct timespec64 ts)
> > +{
> > +   inode->i_ctime = ts;
> > +   return ts;
> > +}
> > +
> > +/**
> > + * inode_ctime_set_sec - set only the tv_sec field in the inode ctime
> 
> I'm curious about why you choose to split the tv_sec and tv_nsec
> set_ functions. Do any callers not set them both? Wouldn't a
> single call enable a more atomic behavior someday?
> 
>inode_ctime_set_sec_nsec(struct inode *, time64_t, time64_t)
> 
> (or simply initialize a timespec64 and use inode_ctime_spec() )
> 

Yes, quite a few places set the fields individually. For example, when
loading a value from disk that doesn't have sufficient granularity to
set the nsecs field to anything but 0.

Could I have done it by declaring a local timespec64 variable and just
using the inode_ctime_set function in these places? Absolutely.

That's a bit more difficult to handle with coccinelle though. If someone
wants to suggest a way to do that without having to change all of these
call sites manually, then I'm open to redoing the set.

That might be better left for a later cleanup though.
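
For reference, the single-call form being discussed could look roughly like
the userspace sketch below. The model_inode structure and the model_* names
are hypothetical and exist only for this example; the point is that a caller
loading a coarse on-disk timestamp passes 0 for the nanoseconds instead of
poking tv_sec and tv_nsec separately.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types. */
struct timespec64 {
	int64_t tv_sec;
	long    tv_nsec;
};

struct model_inode {
	struct timespec64 i_ctime;
};

/* Set the whole ctime from a timespec64 and return it. */
static inline struct timespec64 model_ctime_set(struct model_inode *inode,
						struct timespec64 ts)
{
	inode->i_ctime = ts;
	return ts;
}

/* Combined setter: both fields are written in one call. */
static inline struct timespec64 model_ctime_set_sec_nsec(struct model_inode *inode,
							 int64_t sec, long nsec)
{
	struct timespec64 ts = { .tv_sec = sec, .tv_nsec = nsec };

	assert(nsec >= 0 && nsec < 1000000000L);
	return model_ctime_set(inode, ts);
}

/* An on-disk format with one-second granularity has no nsec to load. */
static inline void model_load_ctime_from_disk(struct model_inode *inode,
					      uint32_t disk_seconds)
{
	model_ctime_set_sec_nsec(inode, disk_seconds, 0);
}
```

v2 of the series takes this route: it drops the individual
inode_ctime_set_{sec,nsec} helpers in favor of a setter that takes both
values.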

> > + * @inode: inode in which to set the ctime
> > + * @sec:  value to set the tv_sec field
> > + *
> > + * Set the sec field in the ctime. Returns @sec.
> > + */
> > +static inline time64_t inode_ctime_set_sec(struct inode *inode, time64_t 
> > sec)
> > +{
> > +   inode->i_ctime.tv_sec = sec;
> > +   return sec;
> > +}
> > +
> > +/**
> > + * inode_ctime_set_nsec - set only the tv_nsec field in the inode ctime
> > + * @inode: inode in which to set the ctime
> > + * @nsec:  value to set the tv_nsec field
> > + *
> > + * Set the nsec field in the ctime. Returns @nsec.
> > + */
> > +static inline long inode_ctime_set_nsec(struct inode *inode, long nsec)
> > +{
> > +   inode->i_ctime.tv_nsec = nsec;
> > +   return nsec;
> > +}
> >   
> >   /*
> >* Snapshotting support.

-- 
Jeff Layton 



[Cluster-devel] [PATCH 79/79] fs: rename i_ctime field to __i_ctime

2023-06-21 Thread Jeff Layton
Now that everything in-tree is converted to use the accessor functions,
rename the i_ctime field in the inode to make its accesses more
self-documenting.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9afb30606373..2ca46c532b49 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -642,7 +642,7 @@ struct inode {
loff_t  i_size;
struct timespec64   i_atime;
struct timespec64   i_mtime;
-   struct timespec64   i_ctime;
+   struct timespec64   __i_ctime; /* use inode_ctime_* accessors! */
spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
unsigned short  i_bytes;
u8  i_blkbits;
@@ -1485,7 +1485,7 @@ struct timespec64 inode_ctime_set_current(struct inode *inode);
  */
 static inline struct timespec64 inode_ctime_peek(const struct inode *inode)
 {
-   return inode->i_ctime;
+   return inode->__i_ctime;
 }
 
 /**
@@ -1497,7 +1497,7 @@ static inline struct timespec64 inode_ctime_peek(const struct inode *inode)
  */
 static inline struct timespec64 inode_ctime_set(struct inode *inode, struct timespec64 ts)
 {
-   inode->i_ctime = ts;
+   inode->__i_ctime = ts;
return ts;
 }
 
@@ -1510,7 +1510,7 @@ static inline struct timespec64 inode_ctime_set(struct inode *inode, struct time
  */
 static inline time64_t inode_ctime_set_sec(struct inode *inode, time64_t sec)
 {
-   inode->i_ctime.tv_sec = sec;
+   inode->__i_ctime.tv_sec = sec;
return sec;
 }
 
@@ -1523,7 +1523,7 @@ static inline time64_t inode_ctime_set_sec(struct inode *inode, time64_t sec)
  */
 static inline long inode_ctime_set_nsec(struct inode *inode, long nsec)
 {
-   inode->i_ctime.tv_nsec = nsec;
+   inode->__i_ctime.tv_nsec = nsec;
return nsec;
 }
 
-- 
2.41.0



[Cluster-devel] [PATCH 34/79] gfs2: switch to new ctime accessors

2023-06-21 Thread Jeff Layton
In later patches, we're going to change how the ctime.tv_nsec field is
utilized. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.

Signed-off-by: Jeff Layton 
---
 fs/gfs2/acl.c   |  2 +-
 fs/gfs2/bmap.c  | 11 +--
 fs/gfs2/dir.c   | 15 ---
 fs/gfs2/file.c  |  2 +-
 fs/gfs2/glops.c |  4 ++--
 fs/gfs2/inode.c |  8 
 fs/gfs2/super.c |  4 ++--
 fs/gfs2/xattr.c |  8 
 8 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/fs/gfs2/acl.c b/fs/gfs2/acl.c
index a392aa0f041d..b267dae0dc63 100644
--- a/fs/gfs2/acl.c
+++ b/fs/gfs2/acl.c
@@ -142,7 +142,7 @@ int gfs2_set_acl(struct mnt_idmap *idmap, struct dentry *dentry,
 
ret = __gfs2_set_acl(inode, acl, type);
if (!ret && mode != inode->i_mode) {
-   inode->i_ctime = current_time(inode);
+   inode_ctime_set_current(inode);
inode->i_mode = mode;
mark_inode_dirty(inode);
}
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 8d611fbcf0bd..743b09a0b196 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1386,7 +1386,7 @@ static int trunc_start(struct inode *inode, u64 newsize)
ip->i_diskflags |= GFS2_DIF_TRUNC_IN_PROG;
 
i_size_write(inode, newsize);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   inode->i_mtime = inode_ctime_set_current(inode);
gfs2_dinode_out(ip, dibh->b_data);
 
if (journaled)
@@ -1583,8 +1583,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 
/* Every transaction boundary, we rewrite the dinode
   to keep its di_blocks current in case of failure. */
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime =
-   current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_ctime_set_current(&ip->i_inode);
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
brelse(dibh);
@@ -1950,7 +1949,7 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
gfs2_statfs_change(sdp, 0, +btotal, 0);
gfs2_quota_change(ip, -(s64)btotal, ip->i_inode.i_uid,
  ip->i_inode.i_gid);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_ctime_set_current(&ip->i_inode);
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
up_write(&ip->i_rw_mutex);
@@ -1993,7 +1992,7 @@ static int trunc_end(struct gfs2_inode *ip)
gfs2_buffer_clear_tail(dibh, sizeof(struct gfs2_dinode));
gfs2_ordered_del_inode(ip);
}
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_ctime_set_current(&ip->i_inode);
ip->i_diskflags &= ~GFS2_DIF_TRUNC_IN_PROG;
 
gfs2_trans_add_meta(ip->i_gl, dibh);
@@ -2094,7 +2093,7 @@ static int do_grow(struct inode *inode, u64 size)
goto do_end_trans;
 
truncate_setsize(inode, size);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_ctime_set_current(&ip->i_inode);
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
brelse(dibh);
diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 54a6d17b8c25..c07cb9883ea1 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -130,7 +130,7 @@ static int gfs2_dir_write_stuffed(struct gfs2_inode *ip, const char *buf,
memcpy(dibh->b_data + offset + sizeof(struct gfs2_dinode), buf, size);
if (ip->i_inode.i_size < offset + size)
i_size_write(&ip->i_inode, offset + size);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_ctime_set_current(&ip->i_inode);
gfs2_dinode_out(ip, dibh->b_data);
 
brelse(dibh);
@@ -227,7 +227,7 @@ static int gfs2_dir_write_data(struct gfs2_inode *ip, const char *buf,
 
if (ip->i_inode.i_size < offset + copied)
i_size_write(&ip->i_inode, offset + copied);
-   ip->i_inode.i_mtime = ip->i_inode.i_ctime = current_time(&ip->i_inode);
+   ip->i_inode.i_mtime = inode_ctime_set_current(&ip->i_inode);
 
gfs2_trans_add_meta(ip->i_gl, dibh);
gfs2_dinode_out(ip, dibh->b_data);
@@ -1814,7 +1814,7 @@ int gfs2_dir_add(struct inode *inode, const struct qstr *name,

[Cluster-devel] [PATCH 00/79] fs: new accessors for inode->i_ctime

2023-06-21 Thread Jeff Layton
I've been working on a patchset to change how the inode->i_ctime is
accessed in order to give us conditional, high-res timestamps for the
ctime and mtime. struct timespec64 has unused bits in it that we can use
to implement this. In order to do that however, we need to wrap all
accesses of inode->i_ctime to ensure that bits used as flags are
appropriately handled.

This patchset first adds some new inode_ctime_* accessor functions. It
then converts all in-tree accesses of inode->i_ctime to use those new
functions and then renames the i_ctime field to __i_ctime to help ensure
that the accessors are used.

Most of this conversion was done via coccinelle, with a few of the more
non-standard accesses done by hand. There should be no behavioral
changes with this set. That will come later, as we convert individual
filesystems to use multigrain timestamps.

Some of these patches depend on the set I sent recently to add missing
ctime updates in various subsystems:


https://lore.kernel.org/linux-fsdevel/20230612104524.17058-1-jlay...@kernel.org/T/#m25399f903cc9526e46b2e0f5a35713c80b52fde9

Since this patchset is so large, I'm only going to send individual
conversion patches to the appropriate maintainers. Please send
Acked-by's or Reviewed-by's if you can. The intent is to merge these as
a set (probably in v6.6). Let me know if that causes conflicts though,
and we can work it out.

This is based on top of linux-next as of yesterday.

Jeff Layton (79):
  fs: add ctime accessors infrastructure
  spufs: switch to new ctime accessors
  s390: switch to new ctime accessors
  binderfs: switch to new ctime accessors
  qib_fs: switch to new ctime accessors
  ibm: switch to new ctime accessors
  usb: switch to new ctime accessors
  9p: switch to new ctime accessors
  adfs: switch to new ctime accessors
  affs: switch to new ctime accessors
  afs: switch to new ctime accessors
  fs: switch to new ctime accessors
  autofs: switch to new ctime accessors
  befs: switch to new ctime accessors
  bfs: switch to new ctime accessors
  btrfs: switch to new ctime accessors
  ceph: switch to new ctime accessors
  coda: switch to new ctime accessors
  configfs: switch to new ctime accessors
  cramfs: switch to new ctime accessors
  debugfs: switch to new ctime accessors
  devpts: switch to new ctime accessors
  ecryptfs: switch to new ctime accessors
  efivarfs: switch to new ctime accessors
  efs: switch to new ctime accessors
  erofs: switch to new ctime accessors
  exfat: switch to new ctime accessors
  ext2: switch to new ctime accessors
  ext4: switch to new ctime accessors
  f2fs: switch to new ctime accessors
  fat: switch to new ctime accessors
  freevxfs: switch to new ctime accessors
  fuse: switch to new ctime accessors
  gfs2: switch to new ctime accessors
  hfs: switch to new ctime accessors
  hfsplus: switch to new ctime accessors
  hostfs: switch to new ctime accessors
  hpfs: switch to new ctime accessors
  hugetlbfs: switch to new ctime accessors
  isofs: switch to new ctime accessors
  jffs2: switch to new ctime accessors
  jfs: switch to new ctime accessors
  kernfs: switch to new ctime accessors
  minix: switch to new ctime accessors
  nfs: switch to new ctime accessors
  nfsd: switch to new ctime accessors
  nilfs2: switch to new ctime accessors
  ntfs: switch to new ctime accessors
  ntfs3: switch to new ctime accessors
  ocfs2: switch to new ctime accessors
  omfs: switch to new ctime accessors
  openpromfs: switch to new ctime accessors
  orangefs: switch to new ctime accessors
  overlayfs: switch to new ctime accessors
  proc: switch to new ctime accessors
  pstore: switch to new ctime accessors
  qnx4: switch to new ctime accessors
  qnx6: switch to new ctime accessors
  ramfs: switch to new ctime accessors
  reiserfs: switch to new ctime accessors
  romfs: switch to new ctime accessors
  smb: switch to new ctime accessors
  squashfs: switch to new ctime accessors
  sysv: switch to new ctime accessors
  tracefs: switch to new ctime accessors
  ubifs: switch to new ctime accessors
  udf: switch to new ctime accessors
  ufs: switch to new ctime accessors
  vboxsf: switch to new ctime accessors
  xfs: switch to new ctime accessors
  zonefs: switch to new ctime accessors
  mqueue: switch to new ctime accessors
  bpf: switch to new ctime accessors
  shmem: switch to new ctime accessors
  rpc_pipefs: switch to new ctime accessors
  apparmor: switch to new ctime accessors
  security: switch to new ctime accessors
  selinux: switch to new ctime accessors
  fs: rename i_ctime field to __i_ctime

 arch/powerpc/platforms/cell/spufs/inode.c |  2 +-
 arch/s390/hypfs/inode.c   |  4 +-
 drivers/android/binderfs.c|  8 +--
 drivers/infiniband/hw/qib/qib_fs.c|  4 +-
 drivers/misc/ibmasm/ibmasmfs.c|  2 +-
 drivers/misc/ibmvmc.c |  2 +-
 drivers/usb/core/devio.c  | 16 +++---
 drivers/usb/gadget/function/f_fs.c

[Cluster-devel] [PATCH 01/79] fs: add ctime accessors infrastructure

2023-06-21 Thread Jeff Layton
struct timespec64 has unused bits in the tv_nsec field that can be used
for other purposes. In future patches, we're going to change how the
inode->i_ctime is accessed in certain inodes in order to make use of
them. In order to do that safely though, we'll need to eradicate raw
accesses of the inode->i_ctime field from the kernel.

Add new accessor functions for the ctime that we can use to replace them.

Signed-off-by: Jeff Layton 
---
 fs/inode.c | 16 ++
 include/linux/fs.h | 53 +-
 2 files changed, 68 insertions(+), 1 deletion(-)

diff --git a/fs/inode.c b/fs/inode.c
index d37fad91c8da..c005e7328fbb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode *inode)
 }
 EXPORT_SYMBOL(current_time);
 
+/**
+ * inode_ctime_set_current - set the ctime to current_time
+ * @inode: inode
+ *
+ * Set the inode->i_ctime to the current value for the inode. Returns
+ * the current value that was assigned to i_ctime.
+ */
+struct timespec64 inode_ctime_set_current(struct inode *inode)
+{
+   struct timespec64 now = current_time(inode);
+
+   inode_ctime_set(inode, now);
+   return now;
+}
+EXPORT_SYMBOL(inode_ctime_set_current);
+
 /**
  * in_group_or_capable - check whether caller is CAP_FSETID privileged
  * @idmap: idmap of the mount @inode was found from
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6867512907d6..9afb30606373 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1474,7 +1474,58 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
   kgid_has_mapping(fs_userns, kgid);
 }
 
-extern struct timespec64 current_time(struct inode *inode);
+struct timespec64 current_time(struct inode *inode);
+struct timespec64 inode_ctime_set_current(struct inode *inode);
+
+/**
+ * inode_ctime_peek - fetch the current ctime from the inode
+ * @inode: inode from which to fetch ctime
+ *
+ * Grab the current ctime from the inode and return it.
+ */
+static inline struct timespec64 inode_ctime_peek(const struct inode *inode)
+{
+   return inode->i_ctime;
+}
+
+/**
+ * inode_ctime_set - set the ctime in the inode to the given value
+ * @inode: inode in which to set the ctime
+ * @ts: timespec value to set the ctime
+ *
+ * Set the ctime in @inode to @ts.
+ */
+static inline struct timespec64 inode_ctime_set(struct inode *inode, struct timespec64 ts)
+{
+   inode->i_ctime = ts;
+   return ts;
+}
+
+/**
+ * inode_ctime_set_sec - set only the tv_sec field in the inode ctime
+ * @inode: inode in which to set the ctime
+ * @sec:  value to set the tv_sec field
+ *
+ * Set the sec field in the ctime. Returns @sec.
+ */
+static inline time64_t inode_ctime_set_sec(struct inode *inode, time64_t sec)
+{
+   inode->i_ctime.tv_sec = sec;
+   return sec;
+}
+
+/**
+ * inode_ctime_set_nsec - set only the tv_nsec field in the inode ctime
+ * @inode: inode in which to set the ctime
+ * @nsec:  value to set the tv_nsec field
+ *
+ * Set the nsec field in the ctime. Returns @nsec.
+ */
+static inline long inode_ctime_set_nsec(struct inode *inode, long nsec)
+{
+   inode->i_ctime.tv_nsec = nsec;
+   return nsec;
+}
 
 /*
  * Snapshotting support.
-- 
2.41.0



Re: [Cluster-devel] [PATCH 7/9] gfs2: update ctime when quota is updated

2023-06-12 Thread Jeff Layton
On Fri, 2023-06-09 at 18:44 +0200, Andreas Gruenbacher wrote:
> Jeff,
> 
> On Fri, Jun 9, 2023 at 2:50 PM Jeff Layton  wrote:
> > Signed-off-by: Jeff Layton 
> > ---
> >  fs/gfs2/quota.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
> > index 1ed17226d9ed..6d283e071b90 100644
> > --- a/fs/gfs2/quota.c
> > +++ b/fs/gfs2/quota.c
> > @@ -869,7 +869,7 @@ static int gfs2_adjust_quota(struct gfs2_inode *ip, loff_t loc,
> > size = loc + sizeof(struct gfs2_quota);
> > if (size > inode->i_size)
> > i_size_write(inode, size);
> > -   inode->i_mtime = inode->i_atime = current_time(inode);
> > +   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
> 
> I don't think we need to worry about the ctime of the quota inode as
> that inode is internal to the filesystem only.
> 

Thanks Andreas.  I'll plan to drop this patch from the series for now.

Does updating the mtime and atime here serve any purpose, or should
those also be removed? If you plan to keep the a/mtime updates then I'd
still suggest updating the ctime for consistency's sake. It shouldn't
cost anything extra to do so since you're dirtying the inode below
anyway.

Thanks!

> > mark_inode_dirty(inode);
> > set_bit(QDF_REFRESH, &qd->qd_flags);
> > }
> > --
> > 2.40.1
> > 
> 
> Thanks,
> Andreas
> 

-- 
Jeff Layton 



Re: [Cluster-devel] [PATCH 0/9] fs: add some missing ctime updates

2023-06-09 Thread Jeff Layton
On Fri, 2023-06-09 at 15:10 +0200, Greg Kroah-Hartman wrote:
> On Fri, Jun 09, 2023 at 08:50:14AM -0400, Jeff Layton wrote:
> > While working on a patch series to change how we handle the ctime, I
> > found a number of places that update the mtime without a corresponding
> > ctime update. POSIX requires that when the mtime is updated that the
> > ctime also be updated.
> > 
> > Note that these are largely untested other than for compilation, so
> > please review carefully. These are a preliminary set for the upcoming
> > rework of how we handle the ctime.
> > 
> > None of these seem to be very crucial, but it would be nice if
> > various maintainers could pick these up for v6.5. Please let me know if
> > you do.
> > 
> > Jeff Layton (9):
> >   ibmvmc: update ctime in conjunction with mtime on write
> >   usb: update the ctime as well when updating mtime after an ioctl
> >   autofs: set ctime as well when mtime changes on a dir
> >   bfs: update ctime in addition to mtime when adding entries
> >   efivarfs: update ctime when mtime changes on a write
> >   exfat: ensure that ctime is updated whenever the mtime is
> >   gfs2: update ctime when quota is updated
> >   apparmor: update ctime whenever the mtime changes on an inode
> >   cifs: update the ctime on a partial page write
> > 
> >  drivers/misc/ibmvmc.c             |  2 +-
> >  drivers/usb/core/devio.c          | 16 ++++++++--------
> >  fs/autofs/root.c                  |  6 +++---
> >  fs/bfs/dir.c                      |  2 +-
> >  fs/efivarfs/file.c                |  2 +-
> >  fs/exfat/namei.c                  |  8 ++++----
> >  fs/gfs2/quota.c                   |  2 +-
> >  fs/smb/client/file.c              |  2 +-
> >  security/apparmor/apparmorfs.c    |  7 +++++--
> >  security/apparmor/policy_unpack.c | 11 +++++++----
> >  10 files changed, 32 insertions(+), 26 deletions(-)
> > 
> > -- 
> > 2.40.1
> > 
> 
> All of these need commit log messages, didn't checkpatch warn you about
> that?

It did, once I ran it. ;)

I'll repost the set with more elaborate changelogs.
-- 
Jeff Layton 
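
The change repeated across this series is small: wherever a call site assigned only the mtime, it now assigns the ctime from the same timestamp. A minimal userspace sketch of the before and after shapes follows. The demo_ types and functions are hypothetical stand-ins made up for this example, and demo_current_time() only reads the wall clock in whole seconds, which is an assumption and not what the kernel's current_time() does.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the kernel's struct timespec64. */
struct demo_timespec64 {
	long long tv_sec;
	long tv_nsec;
};

/* Hypothetical stand-in for struct inode; only the three timestamps. */
struct demo_inode {
	struct demo_timespec64 i_atime;
	struct demo_timespec64 i_mtime;
	struct demo_timespec64 i_ctime;
};

/* Assumed stand-in for current_time(): wall clock, whole seconds only. */
static inline struct demo_timespec64 demo_current_time(const struct demo_inode *inode)
{
	struct demo_timespec64 now;

	(void)inode;
	now.tv_sec = (long long)time(NULL);
	now.tv_nsec = 0;
	return now;
}

/* Shape before the series: mtime moves, ctime keeps its old value. */
static inline void demo_update_mtime_only(struct demo_inode *inode)
{
	inode->i_mtime = demo_current_time(inode);
}

/*
 * Shape after the series: the clock is sampled once and the value is
 * assigned right to left, first to ctime and then to mtime.
 */
static inline void demo_update_mtime_and_ctime(struct demo_inode *inode)
{
	inode->i_mtime = inode->i_ctime = demo_current_time(inode);
}
```

Because the chained assignment samples the clock once, the two fields cannot end up with different values from the same update, which calling the clock function twice would not guarantee.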



[Cluster-devel] [PATCH 2/9] usb: update the ctime as well when updating mtime after an ioctl

2023-06-09 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 drivers/usb/core/devio.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/usb/core/devio.c b/drivers/usb/core/devio.c
index fcf68818e999..1268d313a8df 100644
--- a/drivers/usb/core/devio.c
+++ b/drivers/usb/core/devio.c
@@ -2640,21 +2640,21 @@ static long usbdev_do_ioctl(struct file *file, unsigned int cmd,
snoop(&dev->dev, "%s: CONTROL\n", __func__);
ret = proc_control(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_BULK:
snoop(&dev->dev, "%s: BULK\n", __func__);
ret = proc_bulk(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_RESETEP:
snoop(&dev->dev, "%s: RESETEP\n", __func__);
ret = proc_resetep(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_RESET:
@@ -2666,7 +2666,7 @@ static long usbdev_do_ioctl(struct file *file, unsigned int cmd,
snoop(&dev->dev, "%s: CLEAR_HALT\n", __func__);
ret = proc_clearhalt(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_GETDRIVER:
@@ -2693,7 +2693,7 @@ static long usbdev_do_ioctl(struct file *file, unsigned int cmd,
snoop(&dev->dev, "%s: SUBMITURB\n", __func__);
ret = proc_submiturb(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
 #ifdef CONFIG_COMPAT
@@ -2701,14 +2701,14 @@ static long usbdev_do_ioctl(struct file *file, unsigned int cmd,
snoop(&dev->dev, "%s: CONTROL32\n", __func__);
ret = proc_control_compat(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_BULK32:
snoop(&dev->dev, "%s: BULK32\n", __func__);
ret = proc_bulk_compat(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_DISCSIGNAL32:
@@ -2720,7 +2720,7 @@ static long usbdev_do_ioctl(struct file *file, unsigned int cmd,
snoop(&dev->dev, "%s: SUBMITURB32\n", __func__);
ret = proc_submiturb_compat(ps, p);
if (ret >= 0)
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
break;
 
case USBDEVFS_IOCTL32:
-- 
2.40.1



[Cluster-devel] [PATCH 6/9] exfat: ensure that ctime is updated whenever the mtime is

2023-06-09 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/exfat/namei.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index e0ff9d156f6f..d9b46fa36bff 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -817,7 +817,7 @@ static int exfat_unlink(struct inode *dir, struct dentry *dentry)
ei->dir.dir = DIR_DELETED;
 
inode_inc_iversion(dir);
-   dir->i_mtime = dir->i_atime = current_time(dir);
+   dir->i_mtime = dir->i_atime = dir->i_ctime = current_time(dir);
exfat_truncate_atime(&dir->i_atime);
if (IS_DIRSYNC(dir))
exfat_sync_inode(dir);
@@ -825,7 +825,7 @@ static int exfat_unlink(struct inode *dir, struct dentry *dentry)
mark_inode_dirty(dir);
 
clear_nlink(inode);
-   inode->i_mtime = inode->i_atime = current_time(inode);
+   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
exfat_truncate_atime(&inode->i_atime);
exfat_unhash_inode(inode);
exfat_d_version_set(dentry, inode_query_iversion(dir));
@@ -979,7 +979,7 @@ static int exfat_rmdir(struct inode *dir, struct dentry *dentry)
ei->dir.dir = DIR_DELETED;
 
inode_inc_iversion(dir);
-   dir->i_mtime = dir->i_atime = current_time(dir);
+   dir->i_mtime = dir->i_atime = dir->i_ctime = current_time(dir);
exfat_truncate_atime(&dir->i_atime);
if (IS_DIRSYNC(dir))
exfat_sync_inode(dir);
@@ -988,7 +988,7 @@ static int exfat_rmdir(struct inode *dir, struct dentry *dentry)
drop_nlink(dir);
 
clear_nlink(inode);
-   inode->i_mtime = inode->i_atime = current_time(inode);
+   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
exfat_truncate_atime(&inode->i_atime);
exfat_unhash_inode(inode);
exfat_d_version_set(dentry, inode_query_iversion(dir));
-- 
2.40.1



[Cluster-devel] [PATCH 8/9] apparmor: update ctime whenever the mtime changes on an inode

2023-06-09 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 security/apparmor/apparmorfs.c    |  7 +++++--
 security/apparmor/policy_unpack.c | 11 +++++++----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index db7a51acf9db..c06053718836 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -1554,8 +1554,11 @@ void __aafs_profile_migrate_dents(struct aa_profile *old,
 
for (i = 0; i < AAFS_PROF_SIZEOF; i++) {
new->dents[i] = old->dents[i];
-   if (new->dents[i])
-   new->dents[i]->d_inode->i_mtime = current_time(new->dents[i]->d_inode);
+   if (new->dents[i]) {
+   struct inode *inode = d_inode(new->dents[i]);
+
+   inode->i_mtime = inode->i_ctime = current_time(inode);
+   }
old->dents[i] = NULL;
}
 }
diff --git a/security/apparmor/policy_unpack.c b/security/apparmor/policy_unpack.c
index cf2ceec40b28..48a97c1800b9 100644
--- a/security/apparmor/policy_unpack.c
+++ b/security/apparmor/policy_unpack.c
@@ -86,10 +86,13 @@ void __aa_loaddata_update(struct aa_loaddata *data, long revision)
 
data->revision = revision;
if ((data->dents[AAFS_LOADDATA_REVISION])) {
-   d_inode(data->dents[AAFS_LOADDATA_DIR])->i_mtime =
-   current_time(d_inode(data->dents[AAFS_LOADDATA_DIR]));
-   d_inode(data->dents[AAFS_LOADDATA_REVISION])->i_mtime =
-   current_time(d_inode(data->dents[AAFS_LOADDATA_REVISION]));
+   struct inode *inode;
+
+   inode = d_inode(data->dents[AAFS_LOADDATA_DIR]);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
+
+   inode = d_inode(data->dents[AAFS_LOADDATA_REVISION]);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
}
 }
 
-- 
2.40.1



[Cluster-devel] [PATCH 9/9] cifs: update the ctime on a partial page write

2023-06-09 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/smb/client/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/smb/client/file.c b/fs/smb/client/file.c
index df88b8c04d03..a00038a326cf 100644
--- a/fs/smb/client/file.c
+++ b/fs/smb/client/file.c
@@ -2596,7 +2596,7 @@ static int cifs_partialpagewrite(struct page *page, unsigned from, unsigned to)
   write_data, to - from, &offset);
cifsFileInfo_put(open_file);
/* Does mm or vfs already set times? */
-   inode->i_atime = inode->i_mtime = current_time(inode);
+   inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
if ((bytes_written > 0) && (offset))
rc = 0;
else if (bytes_written < 0)
-- 
2.40.1



[Cluster-devel] [PATCH 5/9] efivarfs: update ctime when mtime changes on a write

2023-06-09 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 fs/efivarfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/efivarfs/file.c b/fs/efivarfs/file.c
index d57ee15874f9..375576111dc3 100644
--- a/fs/efivarfs/file.c
+++ b/fs/efivarfs/file.c
@@ -51,7 +51,7 @@ static ssize_t efivarfs_file_write(struct file *file,
} else {
inode_lock(inode);
i_size_write(inode, datasize + sizeof(attributes));
-   inode->i_mtime = current_time(inode);
+   inode->i_mtime = inode->i_ctime = current_time(inode);
inode_unlock(inode);
}
 
-- 
2.40.1


