from:"Darrick J. Wong"

Re: [PATCH v6 3/9] fs: add percpu counters for significant multigrain timestamp events

2024-07-15 Thread Darrick J. Wong

On Mon, Jul 15, 2024 at 03:53:42PM -0400, Jeff Layton wrote:
> On Mon, 2024-07-15 at 11:32 -0700, Darrick J. Wong wrote:
> > On Mon, Jul 15, 2024 at 08:48:54AM -0400, Jeff Layton wrote:
> > > Four percpu counters for counting various stats around mgtimes, and a
> > > new debugfs file for displaying them:
> > > 
> > > - number of attempted ctime updates
> > > - number of successful i_ctime_nsec swaps
> > > - number of fine-grained timestamp fetches
> > > - number of floor value swaps
> > > 
> > > Reviewed-by: Josef Bacik 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/inode.c | 70
> > > +-
> > >  1 file changed, 69 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 869994285e87..fff844345c35 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -21,6 +21,8 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > > +#include 
> > >  #include 
> > >  #define CREATE_TRACE_POINTS
> > >  #include 
> > > @@ -80,6 +82,10 @@ EXPORT_SYMBOL(empty_aops);
> > >  
> > >  static DEFINE_PER_CPU(unsigned long, nr_inodes);
> > >  static DEFINE_PER_CPU(unsigned long, nr_unused);
> > > +static DEFINE_PER_CPU(unsigned long, mg_ctime_updates);
> > > +static DEFINE_PER_CPU(unsigned long, mg_fine_stamps);
> > > +static DEFINE_PER_CPU(unsigned long, mg_floor_swaps);
> > > +static DEFINE_PER_CPU(unsigned long, mg_ctime_swaps);
> > 
> > Should this all get switched off if CONFIG_DEBUG_FS=n?
> > 
> > --D
> > 
> 
> Sure, why not. That's simple enough to do.
> 
> I pushed an updated mgtime branch to my git tree. Here's the updated
> patch that's the only difference:
> 
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=mgtime&id=ee7fe6e9c0598754861c8620230f15f3de538ca5
> 
> Seems to build OK both with and without CONFIG_DEBUG_FS.

LGTM,
Reviewed-by: Darrick J. Wong 

Thank you for your work on all this multigrain stuff. :)

--D

>  
> > >  
> > >  static struct kmem_cache *inode_cachep __ro_after_init;
> > >  
> > > @@ -101,6 +107,42 @@ static inline long get_nr_inodes_unused(void)
> > >   return sum < 0 ? 0 : sum;
> > >  }
> > >  
> > > +static long get_mg_ctime_updates(void)
> > > +{
> > > + int i;
> > > + long sum = 0;
> > > + for_each_possible_cpu(i)
> > > + sum += per_cpu(mg_ctime_updates, i);
> > > + return sum < 0 ? 0 : sum;
> > > +}
> > > +
> > > +static long get_mg_fine_stamps(void)
> > > +{
> > > + int i;
> > > + long sum = 0;
> > > + for_each_possible_cpu(i)
> > > + sum += per_cpu(mg_fine_stamps, i);
> > > + return sum < 0 ? 0 : sum;
> > > +}
> > > +
> > > +static long get_mg_floor_swaps(void)
> > > +{
> > > + int i;
> > > + long sum = 0;
> > > + for_each_possible_cpu(i)
> > > + sum += per_cpu(mg_floor_swaps, i);
> > > + return sum < 0 ? 0 : sum;
> > > +}
> > > +
> > > +static long get_mg_ctime_swaps(void)
> > > +{
> > > + int i;
> > > + long sum = 0;
> > > + for_each_possible_cpu(i)
> > > + sum += per_cpu(mg_ctime_swaps, i);
> > > + return sum < 0 ? 0 : sum;
> > > +}
> > > +
> > >  long get_nr_dirty_inodes(void)
> > >  {
> > >   /* not actually dirty inodes, but a wild approximation */
> > > @@ -2655,6 +2697,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
> > >  
> > >   /* Get a fine-grained time */
> > >   fine = ktime_get();
> > > + this_cpu_inc(mg_fine_stamps);
> > >  
> > >   /*
> > >    * If the cmpxchg works, we take the new floor value. If
> > > @@ -2663,11 +2706,14 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
> > >    * as good, so keep it.
> > >    */
> > >   old = floor;
> > > - if (!atomic64_try_cmpxchg(&ctime_floor, &old, fine))
> > > + if (atomic64_try_cmpxchg(&ctime_floor,
&

Re: [PATCH v6 6/9] xfs: switch to multigrain timestamps

2024-07-15 Thread Darrick J. Wong

On Mon, Jul 15, 2024 at 08:48:57AM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> always set.
> 
> Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> should give us better semantics now.
> 
> Reviewed-by: Josef Bacik 
> Signed-off-by: Jeff Layton 

Looks good,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
>  fs/xfs/xfs_iops.c   | 10 +++---
>  fs/xfs/xfs_super.c  |  2 +-
>  3 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 69fc5b981352..1f3639bbf5f0 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + ASSERT(flags & XFS_ICHGTIME_CHG);
>  
> + tv = inode_set_ctime_current(inode);
>   if (flags & XFS_ICHGTIME_MOD)
>   inode_set_mtime_to_ts(inode, tv);
> - if (flags & XFS_ICHGTIME_CHG)
> - inode_set_ctime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CREATE)
>   ip->i_crtime = tv;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a00dcbc77e12..d25872f818fa 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -592,8 +592,9 @@ xfs_vn_getattr(
>   stat->gid = vfsgid_into_kgid(vfsgid);
>   stat->ino = ip->i_ino;
>   stat->atime = inode_get_atime(inode);
> - stat->mtime = inode_get_mtime(inode);
> - stat->ctime = inode_get_ctime(inode);
> +
> + fill_mg_cmtime(stat, request_mask, inode);
> +
>   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
>  
>   if (xfs_has_v3inodes(mp)) {
> @@ -603,11 +604,6 @@ xfs_vn_getattr(
>   }
>   }
>  
> - if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> - stat->change_cookie = inode_query_iversion(inode);
> - stat->result_mask |= STATX_CHANGE_COOKIE;
> - }
> -
>   /*
>* Note: If you add another clause to set an attribute flag, please
>* update attributes_mask below.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 27e9f749c4c7..210481b03fdb 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
>   .init_fs_context= xfs_init_fs_context,
>   .parameters = xfs_fs_parameters,
>   .kill_sb= xfs_kill_sb,
> - .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> + .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
>  };
>  MODULE_ALIAS_FS("xfs");
>  
> 
> -- 
> 2.45.2
> 
>
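
A rough userspace model of what the xfs_trans_ichgtime hunk above ends up
doing, as a sketch only: the flag values, struct model_times and
model_ichgtime() are made up for illustration, and "now" stands in for
whatever inode_set_ctime_current() would return.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag values for the model; the real XFS_ICHGTIME_* values may differ. */
#define ICHGTIME_MOD    0x1
#define ICHGTIME_CHG    0x2
#define ICHGTIME_CREATE 0x4

struct model_times {
	int64_t mtime, ctime, crtime;
};

/* "now" stands in for the value inode_set_ctime_current() would return. */
static void model_ichgtime(struct model_times *t, int flags, int64_t now)
{
	/* Every caller must ask for a ctime change. */
	assert(flags & ICHGTIME_CHG);

	t->ctime = now;			/* unconditional */
	if (flags & ICHGTIME_MOD)
		t->mtime = now;		/* mtime never moves without ctime */
	if (flags & ICHGTIME_CREATE)
		t->crtime = now;
}
```

The point of the shape is that mtime and crtime can only ever pick up the
same value the ctime just received, so an mtime change without a ctime
change is not expressible.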

Re: [PATCH v6 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-15 Thread Darrick J. Wong

On Mon, Jul 15, 2024 at 08:48:56AM -0400, Jeff Layton wrote:
> Add a high-level document that describes how multigrain timestamps work,
> rationale for them, and some info about implementation and tradeoffs.
> 
> Reviewed-by: Josef Bacik 
> Signed-off-by: Jeff Layton 

Seems fine to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  Documentation/filesystems/multigrain-ts.rst | 120 
> 
>  1 file changed, 120 insertions(+)
> 
> diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
> new file mode 100644
> index ..5cefc204ecec
> --- /dev/null
> +++ b/Documentation/filesystems/multigrain-ts.rst
> @@ -0,0 +1,120 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigrain Timestamps
> +=====================
> +
> +Introduction
> +============
> +Historically, the kernel has always used coarse time values to stamp
> +inodes. This value is updated on every jiffy, so any change that happens
> +within that jiffy will end up with the same timestamp.
> +
> +When the kernel goes to stamp an inode (due to a read or write), it first gets
> +the current time and then compares it to the existing timestamp(s) to see
> +whether anything will change. If nothing changed, then it can avoid updating
> +the inode's metadata.
> +
> +Coarse timestamps are therefore good from a performance standpoint, since they
> +reduce the need for metadata updates, but bad from the standpoint of
> +determining whether anything has changed, since a lot of things can happen in a
> +jiffy.
> +
> +They are particularly troublesome with NFSv3, where unchanging timestamps can
> +make it difficult to tell whether to invalidate caches. NFSv4 provides a
> +dedicated change attribute that should always show a visible change, but not
> +all filesystems implement this properly, causing the NFS server to substitute
> +the ctime in many cases.
> +
> +Multigrain timestamps aim to remedy this by selectively using fine-grained
> +timestamps when a file has had its timestamps queried recently, and the current
> +coarse-grained time does not cause a change.
> +
> +Inode Timestamps
> +================
> +There are currently 3 timestamps in the inode that are updated to the current
> +wallclock time on different activity:
> +
> +ctime:
> +  The inode change time. This is stamped with the current time whenever
> +  the inode's metadata is changed. Note that this value is not settable
> +  from userland.
> +
> +mtime:
> +  The inode modification time. This is stamped with the current time
> +  any time a file's contents change.
> +
> +atime:
> +  The inode access time. This is stamped whenever an inode's contents are
> +  read. Widely considered to be a terrible mistake. Usually avoided with
> +  options like noatime or relatime.
> +
> +Updating the mtime always implies a change to the ctime, but updating the
> +atime due to a read request does not.
> +
> +Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
> +not affected and always use the coarse-grained value (subject to the floor).
> +
> +Inode Timestamp Ordering
> +========================
> +
> +In addition to just providing info about changes to individual files, file
> +timestamps also serve an important purpose in applications like "make". These
> +programs measure timestamps in order to determine whether source files might be
> +newer than cached objects.
> +
> +Userland applications like make can only determine ordering based on
> +operational boundaries. For a syscall those are the syscall entry and exit
> +points. For io_uring or nfsd operations, that's the request submission and
> +response. In the case of concurrent operations, userland can make no
> +determination about the order in which things will occur.
> +
> +For instance, if a single thread modifies one file, and then another file in
> +sequence, the second file must show an equal or later mtime than the first. The
> +same is true if two threads are issuing similar operations that do not overlap
> +in time.
> +
> +If however, two threads have racing syscalls that overlap in time, then there
> +is no such guarantee, and the second file may appear to have been modified
> +before, after or at the same time as the first, regardless of which one was
> +submitted first.
> +
> +Multigrain Timestamps
> +=====================
> +Multigrain timestamps are aimed at ensuring that changes to a single file are
> +always recognizable, without violating the ordering guarantees when multiple
> +different files are modified. This

Re: [PATCH v6 3/9] fs: add percpu counters for significant multigrain timestamp events

2024-07-15 Thread Darrick J. Wong

On Mon, Jul 15, 2024 at 08:48:54AM -0400, Jeff Layton wrote:
> Four percpu counters for counting various stats around mgtimes, and a
> new debugfs file for displaying them:
> 
> - number of attempted ctime updates
> - number of successful i_ctime_nsec swaps
> - number of fine-grained timestamp fetches
> - number of floor value swaps
> 
> Reviewed-by: Josef Bacik 
> Signed-off-by: Jeff Layton 
> ---
>  fs/inode.c | 70 
> +-
>  1 file changed, 69 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 869994285e87..fff844345c35 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -21,6 +21,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include 
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -80,6 +82,10 @@ EXPORT_SYMBOL(empty_aops);
>  
>  static DEFINE_PER_CPU(unsigned long, nr_inodes);
>  static DEFINE_PER_CPU(unsigned long, nr_unused);
> +static DEFINE_PER_CPU(unsigned long, mg_ctime_updates);
> +static DEFINE_PER_CPU(unsigned long, mg_fine_stamps);
> +static DEFINE_PER_CPU(unsigned long, mg_floor_swaps);
> +static DEFINE_PER_CPU(unsigned long, mg_ctime_swaps);

Should this all get switched off if CONFIG_DEBUG_FS=n?

--D

>  
>  static struct kmem_cache *inode_cachep __ro_after_init;
>  
> @@ -101,6 +107,42 @@ static inline long get_nr_inodes_unused(void)
>   return sum < 0 ? 0 : sum;
>  }
>  
> +static long get_mg_ctime_updates(void)
> +{
> + int i;
> + long sum = 0;
> + for_each_possible_cpu(i)
> + sum += per_cpu(mg_ctime_updates, i);
> + return sum < 0 ? 0 : sum;
> +}
> +
> +static long get_mg_fine_stamps(void)
> +{
> + int i;
> + long sum = 0;
> + for_each_possible_cpu(i)
> + sum += per_cpu(mg_fine_stamps, i);
> + return sum < 0 ? 0 : sum;
> +}
> +
> +static long get_mg_floor_swaps(void)
> +{
> + int i;
> + long sum = 0;
> + for_each_possible_cpu(i)
> + sum += per_cpu(mg_floor_swaps, i);
> + return sum < 0 ? 0 : sum;
> +}
> +
> +static long get_mg_ctime_swaps(void)
> +{
> + int i;
> + long sum = 0;
> + for_each_possible_cpu(i)
> + sum += per_cpu(mg_ctime_swaps, i);
> + return sum < 0 ? 0 : sum;
> +}
> +
>  long get_nr_dirty_inodes(void)
>  {
>   /* not actually dirty inodes, but a wild approximation */
> @@ -2655,6 +2697,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>  
>   /* Get a fine-grained time */
>   fine = ktime_get();
> + this_cpu_inc(mg_fine_stamps);
>  
>   /*
>* If the cmpxchg works, we take the new floor value. If
> @@ -2663,11 +2706,14 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>* as good, so keep it.
>*/
>   old = floor;
> - if (!atomic64_try_cmpxchg(&ctime_floor, &old, fine))
> + if (atomic64_try_cmpxchg(&ctime_floor, &old, fine))
> + this_cpu_inc(mg_floor_swaps);
> + else
>   fine = old;
>   now = ktime_mono_to_real(fine);
>   }
>   }
> + this_cpu_inc(mg_ctime_updates);
>   now_ts = timestamp_truncate(ktime_to_timespec64(now), inode);
>   cur = cns;
>  
> @@ -2682,6 +2728,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>   /* If swap occurred, then we're (mostly) done */
>   inode->i_ctime_sec = now_ts.tv_sec;
>   trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
> + this_cpu_inc(mg_ctime_swaps);
>   } else {
>   /*
>* Was the change due to someone marking the old ctime QUERIED?
> @@ -2751,3 +2798,24 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
>   return mode & ~S_ISGID;
>  }
>  EXPORT_SYMBOL(mode_strip_sgid);
> +
> +static int mgts_show(struct seq_file *s, void *p)
> +{
> + long ctime_updates = get_mg_ctime_updates();
> + long ctime_swaps = get_mg_ctime_swaps();
> + long fine_stamps = get_mg_fine_stamps();
> + long floor_swaps = get_mg_floor_swaps();
> +
> + seq_printf(s, "%lu %lu %lu %lu\n",
> +ctime_updates, ctime_swaps, fine_stamps, floor_swaps);
> + return 0;
> +}
> +
> +DEFINE_SHOW_ATTRIBUTE(mgts);
> +
> +static int __init mg_debugfs_init(void)
> +{
> + debugfs_create_file("multigrain_timestamps", S_IFREG | S_IRUGO, NULL, NULL, &mgts_fops);
> + return 0;
> +}
> +late_initcall(mg_debugfs_init);
> 
> -- 
> 2.45.2
> 
>
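
For the CONFIG_DEBUG_FS=n question raised above, one possible shape is
sketched below as a plain userspace model. Everything here is an
assumption for illustration: NR_CPUS is a fixed made-up count,
MODEL_DEBUG_FS stands in for the real config symbol, and
mgtime_counter_inc()/mgtime_counter_sum() are hypothetical names, not what
the updated patch actually uses.

```c
#define NR_CPUS 4		/* hypothetical fixed CPU count for the model */
#define MODEL_DEBUG_FS 1	/* stand-in for CONFIG_DEBUG_FS; 0 compiles the counters out */

#if MODEL_DEBUG_FS
/* One slot per CPU, so increments never contend on a shared counter. */
static unsigned long mg_fine_stamps[NR_CPUS];

static inline void mgtime_counter_inc(unsigned long *counter, int cpu)
{
	counter[cpu]++;
}

/* Reader side: walk every slot and add them up, as the get_mg_*() helpers do. */
static long mgtime_counter_sum(const unsigned long *counter)
{
	long sum = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += counter[cpu];
	return sum < 0 ? 0 : sum;
}
#else
/* With the option off, the hot-path increment compiles to nothing. */
#define mgtime_counter_inc(counter, cpu) do { } while (0)
#endif
```

With the option off, neither the per-CPU storage nor the summing helpers
exist, so the only cost left in the timestamp path is an empty statement.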

Re: [PATCH v6 2/9] fs: tracepoints around multigrain timestamp events

2024-07-15 Thread Darrick J. Wong

On Mon, Jul 15, 2024 at 08:48:53AM -0400, Jeff Layton wrote:
> Add some tracepoints around various multigrain timestamp events.
> 
> Reviewed-by: Josef Bacik 
> Signed-off-by: Jeff Layton 

Woot!
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/inode.c   |   9 ++-
>  fs/stat.c|   3 +
>  include/trace/events/timestamp.h | 124 
> +++
>  3 files changed, 135 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 417acbeabef3..869994285e87 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -22,6 +22,9 @@
>  #include 
>  #include 
>  #include 
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  #include "internal.h"
>  
>  /*
> @@ -2569,6 +2572,7 @@ EXPORT_SYMBOL(inode_nohighmem);
>  
>  struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 ts)
>  {
> + trace_inode_set_ctime_to_ts(inode, &ts);
>   set_normalized_timespec64(&ts, ts.tv_sec, ts.tv_nsec);
>   inode->i_ctime_sec = ts.tv_sec;
>   inode->i_ctime_nsec = ts.tv_nsec;
> @@ -2668,13 +2672,16 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>   cur = cns;
>  
>   /* No need to cmpxchg if it's exactly the same */
> - if (cns == now_ts.tv_nsec && inode->i_ctime_sec == now_ts.tv_sec)
> + if (cns == now_ts.tv_nsec && inode->i_ctime_sec == now_ts.tv_sec) {
> + trace_ctime_xchg_skip(inode, &now_ts);
>   goto out;
> + }
>  retry:
>   /* Try to swap the nsec value into place. */
>   if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now_ts.tv_nsec)) {
>   /* If swap occurred, then we're (mostly) done */
>   inode->i_ctime_sec = now_ts.tv_sec;
> + trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
>   } else {
>   /*
>* Was the change due to someone marking the old ctime QUERIED?
> diff --git a/fs/stat.c b/fs/stat.c
> index df7fdd3afed9..552dfd67688b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -23,6 +23,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #include "internal.h"
>  #include "mount.h"
>  
> @@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
>   stat->mtime = inode_get_mtime(inode);
>   stat->ctime.tv_sec = inode->i_ctime_sec;
>   stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & ~I_CTIME_QUERIED;
> + trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
>  }
>  EXPORT_SYMBOL(fill_mg_cmtime);
>  
> diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
> new file mode 100644
> index ..c9e5ec930054
> --- /dev/null
> +++ b/include/trace/events/timestamp.h
> @@ -0,0 +1,124 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM timestamp
> +
> +#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_TIMESTAMP_H
> +
> +#include 
> +#include 
> +
> +#define CTIME_QUERIED_FLAGS \
> + { I_CTIME_QUERIED, "Q" }
> +
> +DECLARE_EVENT_CLASS(ctime,
> + TP_PROTO(struct inode *inode,
> +  struct timespec64 *ctime),
> +
> + TP_ARGS(inode, ctime),
> +
> + TP_STRUCT__entry(
> + __field(dev_t,  dev)
> + __field(ino_t,  ino)
> + __field(time64_t,   ctime_s)
> + __field(u32,ctime_ns)
> + __field(u32,gen)
> + ),
> +
> + TP_fast_assign(
> + __entry->dev= inode->i_sb->s_dev;
> + __entry->ino= inode->i_ino;
> + __entry->gen= inode->i_generation;
> + __entry->ctime_s= ctime->tv_sec;
> + __entry->ctime_ns   = ctime->tv_nsec;
> + ),
> +
> + TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
> + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
> + __entry->ctime_s, __entry->ctime_ns
> + )
> +);
> +
> +DEFINE_EVENT(ctime, inode_set_ctime_to_ts,
> + TP_PROTO(struct inode *inode,
> +  struct timespec64 *ctime),
> + TP_ARGS(inode, ctime));
> +
> +DEFINE_EVENT(ctime, ctime_xchg_skip,
> + TP_PROTO(struct inode *inode,
> +  struct timespec64 *ctime),
> +

Re: [PATCH v5 6/9] xfs: switch to multigrain timestamps

2024-07-11 Thread Darrick J. Wong

On Thu, Jul 11, 2024 at 11:58:59AM -0400, Jeff Layton wrote:
> On Thu, 2024-07-11 at 08:09 -0700, Darrick J. Wong wrote:
> > On Thu, Jul 11, 2024 at 07:08:10AM -0400, Jeff Layton wrote:
> > > Enable multigrain timestamps, which should ensure that there is an
> > > apparent change to the timestamp whenever it has been written after
> > > being actively observed via getattr.
> > > 
> > > Also, anytime the mtime changes, the ctime must also change, and those
> > > are now the only two options for xfs_trans_ichgtime. Have that function
> > > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > > always set.
> > > 
> > > Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> > > should give us better semantics now.
> > 
> > Following up on "As long as the fs isn't touching i_ctime_nsec directly,
> > you shouldn't need to worry about this" from:
> > https://lore.kernel.org/linux-xfs/cae5c28f172ac57b7eaaa98a00b23f342f01ba64.ca...@kernel.org/
> > 
> > xfs /does/ touch i_ctime_nsec directly when it's writing inodes to disk.
> > From xfs_inode_to_disk, see:
> > 
> > to->di_ctime = xfs_inode_to_disk_ts(ip, inode_get_ctime(inode));
> > 
> > AFAICT, inode_get_ctime itself remains unchanged, and still returns
> > inode->__i_ctime, right?  In which case it's returning a raw timespec64,
> > which can include the QUERIED flag in tv_nsec, right?
> > 
> 
> No, in the first patch in the series, inode_get_ctime becomes this:
> 
> #define I_CTIME_QUERIED ((u32)BIT(31))
> 
> static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> {
> return inode->i_ctime_sec;
> }
> 
> static inline long inode_get_ctime_nsec(const struct inode *inode)
> {
> return inode->i_ctime_nsec & ~I_CTIME_QUERIED;
> }
> 
> static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> {
> struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
>  .tv_nsec = inode_get_ctime_nsec(inode) };
> 
> return ts;
> }

Doh!  I forgot that this has already been soaking in the vfs tree:
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/include/linux/fs.h?h=next-20240711&id=3aa63a569c64e708df547a8913c84e64a06e7853

> ...which should ensure that you never store the QUERIED bit.

So yep, we're fine here.  Sorry about the noise; this was the very
subtle clue in the diff that the change had already been applied:

 static inline struct timespec64 inode_get_ctime(const struct inode *inode)
@@ -1626,13 +1637,7 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode)
return ts;
 }

(Doh doh doh doh doh...)

> > Now let's look at the consumer:
> > 
> > static inline xfs_timestamp_t
> > xfs_inode_to_disk_ts(
> > struct xfs_inode*ip,
> > const struct timespec64 tv)
> > {
> > struct xfs_legacy_timestamp *lts;
> > xfs_timestamp_t ts;
> > 
> > if (xfs_inode_has_bigtime(ip))
> > return cpu_to_be64(xfs_inode_encode_bigtime(tv));
> > 
> > lts = (struct xfs_legacy_timestamp *)&ts;
> > lts->t_sec = cpu_to_be32(tv.tv_sec);
> > lts->t_nsec = cpu_to_be32(tv.tv_nsec);
> > 
> > return ts;
> > }
> > 
> > For the !bigtime case (aka before we added y2038 support) the queried
> > flag gets encoded into the tv_nsec field since xfs doesn't filter the
> > queried flag.
> > 
> > For the bigtime case, the timespec is turned into an absolute nsec count
> > since the xfs epoch (which is the minimum timestamp possible under the
> > old encoding scheme):
> > 
> > static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
> > {
> > return xfs_unix_to_bigtime(tv.tv_sec) * NSEC_PER_SEC + tv.tv_nsec;
> > }
> > 
> > Here we'd also be mixing in the QUERIED flag, only now we've encoded a
> > time that's a second in the future.  I think the solution is to add a:
> > 
> > static inline struct timespec64
> > inode_peek_ctime(const struct inode *inode)
> > {
> > return (struct timespec64){
> > .tv_sec = inode->__i_ctime.tv_sec,
> > .tv_nsec = inode->__i_ctime.tv_nsec & ~I_CTIME_QUERIED,
> > };
> > }
> > 
> > similar to what inode_peek_iversion does for iversion; and then
> > xfs_inode_to_disk can do:
> > 
> > to->di_ct
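
To make the concern in this subthread concrete, here is a small userspace
sketch of why the flag has to be masked before a timestamp is encoded.
It is a simplified model under stated assumptions: ctime_nsec_masked() and
encode_abs_nsec() are hypothetical helpers, and the encoding leaves out the
epoch shift that the real xfs_inode_encode_bigtime() applies.

```c
#include <stdint.h>

#define NSEC_PER_SEC    1000000000ULL
#define I_CTIME_QUERIED (1U << 31)	/* same bit as ((u32)BIT(31)) in the quoted patch */

/* Strip the flag before the value leaves the VFS, as inode_get_ctime_nsec() does. */
static inline uint32_t ctime_nsec_masked(uint32_t raw_nsec)
{
	return raw_nsec & ~I_CTIME_QUERIED;
}

/*
 * Simplified stand-in for an "absolute nanosecond count" encoding.
 * The epoch shift done by the real bigtime encoding is omitted here.
 */
static inline uint64_t encode_abs_nsec(uint64_t sec, uint32_t nsec)
{
	return sec * NSEC_PER_SEC + nsec;
}
```

In this model a leaked flag would push the encoded value forward by 2^31
nanoseconds, while the masked value always stays below NSEC_PER_SEC, which
is why going through the accessor is enough.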

Re: [PATCH v5 5/9] Documentation: add a new file documenting multigrain timestamps

2024-07-11 Thread Darrick J. Wong

On Thu, Jul 11, 2024 at 07:08:09AM -0400, Jeff Layton wrote:
> Add a high-level document that describes how multigrain timestamps work,
> rationale for them, and some info about implementation and tradeoffs.
> 
> Signed-off-by: Jeff Layton 
> ---
>  Documentation/filesystems/multigrain-ts.rst | 120 
> 
>  1 file changed, 120 insertions(+)
> 
> diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
> new file mode 100644
> index ..5cefc204ecec
> --- /dev/null
> +++ b/Documentation/filesystems/multigrain-ts.rst
> @@ -0,0 +1,120 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Multigrain Timestamps
> +=====================
> +
> +Introduction
> +============
> +Historically, the kernel has always used coarse time values to stamp
> +inodes. This value is updated on every jiffy, so any change that happens
> +within that jiffy will end up with the same timestamp.
> +
> +When the kernel goes to stamp an inode (due to a read or write), it first gets
> +the current time and then compares it to the existing timestamp(s) to see
> +whether anything will change. If nothing changed, then it can avoid updating
> +the inode's metadata.
> +
> +Coarse timestamps are therefore good from a performance standpoint, since they
> +reduce the need for metadata updates, but bad from the standpoint of
> +determining whether anything has changed, since a lot of things can happen in a
> +jiffy.
> +
> +They are particularly troublesome with NFSv3, where unchanging timestamps can
> +make it difficult to tell whether to invalidate caches. NFSv4 provides a
> +dedicated change attribute that should always show a visible change, but not
> +all filesystems implement this properly, causing the NFS server to substitute
> +the ctime in many cases.
> +
> +Multigrain timestamps aim to remedy this by selectively using fine-grained
> +timestamps when a file has had its timestamps queried recently, and the current
> +coarse-grained time does not cause a change.
> +
> +Inode Timestamps
> +================
> +There are currently 3 timestamps in the inode that are updated to the current
> +wallclock time on different activity:
> +
> +ctime:
> +  The inode change time. This is stamped with the current time whenever
> +  the inode's metadata is changed. Note that this value is not settable
> +  from userland.
> +
> +mtime:
> +  The inode modification time. This is stamped with the current time
> +  any time a file's contents change.
> +
> +atime:
> +  The inode access time. This is stamped whenever an inode's contents are
> +  read. Widely considered to be a terrible mistake. Usually avoided with
> +  options like noatime or relatime.

And for btime/crtime (aka creation time) a filesystem can take the
coarse timestamp, right?  It's not settable by userspace, and I think
statx is the only way those are ever exposed.  QUERIED is never set when
the file is being created.

> +Updating the mtime always implies a change to the ctime, but updating the
> +atime due to a read request does not.
> +
> +Multigrain timestamps are only tracked for the ctime and the mtime. atimes are
> +not affected and always use the coarse-grained value (subject to the floor).

Is it ok if an atime update uses the same timespec as was used for a
ctime update?  There's a pending update for 6.11 that changes
xfs_trans_ichgtime to do:

tv = current_time(inode);

if (flags & XFS_ICHGTIME_MOD)
inode_set_mtime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CHG)
inode_set_ctime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_ACCESS)
inode_set_atime_to_ts(inode, tv);
if (flags & XFS_ICHGTIME_CREATE)
ip->i_crtime = tv;

So I guess xfs could do something like this to set @tv:

if (flags & XFS_ICHGTIME_CHG)
tv = inode_set_ctime_current(inode);
else
tv = current_time(inode);
...
if (flags & XFS_ICHGTIME_ACCESS)
inode_set_atime_to_ts(inode, tv);

Thoughts?

> +Inode Timestamp Ordering
> +========================
> +
> +In addition to just providing info about changes to individual files, file
> +timestamps also serve an important purpose in applications like "make". These
> +programs measure timestamps in order to determine whether source files might be
> +newer than cached objects.
> +
> +Userland applications like make can only determine ordering based on
> +operational boundaries. For a syscall those are the syscall entry and exit
> +points. For io_uring or nfsd operations, that's the request submission and
> +response. In the case of concurrent operations, userland can make no
> +determination about the order in which things will occur.
> +
> +For instance, if a single thread modifies one file, and then another file in
> +sequence, the second file must show an equal or later mtime than the first. 
> Th

Re: [PATCH v5 1/9] fs: add infrastructure for multigrain timestamps

2024-07-11 Thread Darrick J. Wong

On Thu, Jul 11, 2024 at 07:08:05AM -0400, Jeff Layton wrote:
> The VFS has always used coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot of metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide when to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g. backup
> applications).
> 
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
> 
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> as a flag that indicates whether the current timestamps have been
> queried via stat() or the like. When it's set, we allow the kernel to
> use a fine-grained timestamp iff it's necessary to make the ctime show
> a different value.
> 
> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp. A file that is
> altered just a bit later can then get a coarse-grained one that appears
> older than the earlier fine-grained time. This violates timestamp
> ordering guarantees.
> 
> To remedy this, keep a global monotonic ktime_t value that acts as a

It's an atomic64_t now, right?

> timestamp floor.  When we go to stamp a file, we first get the later of
> the current floor value and the current coarse-grained time. If the
> inode ctime hasn't been queried then we just attempt to stamp it with
> that value.
> 
> If it has been queried, then first see whether the current coarse time
> is later than the existing ctime. If it is, then we accept that value.
> If it isn't, then we get a fine-grained time and try to swap that into
> the global floor. Whether that succeeds or fails, we take the resulting
> floor time, convert it to realtime and try to swap that into the ctime.
> 
> We take the result of the ctime swap whether it succeeds or fails, since
> either is just as valid.
> 
> Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> Others should be unaffected (other than being subject to the same floor
> value as multigrain filesystems).
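
The floor scheme described in the commit message above can be sketched as
a small userspace model. This is only an illustration under loud
assumptions: a timestamp is a single nanosecond count, the QUERIED flag
lives in bit 63 of that count (the patch uses bit 31 of i_ctime_nsec), the
monotonic-to-realtime conversion and the retry path are left out, and
struct model_inode, model_getattr() and model_set_ctime_current() are
hypothetical names. The two clock readings are passed in so the model is
deterministic.

```c
#include <stdatomic.h>
#include <stdint.h>

#define QUERIED (UINT64_C(1) << 63)	/* model-only flag bit */

static _Atomic uint64_t ctime_floor;	/* latest fine-grained time handed out */

struct model_inode {
	_Atomic uint64_t ctime;		/* nanoseconds, possibly with QUERIED set */
};

/* What a getattr does in this model: report the ctime and mark it observed. */
static uint64_t model_getattr(struct model_inode *inode)
{
	return atomic_fetch_or(&inode->ctime, QUERIED) & ~QUERIED;
}

/* Stamp the inode; "coarse" and "fine" are the caller's two clock readings. */
static uint64_t model_set_ctime_current(struct model_inode *inode,
					uint64_t coarse, uint64_t fine)
{
	uint64_t floor = atomic_load(&ctime_floor);
	uint64_t now = floor > coarse ? floor : coarse;	/* later of the two */
	uint64_t cur = atomic_load(&inode->ctime);

	if ((cur & QUERIED) && now <= (cur & ~QUERIED)) {
		/* Observed, and the coarse value would not show a change. */
		uint64_t old = floor;

		if (!atomic_compare_exchange_strong(&ctime_floor, &old, fine))
			fine = old;	/* someone else moved the floor; theirs is as good */
		now = fine;
	}

	/* Swap in the new value; if we lose the race, the winner's value stands. */
	if (atomic_compare_exchange_strong(&inode->ctime, &cur, now))
		return now;
	return cur & ~QUERIED;
}
```

In this model, once one inode has been handed a fine-grained value, any
other inode stamped afterwards gets at least the floor, so a later change
cannot appear older than an earlier one even if the coarse clock has not
ticked yet.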
> 
> Signed-off-by: Jeff Layton 

With that corrected,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/inode.c | 171 
> -
>  fs/stat.c  |  36 ++-
>  include/linux/fs.h |  34 ---
>  3 files changed, 204 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..2b5889ff7b36 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -60,6 +60,13 @@ static unsigned int i_hash_shift __ro_after_init;
>  static struct hlist_head *inode_hashtable __ro_after_init;
>  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>  
> +/*
> + * This represents the latest fine-grained time that we have handed out as a
> + * timestamp on the system. Tracked as a monotonic value, and converted to the
> + * realtime clock on an as-needed basis.
> + */
> +static __cacheline_aligned_in_smp atomic64_t ctime_floor;
> +
>  /*
>   * Empty aops. Can be used for the cases where the user does not
>   * define any of the address_space operations.
> @@ -2127,19 +2134,72 @@ int file_remove_privs(struct file *file)
>  }
>  EXPORT_SYMBOL(file_remove_privs);
>  
> +/**
> + * coarse_ctime - return the current coarse-grained time
> + * @floor: current (monotonic) ctime_floor value
> + *
> + * Get the coarse-grained time, and then determine whether to
> + * return it or the current floor value. Returns the later of the
> + * floor and coarse grained timestamps, converted to realtime
> + * clock value.
> + */
> +static ktime_t coarse_ctime(ktime_t floor)
> +{
> + ktime_t coarse = ktime_get_coarse();
> +
> + /* If coarse time is already newer, return that */
> + if (!ktime_after(floor, coarse))
> + return ktime_get_coarse_real();
> + return ktime_mono_to_real(floor);
> +}
> +
> +/**
> +

Re: [PATCH v5 4/9] fs: have setattr_copy handle multigrain timestamps appropriately

2024-07-11 Thread Darrick J. Wong

On Thu, Jul 11, 2024 at 07:08:08AM -0400, Jeff Layton wrote:
> The setattr codepath is still using coarse-grained timestamps, even on
> multigrain filesystems. To fix this, we need to fetch the timestamp for
> ctime updates later, at the point where the assignment occurs in
> setattr_copy.
> 
> On a multigrain inode, ignore the ia_ctime in the attrs, and always
> update the ctime to the current clock value. Update the atime and mtime
> with the same value (if needed) unless they are being set to other
> specific values, a'la utimes().
> 
> Note that we don't want to do this universally however, as some
> filesystems (e.g. most networked fs) want to do an explicit update
> elsewhere before updating the local inode.
> 
> Signed-off-by: Jeff Layton 

Makes sense to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/attr.c | 52 ++--
>  1 file changed, 46 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/attr.c b/fs/attr.c
> index 825007d5cda4..e03ea6951864 100644
> --- a/fs/attr.c
> +++ b/fs/attr.c
> @@ -271,6 +271,42 @@ int inode_newsize_ok(const struct inode *inode, loff_t offset)
>  }
>  EXPORT_SYMBOL(inode_newsize_ok);
>  
> +/**
> + * setattr_copy_mgtime - update timestamps for mgtime inodes
> + * @inode: inode timestamps to be updated
> + * @attr: attrs for the update
> + *
> + * With multigrain timestamps, we need to take more care to prevent races
> + * when updating the ctime. Always update the ctime to the very latest
> + * using the standard mechanism, and use that to populate the atime and
> + * mtime appropriately (unless we're setting those to specific values).
> + */
> +static void setattr_copy_mgtime(struct inode *inode, const struct iattr *attr)
> +{
> + unsigned int ia_valid = attr->ia_valid;
> + struct timespec64 now;
> +
> + /*
> +  * If the ctime isn't being updated then nothing else should be
> +  * either.
> +  */
> + if (!(ia_valid & ATTR_CTIME)) {
> + WARN_ON_ONCE(ia_valid & (ATTR_ATIME|ATTR_MTIME));
> + return;
> + }
> +
> + now = inode_set_ctime_current(inode);
> + if (ia_valid & ATTR_ATIME_SET)
> + inode_set_atime_to_ts(inode, attr->ia_atime);
> + else if (ia_valid & ATTR_ATIME)
> + inode_set_atime_to_ts(inode, now);
> +
> + if (ia_valid & ATTR_MTIME_SET)
> + inode_set_mtime_to_ts(inode, attr->ia_mtime);
> + else if (ia_valid & ATTR_MTIME)
> + inode_set_mtime_to_ts(inode, now);
> +}
> +
>  /**
>   * setattr_copy - copy simple metadata updates into the generic inode
>   * @idmap:   idmap of the mount the inode was found from
> @@ -303,12 +339,6 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
>  
>   i_uid_update(idmap, attr, inode);
>   i_gid_update(idmap, attr, inode);
> - if (ia_valid & ATTR_ATIME)
> - inode_set_atime_to_ts(inode, attr->ia_atime);
> - if (ia_valid & ATTR_MTIME)
> - inode_set_mtime_to_ts(inode, attr->ia_mtime);
> - if (ia_valid & ATTR_CTIME)
> - inode_set_ctime_to_ts(inode, attr->ia_ctime);
>   if (ia_valid & ATTR_MODE) {
>   umode_t mode = attr->ia_mode;
>   if (!in_group_or_capable(idmap, inode,
> @@ -316,6 +346,16 @@ void setattr_copy(struct mnt_idmap *idmap, struct inode *inode,
>   mode &= ~S_ISGID;
>   inode->i_mode = mode;
>   }
> +
> + if (is_mgtime(inode))
> + return setattr_copy_mgtime(inode, attr);
> +
> + if (ia_valid & ATTR_ATIME)
> + inode_set_atime_to_ts(inode, attr->ia_atime);
> + if (ia_valid & ATTR_MTIME)
> + inode_set_mtime_to_ts(inode, attr->ia_mtime);
> + if (ia_valid & ATTR_CTIME)
> + inode_set_ctime_to_ts(inode, attr->ia_ctime);
>  }
>  EXPORT_SYMBOL(setattr_copy);
>  
> 
> -- 
> 2.45.2
> 
>

Re: [PATCH v5 2/9] fs: tracepoints around multigrain timestamp events

2024-07-11 Thread Darrick J. Wong

On Thu, Jul 11, 2024 at 07:08:06AM -0400, Jeff Layton wrote:
> Add some tracepoints around various multigrain timestamp events.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/inode.c   |   5 ++
>  fs/stat.c|   3 ++
>  include/trace/events/timestamp.h | 109 
> +++
>  3 files changed, 117 insertions(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 2b5889ff7b36..81b45e0a95a6 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -22,6 +22,9 @@
>  #include 
>  #include 
>  #include 
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  #include "internal.h"
>  
>  /*
> @@ -2571,6 +2574,7 @@ struct timespec64 inode_set_ctime_to_ts(struct inode *inode, struct timespec64 t
>  {
>   inode->i_ctime_sec = ts.tv_sec;
>   inode->i_ctime_nsec = ts.tv_nsec & ~I_CTIME_QUERIED;
> + trace_inode_set_ctime_to_ts(inode, &ts);
>   return ts;
>  }
>  EXPORT_SYMBOL(inode_set_ctime_to_ts);
> @@ -2670,6 +2674,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode)
>   if (try_cmpxchg(&inode->i_ctime_nsec, &cur, now_ts.tv_nsec)) {
>   /* If swap occurred, then we're (mostly) done */
>   inode->i_ctime_sec = now_ts.tv_sec;
> + trace_ctime_ns_xchg(inode, cns, now_ts.tv_nsec, cur);
>   } else {
>   /*
>* Was the change due to someone marking the old ctime QUERIED?
> diff --git a/fs/stat.c b/fs/stat.c
> index df7fdd3afed9..552dfd67688b 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -23,6 +23,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  #include "internal.h"
>  #include "mount.h"
>  
> @@ -49,6 +51,7 @@ void fill_mg_cmtime(struct kstat *stat, u32 request_mask, struct inode *inode)
>   stat->mtime = inode_get_mtime(inode);
>   stat->ctime.tv_sec = inode->i_ctime_sec;
>   stat->ctime.tv_nsec = ((u32)atomic_fetch_or(I_CTIME_QUERIED, pcn)) & ~I_CTIME_QUERIED;
> + trace_fill_mg_cmtime(inode, &stat->ctime, &stat->mtime);
>  }
>  EXPORT_SYMBOL(fill_mg_cmtime);
>  
> diff --git a/include/trace/events/timestamp.h b/include/trace/events/timestamp.h
> new file mode 100644
> index ..3a603190b46c
> --- /dev/null
> +++ b/include/trace/events/timestamp.h
> @@ -0,0 +1,109 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM timestamp
> +
> +#if !defined(_TRACE_TIMESTAMP_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_TIMESTAMP_H
> +
> +#include 
> +#include 
> +
> +TRACE_EVENT(inode_set_ctime_to_ts,
> + TP_PROTO(struct inode *inode,
> +  struct timespec64 *ctime),
> +
> + TP_ARGS(inode, ctime),
> +
> + TP_STRUCT__entry(
> + __field(dev_t,  dev)
> + __field(ino_t,  ino)
> + __field(time64_t,   ctime_s)
> + __field(u32,ctime_ns)
> + __field(u32,gen)
> + ),
> +
> + TP_fast_assign(
> + __entry->dev= inode->i_sb->s_dev;

Odd indenting of the second columns between the struct definition above
and the assignment code here.

> + __entry->ino= inode->i_ino;
> + __entry->gen= inode->i_generation;
> + __entry->ctime_s= ctime->tv_sec;
> + __entry->ctime_ns   = ctime->tv_nsec;
> + ),
> +
> + TP_printk("ino=%d:%d:%ld:%u ctime=%lld.%u",
> + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
> + __entry->ctime_s, __entry->ctime_ns
> + )
> +);
> +
> +TRACE_EVENT(ctime_ns_xchg,
> + TP_PROTO(struct inode *inode,
> +  u32 old,
> +  u32 new,
> +  u32 cur),
> +
> + TP_ARGS(inode, old, new, cur),
> +
> + TP_STRUCT__entry(
> + __field(dev_t,  dev)
> + __field(ino_t,  ino)
> + __field(u32,gen)
> + __field(u32,old)
> + __field(u32,new)
> + __field(u32,cur)
> + ),
> +
> + TP_fast_assign(
> + __entry->dev= inode->i_sb->s_dev;
> + __entry->ino= inode->i_ino;
> + __entry->gen= inode->i_generation;
> + __entry->old= old;
> + __entry->new= new;
> + __entry->cur= cur;
> + ),
> +
> + TP_printk("ino=%d:%d:%ld:%u old=%u:%c new=%u cur=%u:%c",
> + MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->gen,
> + __entry->old & ~I_CTIME_QUERIED, __entry->old & I_CTIME_QUERIED ? 'Q' : '-',
> + __entry->new,
> + __entry->cur & ~I_CTIME_QUERIED, __entry->cur & I_CTIME_QUERIED ? 'Q' : '-'

Thi

Re: [PATCH v5 6/9] xfs: switch to multigrain timestamps

2024-07-11 Thread Darrick J. Wong

On Thu, Jul 11, 2024 at 07:08:10AM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> always set.
> 
> Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> should give us better semantics now.

Following up on "As long as the fs isn't touching i_ctime_nsec directly,
you shouldn't need to worry about this" from:
https://lore.kernel.org/linux-xfs/cae5c28f172ac57b7eaaa98a00b23f342f01ba64.ca...@kernel.org/

xfs /does/ touch i_ctime_nsec directly when it's writing inodes to disk.
From xfs_inode_to_disk, see:

to->di_ctime = xfs_inode_to_disk_ts(ip, inode_get_ctime(inode));

AFAICT, inode_get_ctime itself remains unchanged, and still returns
inode->__i_ctime, right?  In which case it's returning a raw timespec64,
which can include the QUERIED flag in tv_nsec, right?

Now let's look at the consumer:

static inline xfs_timestamp_t
xfs_inode_to_disk_ts(
struct xfs_inode*ip,
const struct timespec64 tv)
{
struct xfs_legacy_timestamp *lts;
xfs_timestamp_t ts;

if (xfs_inode_has_bigtime(ip))
return cpu_to_be64(xfs_inode_encode_bigtime(tv));

lts = (struct xfs_legacy_timestamp *)&ts;
lts->t_sec = cpu_to_be32(tv.tv_sec);
lts->t_nsec = cpu_to_be32(tv.tv_nsec);

return ts;
}

For the !bigtime case (aka before we added y2038 support) the queried
flag gets encoded into the tv_nsec field since xfs doesn't filter the
queried flag.

For the bigtime case, the timespec is turned into an absolute nsec count
since the xfs epoch (which is the minimum timestamp possible under the
old encoding scheme):

static inline uint64_t xfs_inode_encode_bigtime(struct timespec64 tv)
{
return xfs_unix_to_bigtime(tv.tv_sec) * NSEC_PER_SEC + tv.tv_nsec;
}

Here we'd also be mixing in the QUERIED flag, only now we've encoded a
time that's about two seconds (2^31 ns) in the future.  I think the
solution is to add a:

static inline struct timespec64
inode_peek_ctime(const struct inode *inode)
{
return (struct timespec64){
.tv_sec = inode->__i_ctime.tv_sec,
.tv_nsec = inode->__i_ctime.tv_nsec & ~I_CTIME_QUERIED,
};
}

similar to what inode_peek_iversion does for iversion; and then
xfs_inode_to_disk can do:

to->di_ctime = xfs_inode_to_disk_ts(ip, inode_peek_ctime(inode));

which would prevent I_CTIME_QUERIED from going out to disk.

At load time, xfs_inode_from_disk uses inode_set_ctime_to_ts so I think
xfs won't accidentally introduce QUERIED when it's loading an inode from
disk.

--D

> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
>  fs/xfs/xfs_iops.c   | 10 +++---
>  fs/xfs/xfs_super.c  |  2 +-
>  3 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 69fc5b981352..1f3639bbf5f0 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + ASSERT(flags & XFS_ICHGTIME_CHG);
>  
> + tv = inode_set_ctime_current(inode);
>   if (flags & XFS_ICHGTIME_MOD)
>   inode_set_mtime_to_ts(inode, tv);
> - if (flags & XFS_ICHGTIME_CHG)
> - inode_set_ctime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CREATE)
>   ip->i_crtime = tv;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a00dcbc77e12..d25872f818fa 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -592,8 +592,9 @@ xfs_vn_getattr(
>   stat->gid = vfsgid_into_kgid(vfsgid);
>   stat->ino = ip->i_ino;
>   stat->atime = inode_get_atime(inode);
> - stat->mtime = inode_get_mtime(inode);
> - stat->ctime = inode_get_ctime(inode);
> +
> + fill_mg_cmtime(stat, request_mask, inode);
> +
>   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
>  
>   if (xfs_has_v3inodes(mp)) {
> @@ -603,11 +604,6 @@ xfs_vn_getattr(
>   }
>   }
>  
> - if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> - stat->change_cookie = inode_query_iversion(inode);
> - stat->result_mask |= STATX_CHANGE_COOKIE;
> - }
> -
>   /*
>* Note: If you add another clause to set an attribute flag, please
>* update attributes_mask below.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_sup

Re: [PATCH v4 6/9] xfs: switch to multigrain timestamps

2024-07-08 Thread Darrick J. Wong

On Mon, Jul 08, 2024 at 02:51:07PM -0400, Jeff Layton wrote:
> On Mon, 2024-07-08 at 11:47 -0700, Darrick J. Wong wrote:
> > On Mon, Jul 08, 2024 at 11:53:39AM -0400, Jeff Layton wrote:
> > > Enable multigrain timestamps, which should ensure that there is an
> > > apparent change to the timestamp whenever it has been written after
> > > being actively observed via getattr.
> > > 
> > > Also, anytime the mtime changes, the ctime must also change, and those
> > > are now the only two options for xfs_trans_ichgtime. Have that function
> > > unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> > > always set.
> > > 
> > > Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> > > should give us better semantics now.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
> > >  fs/xfs/xfs_iops.c   | 10 +++---
> > >  fs/xfs/xfs_super.c  |  2 +-
> > >  3 files changed, 7 insertions(+), 11 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> > > index 69fc5b981352..1f3639bbf5f0 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_inode.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> > > @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
> > >   ASSERT(tp);
> > >   xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> > >  
> > > - tv = current_time(inode);
> > > + /* If the mtime changes, then ctime must also change */
> > > + ASSERT(flags & XFS_ICHGTIME_CHG);
> > >  
> > > + tv = inode_set_ctime_current(inode);
> > >   if (flags & XFS_ICHGTIME_MOD)
> > >   inode_set_mtime_to_ts(inode, tv);
> > > - if (flags & XFS_ICHGTIME_CHG)
> > > - inode_set_ctime_to_ts(inode, tv);
> > >   if (flags & XFS_ICHGTIME_CREATE)
> > >   ip->i_crtime = tv;
> > >  }
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index a00dcbc77e12..d25872f818fa 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -592,8 +592,9 @@ xfs_vn_getattr(
> > >   stat->gid = vfsgid_into_kgid(vfsgid);
> > >   stat->ino = ip->i_ino;
> > >   stat->atime = inode_get_atime(inode);
> > > - stat->mtime = inode_get_mtime(inode);
> > > - stat->ctime = inode_get_ctime(inode);
> > > +
> > > + fill_mg_cmtime(stat, request_mask, inode);
> > 
> > Sooo... for setting up a commit-range operation[1], XFS_IOC_START_COMMIT
> > could populate its freshness data by calling:
> > 
> > struct kstat dummy;
> > 
> > > fill_mg_cmtime(&dummy, STATX_CTIME | STATX_MTIME, inode);
> > 
> > and then using dummy.[cm]time to populate the freshness data that it
> > gives to userspace, right?  Having set QUERIED, a write to the file
> > immediately afterwards will cause a (tiny) increase in ctime_nsec which
> > will cause the XFS_IOC_COMMIT_RANGE to reject the commit[2].  Right?
> > 
> 
> Yes. Once you call fill_mg_cmtime, the first write after that point
> should cause the kernel to ensure that there is a distinct change in
> the ctime.
> 
> IOW, I think this should alleviate the concerns I had before with using
> timestamps with the XFS_IOC_COMMIT_RANGE interface.

Cool, thank you!  Apologies for roaring earlier.

--D

> > --D
> > 
> > [1] https://lore.kernel.org/linux-xfs/20240227174649.GL6184@frogsfrogsfrogs/
> > [2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=atomic-file-commits&id=0520d89c2698874c1f56ddf52ec4b8a3595baa14
> > 
> > > +
> > >   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
> > >  
> > >   if (xfs_has_v3inodes(mp)) {
> > > @@ -603,11 +604,6 @@ xfs_vn_getattr(
> > >   }
> > >   }
> > >  
> > > - if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> > > - stat->change_cookie = inode_query_iversion(inode);
> > > - stat->result_mask |= STATX_CHANGE_COOKIE;
> > > - }
> > > -
> > >   /*
> > >* Note: If you add another clause to set an attribute flag, please
> > >* update attributes_mask below.
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 27e9f749c4c7..210481b03fdb 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
> > >   .init_fs_context= xfs_init_fs_context,
> > >   .parameters = xfs_fs_parameters,
> > >   .kill_sb= xfs_kill_sb,
> > > - .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> > > + .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
> > >  };
> > >  MODULE_ALIAS_FS("xfs");
> > >  
> > > 
> > > -- 
> > > 2.45.2
> > > 
> > > 
> 
> -- 
> Jeff Layton 
>

Re: [PATCH v4 6/9] xfs: switch to multigrain timestamps

2024-07-08 Thread Darrick J. Wong

On Mon, Jul 08, 2024 at 11:53:39AM -0400, Jeff Layton wrote:
> Enable multigrain timestamps, which should ensure that there is an
> apparent change to the timestamp whenever it has been written after
> being actively observed via getattr.
> 
> Also, anytime the mtime changes, the ctime must also change, and those
> are now the only two options for xfs_trans_ichgtime. Have that function
> unconditionally bump the ctime, and ASSERT that XFS_ICHGTIME_CHG is
> always set.
> 
> Finally, stop setting STATX_CHANGE_COOKIE in getattr, since the ctime
> should give us better semantics now.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++---
>  fs/xfs/xfs_iops.c   | 10 +++---
>  fs/xfs/xfs_super.c  |  2 +-
>  3 files changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
> index 69fc5b981352..1f3639bbf5f0 100644
> --- a/fs/xfs/libxfs/xfs_trans_inode.c
> +++ b/fs/xfs/libxfs/xfs_trans_inode.c
> @@ -62,12 +62,12 @@ xfs_trans_ichgtime(
>   ASSERT(tp);
>   xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
>  
> - tv = current_time(inode);
> + /* If the mtime changes, then ctime must also change */
> + ASSERT(flags & XFS_ICHGTIME_CHG);
>  
> + tv = inode_set_ctime_current(inode);
>   if (flags & XFS_ICHGTIME_MOD)
>   inode_set_mtime_to_ts(inode, tv);
> - if (flags & XFS_ICHGTIME_CHG)
> - inode_set_ctime_to_ts(inode, tv);
>   if (flags & XFS_ICHGTIME_CREATE)
>   ip->i_crtime = tv;
>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a00dcbc77e12..d25872f818fa 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -592,8 +592,9 @@ xfs_vn_getattr(
>   stat->gid = vfsgid_into_kgid(vfsgid);
>   stat->ino = ip->i_ino;
>   stat->atime = inode_get_atime(inode);
> - stat->mtime = inode_get_mtime(inode);
> - stat->ctime = inode_get_ctime(inode);
> +
> + fill_mg_cmtime(stat, request_mask, inode);

Sooo... for setting up a commit-range operation[1], XFS_IOC_START_COMMIT
could populate its freshness data by calling:

struct kstat dummy;

fill_mg_cmtime(&dummy, STATX_CTIME | STATX_MTIME, inode);

and then using dummy.[cm]time to populate the freshness data that it
gives to userspace, right?  Having set QUERIED, a write to the file
immediately afterwards will cause a (tiny) increase in ctime_nsec which
will cause the XFS_IOC_COMMIT_RANGE to reject the commit[2].  Right?

--D

[1] https://lore.kernel.org/linux-xfs/20240227174649.GL6184@frogsfrogsfrogs/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=atomic-file-commits&id=0520d89c2698874c1f56ddf52ec4b8a3595baa14

> +
>   stat->blocks = XFS_FSB_TO_BB(mp, ip->i_nblocks + ip->i_delayed_blks);
>  
>   if (xfs_has_v3inodes(mp)) {
> @@ -603,11 +604,6 @@ xfs_vn_getattr(
>   }
>   }
>  
> - if ((request_mask & STATX_CHANGE_COOKIE) && IS_I_VERSION(inode)) {
> - stat->change_cookie = inode_query_iversion(inode);
> - stat->result_mask |= STATX_CHANGE_COOKIE;
> - }
> -
>   /*
>* Note: If you add another clause to set an attribute flag, please
>* update attributes_mask below.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 27e9f749c4c7..210481b03fdb 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -2052,7 +2052,7 @@ static struct file_system_type xfs_fs_type = {
>   .init_fs_context= xfs_init_fs_context,
>   .parameters = xfs_fs_parameters,
>   .kill_sb= xfs_kill_sb,
> - .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP,
> + .fs_flags   = FS_REQUIRES_DEV | FS_ALLOW_IDMAP | FS_MGTIME,
>  };
>  MODULE_ALIAS_FS("xfs");
>  
> 
> -- 
> 2.45.2
> 
>

Re: [PATCH v4 1/9] fs: add infrastructure for multigrain timestamps

2024-07-08 Thread Darrick J. Wong

On Mon, Jul 08, 2024 at 11:53:34AM -0400, Jeff Layton wrote:
> The VFS has always used coarse-grained timestamps when updating the
> ctime and mtime after a change. This has the benefit of allowing
> filesystems to optimize away a lot of metadata updates, down to around 1
> per jiffy, even when a file is under heavy writes.
> 
> Unfortunately, this has always been an issue when we're exporting via
> NFSv3, which relies on timestamps to validate caches. A lot of changes
> can happen in a jiffy, so timestamps aren't sufficient to help the
> client decide when to invalidate the cache. Even with NFSv4, a lot of
> exported filesystems don't properly support a change attribute and are
> subject to the same problems with timestamp granularity. Other
> applications have similar issues with timestamps (e.g. backup
> applications).
> 
> If we were to always use fine-grained timestamps, that would improve the
> situation, but that becomes rather expensive, as the underlying
> filesystem would have to log a lot more metadata updates.
> 
> What we need is a way to only use fine-grained timestamps when they are
> being actively queried. Use the (unused) top bit in inode->i_ctime_nsec
> as a flag that indicates whether the current timestamps have been
> queried via stat() or the like. When it's set, we allow the kernel to
> use a fine-grained timestamp iff it's necessary to make the ctime show
> a different value.

I appreciate the v3->v4 change that hides the QUERIED flag in the
upper bit of the ctime nanoseconds, instead of giving up all support for
post-2262 timestamps.  Thank you. :)

> This solves the problem of being able to distinguish the timestamp
> between updates, but introduces a new problem: it's now possible for a
> file being changed to get a fine-grained timestamp. A file that is
> altered just a bit later can then get a coarse-grained one that appears
> older than the earlier fine-grained time. This violates timestamp
> ordering guarantees.
> 
> To remedy this, keep a global monotonic ktime_t value that acts as a
> timestamp floor.  When we go to stamp a file, we first get the later of
> the current floor value and the current coarse-grained time. If the
> inode ctime hasn't been queried then we just attempt to stamp it with
> that value.
> 
> If it has been queried, then first see whether the current coarse time
> is later than the existing ctime. If it is, then we accept that value.
> If it isn't, then we get a fine-grained time and try to swap that into
> the global floor. Whether that succeeds or fails, we take the resulting
> floor time, convert it to realtime and try to swap that into the ctime.

Makes sense to me.  One question, though -- mgtime filesystems that want
to persist a ctime to disk are going to have to do something like this,
right?

di_ctime_ns = cpu_to_be32(atomic_read(&inode->i_ctime_nsec) &
  ~I_CTIME_QUERIED);

IOWs, they need to mask off the QUERIED flag (aka bit 31) so that they
never store a strange looking nanoseconds value.  Probably they should
already be doing this, but I wouldn't trust them already to be clamping
the nsec value.

I'm mostly thinking of xfs_inode_to_disk, which currently calls
inode_get_ctime() but doesn't clamp nsec at all before writing it to
disk.  Does that need to mask off I_CTIME_QUERIED explicitly?

> We take the result of the ctime swap whether it succeeds or fails, since
> either is just as valid.
> 
> Filesystems can opt into this by setting the FS_MGTIME fstype flag.
> Others should be unaffected (other than being subject to the same floor
> value as multigrain filesystems).
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/inode.c | 171 
> -
>  fs/stat.c  |  36 ++-
>  include/linux/fs.h |  34 ---
>  3 files changed, 204 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index f356fe2ec2b6..10ed1d3d9b52 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -60,6 +60,13 @@ static unsigned int i_hash_shift __ro_after_init;
>  static struct hlist_head *inode_hashtable __ro_after_init;
>  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
>  
> +/*
> + * This represents the latest fine-grained time that we have handed out as a
> + * timestamp on the system. Tracked as a monotonic value, and converted to the
> + * realtime clock on an as-needed basis.
> + */
> +static __cacheline_aligned_in_smp ktime_t ctime_floor;

ktime_get claims that it stops when the system is suspended; will that
cause problems after a resume?  I /think/ the answer is no because any
change to the file after a resume will first determine the coarse grain
ctime value, which will be far beyond the floor.  The new coarse grain
ctime will be written to the inode and become the new floor, as provided
by coarse_ctime().

The rest of the code here makes sense to me.

--D

> +
>  /*
>   * Empty aops. Can be used for the cases where the user does not
>   * de

Re: [PATCH 01/10] fs: turn inode ctime fields into a single ktime_t

2024-07-01 Thread Darrick J. Wong

On Wed, Jun 26, 2024 at 09:00:21PM -0400, Jeff Layton wrote:
> The ctime is not settable to arbitrary values. It always comes from the
> system clock, so we'll never stamp an inode with a value that can't be
> represented there. If we disregard people setting their system clock
> past the year 2262, there is no reason we can't replace the ctime fields
> with a ktime_t.
> 
> Switch the ctime fields to a single ktime_t. Move the i_generation down
> above i_fsnotify_mask and then move the i_version into the resulting 8
> byte hole. This shrinks struct inode by 8 bytes total, and should
> improve the cache footprint as the i_version and ctime are usually
> updated together.
> 
> The one downside I can see to switching to a ktime_t is that if someone
> has a filesystem with files on it that has ctimes outside the ktime_t
> range (before ~1678 AD or after ~2262 AD), we won't be able to display
> them properly in stat() without some special treatment in the
> filesystem. The operating assumption here is that that is not a
> practical problem.

What happens if a filesystem has the ability to store ctimes beyond
whatever ktime_t supports (AFAICT 2^63-1 nanoseconds on either side of
the Unix epoch)?  I think the behavior with your patch is that ktime_set
clamps the ctime on iget because the kernel can't handle it?

It's a little surprising that the ctime will suddenly jump back in time
to 2262, but maybe you're right that nobody will notice or care? ;)

--D

> 
> Signed-off-by: Jeff Layton 
> ---
>  include/linux/fs.h | 26 +++---
>  1 file changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 5ff362277834..5139dec085f2 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -662,11 +662,10 @@ struct inode {
>   loff_t  i_size;
>   time64_ti_atime_sec;
>   time64_ti_mtime_sec;
> - time64_ti_ctime_sec;
>   u32 i_atime_nsec;
>   u32 i_mtime_nsec;
> - u32 i_ctime_nsec;
> - u32 i_generation;
> + ktime_t __i_ctime;
> + atomic64_t  i_version;
>   spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
>   unsigned short  i_bytes;
>   u8  i_blkbits;
> @@ -701,7 +700,6 @@ struct inode {
>   struct hlist_head   i_dentry;
>   struct rcu_head i_rcu;
>   };
> - atomic64_t  i_version;
>   atomic64_t  i_sequence; /* see futex */
>   atomic_ti_count;
>   atomic_ti_dio_count;
> @@ -724,6 +722,8 @@ struct inode {
>   };
>  
>  
> + u32 i_generation;
> +
>  #ifdef CONFIG_FSNOTIFY
>   __u32   i_fsnotify_mask; /* all events this inode cares about */
>   /* 32-bit hole reserved for expanding i_fsnotify_mask */
> @@ -1608,29 +1608,25 @@ static inline struct timespec64 inode_set_mtime(struct inode *inode,
>   return inode_set_mtime_to_ts(inode, ts);
>  }
>  
> -static inline time64_t inode_get_ctime_sec(const struct inode *inode)
> +static inline struct timespec64 inode_get_ctime(const struct inode *inode)
>  {
> - return inode->i_ctime_sec;
> + return ktime_to_timespec64(inode->__i_ctime);
>  }
>  
> -static inline long inode_get_ctime_nsec(const struct inode *inode)
> +static inline time64_t inode_get_ctime_sec(const struct inode *inode)
>  {
> - return inode->i_ctime_nsec;
> + return inode_get_ctime(inode).tv_sec;
>  }
>  
> -static inline struct timespec64 inode_get_ctime(const struct inode *inode)
> +static inline long inode_get_ctime_nsec(const struct inode *inode)
>  {
> - struct timespec64 ts = { .tv_sec  = inode_get_ctime_sec(inode),
> -  .tv_nsec = inode_get_ctime_nsec(inode) };
> -
> - return ts;
> + return inode_get_ctime(inode).tv_nsec;
>  }
>  
>  static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
> struct timespec64 ts)
>  {
> - inode->i_ctime_sec = ts.tv_sec;
> - inode->i_ctime_nsec = ts.tv_nsec;
> + inode->__i_ctime = ktime_set(ts.tv_sec, ts.tv_nsec);
>   return ts;
>  }
>  
> 
> -- 
> 2.45.2
> 
>

Re: [PATCH v4 7/7] fs/xfs: Add dedupe support for fsdax

2021-04-08 Thread Darrick J. Wong

On Thu, Apr 08, 2021 at 08:04:32PM +0800, Shiyang Ruan wrote:
> Add xfs_break_two_dax_layouts() to break layout for two dax files.  Then
> call compare range function only when files are both DAX or not.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_file.c| 20 
>  fs/xfs/xfs_inode.c   |  8 +++-
>  fs/xfs/xfs_inode.h   |  1 +
>  fs/xfs/xfs_reflink.c |  5 +++--
>  4 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5795d5d6f869..1fd457167c12 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -842,6 +842,26 @@ xfs_break_dax_layouts(
>   0, 0, xfs_wait_dax_page(inode));
>  }
>  
> +int
> +xfs_break_two_dax_layouts(
> + struct inode*src,
> + struct inode*dest)
> +{
> + int error;
> + boolretry = false;
> +
> +retry:
> + error = xfs_break_dax_layouts(src, &retry);
> + if (error || retry)
> + goto retry;
> +
> + error = xfs_break_dax_layouts(dest, &retry);
> + if (error || retry)
> + goto retry;
> +
> + return error;
> +}
> +
>  int
>  xfs_break_layouts(
>   struct inode *inode,
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index f93370bd7b1e..c01786917eef 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3713,8 +3713,10 @@ xfs_ilock2_io_mmap(
>   struct xfs_inode *ip2)
>  {
>   int ret;
> + struct inode *inode1 = VFS_I(ip1);
> + struct inode *inode2 = VFS_I(ip2);
>  
> - ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
> + ret = xfs_iolock_two_inodes_and_break_layout(inode1, inode2);
>   if (ret)
>   return ret;
>   if (ip1 == ip2)
> @@ -3722,6 +3724,10 @@ xfs_ilock2_io_mmap(
>   else
>   xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
>   ip2, XFS_MMAPLOCK_EXCL);
> +
> + if (IS_DAX(inode1) && IS_DAX(inode2))
> + ret = xfs_break_two_dax_layouts(inode1, inode2);

This is wrong on many levels.

The first problem is that xfs_break_two_dax_layouts calls
xfs_break_dax_layouts twice even if inode1 == inode2, which is
unnecessary.

The second problem is that xfs_break_dax_layouts can cycle the MMAPLOCK
on the inode that it's processing.  Since there are two inodes in play
here, you must be /very/ careful about maintaining correct locking order,
which for the MMAPLOCK is increasing order of xfs_inode.i_ino.  If you
drop the MMAPLOCK for the lower-numbered inode for any reason, you have
to drop both MMAPLOCKs and try again.

In other words, you have to replace all that nice MMAPLOCK code with a
new xfs_mmaplock_two_inodes_and_break_dax_layouts function that is
structured similarly to what xfs_iolock_two_inodes_and_break_layout
does for the IOLOCK and PNFS layouts.
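
To make the ordering rule concrete, here is a rough userspace sketch, not
kernel code.  Everything named toy_* is made up for illustration: the
boolean stands in for the MMAPLOCK, busy_rounds stands in for "the layout
break had to cycle the lock", and ino plays the role of xfs_inode.i_ino.
It restarts whenever either lock was cycled, which is more conservative
than strictly required.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode carrying an MMAPLOCK-like lock. */
struct toy_inode {
	unsigned long ino;	/* ordering key, like xfs_inode.i_ino */
	bool locked;		/* true while the toy lock is held */
	int busy_rounds;	/* times the layout break must cycle the lock */
};

static void toy_lock(struct toy_inode *ip)
{
	assert(!ip->locked);	/* catches locking the same inode twice */
	ip->locked = true;
}

static void toy_unlock(struct toy_inode *ip)
{
	assert(ip->locked);
	ip->locked = false;
}

/*
 * Model of a layout break: if the inode is still busy, drop its lock to
 * wait, retake it, and tell the caller that the lock was cycled.
 */
static bool toy_break_layouts(struct toy_inode *ip)
{
	if (ip->busy_rounds == 0)
		return false;
	ip->busy_rounds--;
	toy_unlock(ip);
	toy_lock(ip);
	return true;
}

/*
 * Lock both inodes in increasing ino order and break layouts on each.
 * If any lock was cycled, drop both and start over so the ordering is
 * never violated.  Handles a == b by locking and breaking only once.
 * Returns the number of passes; both locks are held on return.
 */
static int toy_lock_two_and_break(struct toy_inode *a, struct toy_inode *b)
{
	struct toy_inode *lo = a, *hi = b;
	int passes = 0;

	if (a->ino > b->ino) {
		lo = b;
		hi = a;
	}

	for (;;) {
		bool retry;

		passes++;
		toy_lock(lo);
		if (hi != lo)
			toy_lock(hi);

		retry = toy_break_layouts(lo);
		if (!retry && hi != lo)
			retry = toy_break_layouts(hi);
		if (!retry)
			return passes;

		if (hi != lo)
			toy_unlock(hi);
		toy_unlock(lo);
	}
}
```

The single-inode case falls out of the hi != lo checks, which also covers
the first problem above.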

> +
>   return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 88ee4c3930ae..5ef21924dddc 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -435,6 +435,7 @@ enum xfs_prealloc_flags {
>  
>  int  xfs_update_prealloc_flags(struct xfs_inode *ip,
> enum xfs_prealloc_flags flags);
> +int  xfs_break_two_dax_layouts(struct inode *inode1, struct inode *inode2);
>  int  xfs_break_layouts(struct inode *inode, uint *iolock,
>   enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index a4cd6e8a7aa0..4426bcc8a985 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -29,6 +29,7 @@
>  #include "xfs_iomap.h"
>  #include "xfs_sb.h"
>  #include "xfs_ag_resv.h"
> +#include 

Why is this necessary?

--D

>  
>  /*
>   * Copy on Write of Shared Blocks
> @@ -1324,8 +1325,8 @@ xfs_reflink_remap_prep(
>   if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>   goto out_unlock;
>  
> - /* Don't share DAX file data for now. */
> - if (IS_DAX(inode_in) || IS_DAX(inode_out))
> + /* Don't share DAX file data with non-DAX file. */
> + if (IS_DAX(inode_in) != IS_DAX(inode_out))
>   goto out_unlock;
>  
>   if (!IS_DAX(inode_in))
> -- 
> 2.31.0
> 
> 
>

Re: [PATCH v4 6/7] fs/xfs: Handle CoW for fsdax write() path

2021-04-08 Thread Darrick J. Wong

On Thu, Apr 08, 2021 at 08:04:31PM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After
> CoW, newly allocated extents need to be remapped to the file.  So, add an
> iomap_end for dax write ops to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c  |  9 +++
>  fs/xfs/xfs_iomap.c | 58 +-
>  fs/xfs/xfs_iomap.h |  4 +++
>  fs/xfs/xfs_iops.c  |  7 +++--
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 69 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index e7d68318e6a5..9fcea33dd2c9 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -954,8 +954,7 @@ xfs_free_file_space(
>   return 0;
>   if (offset + len > XFS_ISIZE(ip))
>   len = XFS_ISIZE(ip) - offset;
> - error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - &xfs_buffered_write_iomap_ops);
> + error = xfs_iomap_zero_range(VFS_I(ip), offset, len, NULL);
>   if (error)
>   return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a007ca0711d9..5795d5d6f869 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -684,11 +684,8 @@ xfs_file_dax_write(
>   pos = iocb->ki_pos;
>  
>   trace_xfs_file_dax_write(iocb, from);
> - ret = dax_iomap_rw(iocb, from, &xfs_direct_write_iomap_ops);
> - if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> - i_size_write(inode, iocb->ki_pos);
> - error = xfs_setfilesize(ip, pos, ret);
> - }
> + ret = dax_iomap_rw(iocb, from, &xfs_dax_write_iomap_ops);
> +
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> @@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
>  
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
> -  &xfs_direct_write_iomap_ops :
> +  &xfs_dax_write_iomap_ops :
>&xfs_read_iomap_ops);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e17ab7f42928..f818f989687b 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -760,7 +760,8 @@ xfs_direct_write_iomap_begin(
>  
>   /* may drop and re-acquire the ilock */
>   error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> - &lockmode, flags & IOMAP_DIRECT);
> + &lockmode,
> + flags & IOMAP_DIRECT || IS_DAX(inode));

Parentheses, please:
(flags & IOMAP_DIRECT) || IS_DAX(inode));
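
For what it's worth, C already parses both spellings the same way, since
bitwise & binds tighter than ||, so the parentheses only make the intent
obvious to the reader.  A small userspace sketch of that; TOY_IOMAP_DIRECT
and the function names are invented here and the flag value is arbitrary:

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_IOMAP_DIRECT 0x10u	/* hypothetical flag value for this sketch */

/* The expression as written in the patch. */
static bool arg_plain(unsigned int flags, bool is_dax)
{
	return flags & TOY_IOMAP_DIRECT || is_dax;
}

/* The expression with the requested parentheses. */
static bool arg_parenthesized(unsigned int flags, bool is_dax)
{
	return (flags & TOY_IOMAP_DIRECT) || is_dax;
}

/* Compare both spellings over a small flag space. */
static void check_same_result(void)
{
	for (unsigned int flags = 0; flags < 0x40u; flags++) {
		assert(arg_plain(flags, false) == arg_parenthesized(flags, false));
		assert(arg_plain(flags, true) == arg_parenthesized(flags, true));
	}
}
```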

>   if (error)
>   goto out_unlock;
>   if (shared)
> @@ -853,6 +854,38 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>   .iomap_begin = xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> + struct inode *inode,
> + loff_t pos,
> + loff_t length,
> + ssize_t written,
> + unsigned int flags,
> + struct iomap *iomap)
> +{
> + int error = 0;
> + xfs_inode_t *ip = XFS_I(inode);

Please don't use typedefs:

	struct xfs_inode *ip = XFS_I(inode);

> + bool cow = xfs_is_cow_inode(ip);
> +
> + if (pos + written > i_size_read(inode)) {

What if we wrote zero bytes?  Usually that means error, right?
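
To spell out the worry with a userspace sketch: with written == 0 and pos
beyond the current size, the comparison above is still true, so the size
would move out to pos even though nothing was written.  The function names
below are made up, and whether a zero-byte write should be treated as an
error or just skipped is exactly the open question; the guarded variant
only shows one possible shape.

```c
#include <assert.h>

/* Size update with the same shape as the quoted check, no guard. */
static long long toy_size_unguarded(long long isize, long long pos,
				    long long written)
{
	assert(written >= 0);
	return (pos + written > isize) ? pos + written : isize;
}

/* Same rule, but a zero-byte write never changes the size. */
static long long toy_size_guarded(long long isize, long long pos,
				  long long written)
{
	assert(written >= 0);
	if (written == 0)
		return isize;
	return (pos + written > isize) ? pos + written : isize;
}
```

For example, with a 100-byte file and a failed write at offset 200, the
unguarded rule reports a new size of 200 while the guarded one keeps 100.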

> + i_size_write(inode, pos + written);
> + error = xfs_setfilesize(ip, pos, written);
> + if (error && cow) {
> + xfs_reflink_cancel_cow_range(ip, pos, written, true);
> + return error;
> + }
> + }
> + if (cow)
> + error = xfs_reflink_end_cow(ip, pos, written);
> +
> + return error;
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> + .iomap_begin = xfs_direct_write_iomap_begin,
> + .iomap_end   = xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(
>   struct inode *inode,
> @@ -1314,3 +1347,26 @@ xfs_xattr_iomap_begin(
>  const struct iomap_ops xfs_xattr_iomap_ops = {
>   .iomap_begin = xfs_xattr_iomap_begin,
>  };
> +
> +int
> +xfs_iomap_zero_range(
> + struct inode *inode,

Might as well pass the xfs_inode pointers directly into these two functions.

--D

> + loff_t offset,
> + loff_t len,
> + bool *did_zero)
> +{
> + return iomap_zero_range(inode, offset, len, di

Re: [PATCH v4 5/7] fsdax: Dedup file range to use a compare function

2021-04-08 Thread Darrick J. Wong

On Thu, Apr 08, 2021 at 08:04:30PM +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison function which is similar to
> vfs_dedupe_file_range_compare().
> And introduce dax_remap_file_range_prep() for filesystem use.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 56 
>  fs/remap_range.c | 45 ---
>  fs/xfs/xfs_reflink.c |  9 +--
>  include/linux/dax.h  |  4 
>  include/linux/fs.h   | 12 ++
>  5 files changed, 112 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fcd1e932716e..ba924b6629a6 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1849,3 +1849,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap)
> +{
> + void *saddr, *daddr;
> + bool *same = data;
> + int ret;
> +
> + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> + *same = true;
> + return len;
> + }
> +
> + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> + *same = false;
> + return 0;
> + }
> +
> + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +   &saddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +   &daddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + *same = !memcmp(saddr, daddr, len);
> + return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> + const struct iomap_ops *ops)
> +{
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +is_same, dax_range_compare_actor);
> + if (ret < 0 || !*is_same)
> + goto out;
> +
> + len -= ret;
> + srcoff += ret;
> + destoff += ret;
> + }
> + ret = 0;
> +out:
> + dax_read_unlock(id);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index e4a5fdd7ad7b..1fab0db49c68 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  #include 
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, 
> struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -  struct inode *dest, loff_t destoff,
> -  loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +   struct inode *dest, loff_t destoff,
> +   loff_t len, bool *is_same)
>  {
>   loff_t src_poff;
>   loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode 
> *src, loff_t srcoff,
>  out_error:
>   return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>  
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode 
> *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -   struct file *file_out, loff_t pos_out,
> -   loff_t *len, unsigned int remap_flags)
> +static int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + loff_t *len, unsigned int remap_flags,
> + const struct iomap_ops *ops)

Can we rename @ops to @dax_read_ops instead?

>  {
>   struct inode *inode_in = file_inode(file_in);
>   struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, 
> loff_t pos_in,
>   if (remap_flags & REMA

Re: [PATCH v4 4/7] iomap: Introduce iomap_apply2() for operations on two files

2021-04-08 Thread Darrick J. Wong

On Thu, Apr 08, 2021 at 08:04:29PM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, requires nested iomap_open()/iomap_end() on two file.  Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
> 
> Signed-off-by: Shiyang Ruan 

Kinda wish we weren't propagating even more indirect call usage, but oh
well.

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/apply.c  | 52 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..0493da5286ad 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,55 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t 
> length, unsigned flags,
>  
>   return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t 
> pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret, ret2 = 0;
> + loff_t len1 = length, len2, min_len;
> +
> + ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
> + if (ret)
> + goto out;
> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> + len2 = min_t(loff_t, len1, smap.length);
> +
> + ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
> + if (ret)
> + goto out_src;
> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + min_len = min_t(loff_t, len2, dmap.length);
> +
> + written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret2 = ops->iomap_end(ino2, pos2, len2,
> +   written > 0 ? written : 0, flags, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, len1,
> +  written > 0 ? written : 0, flags, &smap);
> +out:
> + if (written)
> + return written;
> + return ret ?: ret2;
> +}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index d202fd2d0f91..9493c48bcc9c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -150,10 +150,15 @@ struct iomap_ops {
>   */
>  typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
>   void *data, struct iomap *iomap, struct iomap *srcmap);
> -
> +typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap);
>  loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
>   unsigned flags, const struct iomap_ops *ops, void *data,
>   iomap_actor_t actor);
> +loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
> + loff_t pos2, loff_t length, unsigned int flags,
> + const struct iomap_ops *ops, void *data, iomap_actor2_t actor);
>  
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>   const struct iomap_ops *ops);
> -- 
> 2.31.0
> 
> 
>

Re: [PATCH v4 2/7] fsdax: Replace mmap entry in case of CoW

2021-04-08 Thread Darrick J. Wong

On Thu, Apr 08, 2021 at 08:04:27PM +0800, Shiyang Ruan wrote:
> We replace the existing entry to the newly allocated one in case of CoW.
> Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
> entry as writeprotected.  This helps us snapshots so new write
> pagefaults after snapshots trigger a CoW.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c | 39 ---
>  1 file changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b4fd3813457a..e6c1354b27a8 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -722,6 +722,10 @@ static int copy_cow_page_dax(struct block_device *bdev, 
> struct dax_device *dax_d
>   return 0;
>  }
>  
> +/* DAX Insert Flag for the entry we insert */

Might be worth mentioning that these are xarray marks for the inserted
entry, since this comment didn't help much.

> +#define DAX_IF_DIRTY (1 << 0)
> +#define DAX_IF_COW   (1 << 1)
> +
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -729,16 +733,19 @@ static int copy_cow_page_dax(struct block_device *bdev, 
> struct dax_device *dax_d
>   * already in the tree, we will skip the insertion and just dirty the PMD as
>   * appropriate.
>   */
> -static void *dax_insert_entry(struct xa_state *xas,
> - struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned int insert_flags)

Urk, two flags arguments.  Oh, I see.  We insert (shifted) pfn_t values
into the mapping as xarray values, so @flags determines the state flags
of the new entry value, whereas @insert_flags determines what xarray
mark we're going to attach (if any) to the inserted value.
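
A toy userspace model of that split, in case it helps the next reader.
None of this is the real fs/dax.c code: the TOY_* names, the shift width,
the bit values and the toy_slot struct are all assumptions made up for the
sketch.  It only shows that one argument ends up inside the stored value
while the other selects marks kept next to it.

```c
#include <assert.h>

#define TOY_SHIFT	4		/* assumed number of low state bits */
#define TOY_PMD		(1UL << 1)	/* state flag stored in the value */
#define TOY_ZERO_PAGE	(1UL << 2)	/* state flag stored in the value */

#define TOY_IF_DIRTY	(1U << 0)	/* insert flag: request a dirty mark */
#define TOY_IF_COW	(1U << 1)	/* insert flag: request a towrite mark */

#define TOY_MARK_DIRTY		(1U << 0)
#define TOY_MARK_TOWRITE	(1U << 1)

/* One slot of a pretend mapping tree: a value plus side-band marks. */
struct toy_slot {
	unsigned long value;	/* (pfn << TOY_SHIFT) | state flags */
	unsigned int marks;	/* marks attached to the slot, not the value */
};

static struct toy_slot toy_insert(unsigned long pfn, unsigned long flags,
				  unsigned int insert_flags)
{
	struct toy_slot slot;

	/* State flags must fit below the shifted pfn. */
	assert((flags >> TOY_SHIFT) == 0);

	slot.value = (pfn << TOY_SHIFT) | flags;
	slot.marks = 0;
	if (insert_flags & TOY_IF_DIRTY)
		slot.marks |= TOY_MARK_DIRTY;
	if (insert_flags & TOY_IF_COW)
		slot.marks |= TOY_MARK_TOWRITE;
	return slot;
}
```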

--D

>  {
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -750,7 +757,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>   void *old;
>  
>   dax_disassociate_entry(entry, mapping, false);
> @@ -774,6 +781,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>   if (dirty)
>   xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>   xas_unlock_irq(xas);
>   return entry;
>  }
> @@ -1109,8 +1119,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>   vm_fault_t ret;
>  
> - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_ZERO_PAGE, false);
> + *entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);
>  
>   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>   trace_dax_load_hole(inode, vmf, ret);
> @@ -1137,8 +1146,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>   goto fallback;
>  
>   pfn = page_to_pfn_t(zero_page);
> - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_PMD | DAX_ZERO_PAGE, false);
> + *entry = dax_insert_entry(xas, vmf, *entry, pfn,
> +   DAX_PMD | DAX_ZERO_PAGE, 0);
>  
>   if (arch_needs_pgtable_deposit()) {
>   pgtable = pte_alloc_one(vma->vm_mm);
> @@ -1444,6 +1453,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, 
> pfn_t *pfnp,
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
>   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
>   unsigned long entry_flags = pmd ? DAX_PMD : 0;
> + unsigned int insert_flags = 0;
>   int err = 0;
>   pfn_t pfn;
>   void *kaddr;
> @@ -1466,8 +1476,15 @@ static vm_fault_t dax_fault_actor(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   if (err)
>   return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
>  
> - *entry = dax_insert_entry(xas, mapping, vmf, *e

Re: [PATCH v4 1/7] fsdax: Introduce dax_iomap_cow_copy()

2021-04-08 Thread Darrick J. Wong

On Thu, Apr 08, 2021 at 08:04:26PM +0800, Shiyang Ruan wrote:
> In the case where the iomap is a write operation and iomap is not equal
> to srcmap after iomap_begin, we consider it is a CoW operation.
> 
> The destination extent which iomap indicated is a newly allocated extent.
> So, it is needed to copy the data from srcmap to new allocated extent.
> In theory, it is better to copy the head and tail ranges which is
> outside of the non-aligned area instead of copying the whole aligned
> range. But in dax page fault, it will always be an aligned range.  So,
> we have to copy the whole range in this case.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/dax.c | 82 
>  1 file changed, 77 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 8d7e4e2cc0fb..b4fd3813457a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1038,6 +1038,61 @@ static int dax_iomap_direct_access(struct iomap 
> *iomap, loff_t pos, size_t size,
>   return rc;
>  }
>  
> +/**
> + * dax_iomap_cow_copy(): Copy the data from source to destination before 
> write.
> + * @pos: address to do copy from.
> + * @length:  size of copy operation.
> + * @align_size:  aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
> + * @srcmap:  iomap srcmap
> + * @daddr:   destination address to copy to.
> + *
> + * This can be called from two places. Either during DAX write fault, to copy
> + * the length size data to daddr. Or, while doing normal DAX write operation,
> + * dax_iomap_actor() might call this to do the copy of either start or end
> + * unaligned address. In this case the rest of the copy of aligned ranges is
> + * taken care by dax_iomap_actor() itself.

Er... what?  This description is very confusing to me.  /me reads the
code, and ...

OH.

Given a range (pos, length) and a mapping for a source file, this
function copies all the bytes between pos and (pos + length) to daddr if
the range is aligned to @align_size.  But if pos and length are not both
aligned to @align_size then it'll copy *around* the range, leaving the
area in the middle uncopied waiting for write_iter to fill it in with
whatever's in the iovec.

Yikes, this function is doing double duty and ought to be split into
two functions.

The first function does the COW work for a write fault to an mmap
region and does a straight copy.  Page faults are always aligned, so
this functionality is needed by dax_fault_actor.  Maybe this could be
named dax_fault_cow?

The second function does the prep COW work *around* a write so that we
always copy entire pages/blocks.  This cow-around code is needed by
dax_iomap_actor.  This should perhaps be named dax_iomap_cow_around()?
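
Roughly what I have in mind, as a userspace sketch with plain memcpy
instead of copy_mc_to_kernel and no error handling.  The toy_* names are
placeholders, and the sketch assumes saddr and daddr both point at the
start of the aligned block(s) containing pos, and that align_size is a
power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fault case: the range is already aligned, so copy all of it. */
static void toy_fault_cow(const char *saddr, char *daddr, size_t length)
{
	memcpy(daddr, saddr, length);
}

/*
 * Write case: copy only the bytes *around* [pos, pos + length) inside
 * the aligned range, leaving the middle for the write itself to fill.
 */
static void toy_cow_around(size_t pos, size_t length, size_t align_size,
			   const char *saddr, char *daddr)
{
	size_t head_off, end, aligned_end, tail_off;

	assert(align_size && (align_size & (align_size - 1)) == 0);

	head_off = pos & (align_size - 1);	/* bytes before pos */
	end = pos + length;
	aligned_end = (end + align_size - 1) & ~(align_size - 1);
	tail_off = head_off + length;		/* first byte after the write */

	if (head_off)
		memcpy(daddr, saddr, head_off);
	if (end < aligned_end)
		memcpy(daddr + tail_off, saddr + tail_off, aligned_end - end);
}
```

With an 8-byte alignment, pos 3 and length 7, the sketch copies bytes 0-2
and 10-15 of the 16-byte aligned range and leaves bytes 3-9 untouched.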

> + * Also, note DAX fault will always result in aligned pos and pos + length.
> + */
> +static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
> + struct iomap *srcmap, void *daddr)
> +{
> + loff_t head_off = pos & (align_size - 1);
> + size_t size = ALIGN(head_off + length, align_size);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, align_size);
> + bool copy_all = head_off == 0 && end == pg_end;
> + void *saddr = 0;
> + int ret = 0;
> +
> + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> + if (ret)
> + return ret;
> +
> + if (copy_all) {
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + return ret ? -EIO : 0;

I find it /very/ interesting that copy_mc_to_kernel takes an unsigned
int argument but returns an unsigned long (counting the bytes that
didn't get copied, oddly...but that's an existing API so I guess I'll
let it go.)

> + }
> +
> + /* Copy the head part of the range.  Note: we pass offset as length. */
> + if (head_off) {
> + ret = copy_mc_to_kernel(daddr, saddr, head_off);
> + if (ret)
> + return -EIO;
> + }
> +
> + /* Copy the tail part of the range */
> + if (end < pg_end) {
> + loff_t tail_off = head_off + length;
> + loff_t tail_len = pg_end - end;
> +
> + ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
> + tail_len);
> + if (ret)
> + return -EIO;
> + }
> + return 0;
> +}
> +
>  /*
>   * The user has performed a load from a hole in the file.  Allocating a new
>   * page in the file would cause excessive storage usage for workloads with
> @@ -1167,11 +1222,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   struct dax_device *dax_dev = iomap->dax_dev;
>   struct iov_iter *iter = data;
>   loff_t end = pos + length, done = 0;
> + bool write = iov_iter_rw(iter) == WRITE;
>   ssize_t ret = 0;
>   size_t xfer;
>   int id;
>  
> - if (iov_iter_rw(iter) == READ) {
> + if (!write) {
>

Re: [PATCH v2 2/3] fsdax: Factor helper: dax_fault_actor()

2021-04-08 Thread Darrick J. Wong

On Wed, Apr 07, 2021 at 09:38:22PM +0800, Shiyang Ruan wrote:
> The core logic in the two dax page fault functions is similar. So, move
> the logic into a common helper function. Also, to facilitate the
> addition of new features, such as CoW, switch-case is no longer used to
> handle different iomap types.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c | 294 ---
>  1 file changed, 148 insertions(+), 146 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index f843fb8fbbf1..6dea1fc11b46 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1054,6 +1054,66 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   return ret;
>  }
>  
> +#ifdef CONFIG_FS_DAX_PMD
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> + struct iomap *iomap, void **entry)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + unsigned long pmd_addr = vmf->address & PMD_MASK;
> + struct vm_area_struct *vma = vmf->vma;
> + struct inode *inode = mapping->host;
> + pgtable_t pgtable = NULL;
> + struct page *zero_page;
> + spinlock_t *ptl;
> + pmd_t pmd_entry;
> + pfn_t pfn;
> +
> + zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
> +
> + if (unlikely(!zero_page))
> + goto fallback;
> +
> + pfn = page_to_pfn_t(zero_page);
> + *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> + DAX_PMD | DAX_ZERO_PAGE, false);
> +
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
> +
> + ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> + if (!pmd_none(*(vmf->pmd))) {
> + spin_unlock(ptl);
> + goto fallback;
> + }
> +
> + if (pgtable) {
> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> + mm_inc_nr_ptes(vma->vm_mm);
> + }
> + pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
> + pmd_entry = pmd_mkhuge(pmd_entry);
> + set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> + spin_unlock(ptl);
> + trace_dax_pmd_load_hole(inode, vmf, zero_page, *entry);
> + return VM_FAULT_NOPAGE;
> +
> +fallback:
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> + trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
> + return VM_FAULT_FALLBACK;
> +}
> +#else
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> + struct iomap *iomap, void **entry)
> +{
> + return VM_FAULT_FALLBACK;
> +}
> +#endif /* CONFIG_FS_DAX_PMD */
> +
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
> @@ -1291,6 +1351,64 @@ static vm_fault_t dax_fault_cow_page(struct vm_fault 
> *vmf, struct iomap *iomap,
>   return ret;
>  }
>  
> +/**
> + * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
> + * @vmf: vm fault instance
> + * @pfnp:pfn to be returned
> + * @xas: the dax mapping tree of a file
> + * @entry:   an unlocked dax entry to be inserted
> + * @pmd: distinguish whether it is a pmd fault
> + * @flags:   iomap flags
> + * @iomap:   from iomap_begin()
> + * @srcmap:  from iomap_begin(), not equal to iomap if it is a CoW
> + */
> +static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
> + struct xa_state *xas, void **entry, bool pmd,
> + unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
> + loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> + bool write = vmf->flags & FAULT_FLAG_WRITE;
> + bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
> + unsigned long entry_flags = pmd ? DAX_PMD : 0;
> + int err = 0;
> + pfn_t pfn;
> +
> + /* if we are reading UNWRITTEN and HOLE, return a hole. */
> + if (!write &&
> + (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
> + if (!pmd)
> + return dax_load_hole(xas, mapping, entry, vmf);
> + else
> + return dax_pmd_load_hole(xas, vmf, iomap, entry);
> + }
> +
> + if (iomap->type != IOMAP_MAPPED) {
> + WARN_ON_ONCE(1);
> + return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
> + }
> +
> + err = dax_iomap_pfn(iomap, pos, size, &pfn);
> + if (err)
> + return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
> +
> + *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
> +   write && !sync);
> +
> + if (sync)
> +

Re: [PATCH v2] generic: test fiemap offsets and < 512 byte ranges

2021-04-07 Thread Darrick J. Wong

On Tue, Apr 06, 2021 at 03:54:29PM -0700, Boris Burkov wrote:
> btrfs trims fiemap extents to the inputted offset, which leads to
> inconsistent results for most inputs, and downright bizarre outputs like
> [7..6] when the trimmed extent is at the end of an extent and shorter
> than 512 bytes.
> 
> This test covers a bunch of cases like that and ensures that file
> systems always return the full extent without trimming it.
> 
> I also ran it under ext2, ext3, ext4, f2fs, and xfs successfully, but I
> suppose it's no guarantee that every file system will store a 4k synced
> write in a single extent. For that reason, this might be a bit fragile.

Does it work with 64k fs blocks?  Or 512b blocks? :)

Also... is there an xfs_io fix to go with this?

> 
> This test is fixed for btrfs by:
> btrfs: return whole extents in fiemap
> (https://lore.kernel.org/linux-btrfs/274e5bcebdb05a8969fc300b4802f33da2fbf218.1617746680.git.bo...@bur.io/)
> 
> Signed-off-by: Boris Burkov 
> 
> --
> v2: fill out copyright and test description
> 
> ---
>  tests/generic/623 | 94 +++
>  tests/generic/623.out |  2 +
>  tests/generic/group   |  1 +
>  3 files changed, 97 insertions(+)
>  create mode 100755 tests/generic/623
>  create mode 100644 tests/generic/623.out
> 
> diff --git a/tests/generic/623 b/tests/generic/623
> new file mode 100755
> index ..85ef68f6
> --- /dev/null
> +++ b/tests/generic/623
> @@ -0,0 +1,94 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2021 Facebook.  All Rights Reserved.
> +#
> +# FS QA Test 623
> +#
> +# Test fiemaps with offsets into small parts of extents.
> +# Expect to get the whole extent, anyway.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_require_test
> +_require_scratch
> +_require_xfs_io_command "fiemap"
> +
> +rm -f $seqres.full
> +
> +_do_fiemap() {
> + off=$1
> + len=$2
> + $XFS_IO_PROG -c "fiemap $off $len" $SCRATCH_MNT/foo
> +}
> +
> +_check_fiemap() {

Only helper functions in common/ need to be prefixed with "_".

> + off=$1
> + len=$2
> + actual=$(_do_fiemap $off $len | tee -a $seqres.full)
> + [ "$actual" == "$expected" ] || _fail "unexpected fiemap on $off $len"
> +}
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount

You could probably accomplish this by creating a file on the test
device, which means this test would run on configurations where there's
no scratch device; and run faster since there's no longer any need for
mkfs.

--D

> +
> +# write a file with one extent
> +$XFS_IO_PROG -f -s -c "pwrite -S 0xcf 0 4K" $SCRATCH_MNT/foo >/dev/null
> +
> +# since the exact extent location is unpredictable especially when
> +# varying file systems, just test that they are all equal, which is
> +# what we really expect.
> +expected=$(_do_fiemap)
> +
> +# start to mid-extent
> +_check_fiemap 0 2048
> +# start to end
> +_check_fiemap 0 4096
> +# start to past-end
> +_check_fiemap 0 4097
> +# mid-extent to mid-extent
> +_check_fiemap 1024 2048
> +# mid-extent to end
> +_check_fiemap 2048 4096
> +# mid-extent to past-end
> +_check_fiemap 2048 4097
> +
> +# to end; len < 512
> +_check_fiemap 4091 5
> +# to end; len == 512
> +_check_fiemap 3584 512
> +# past end; len < 512
> +_check_fiemap 4091 500
> +# past end; len == 512
> +_check_fiemap 4091 512
> +
> +_scratch_unmount
> +
> +echo "Silence is golden"
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/623.out b/tests/generic/623.out
> new file mode 100644
> index ..6f774f19
> --- /dev/null
> +++ b/tests/generic/623.out
> @@ -0,0 +1,2 @@
> +QA output created by 623
> +Silence is golden
> diff --git a/tests/generic/group b/tests/generic/group
> index b10fdea4..39e02383 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -625,3 +625,4 @@
>  620 auto mount quick
>  621 auto quick encrypt
>  622 auto shutdown metadata atime
> +623 auto quick
> -- 
> 2.30.2
>

[GIT PULL] iomap: fixes for 5.12-rc4

2021-03-18 Thread Darrick J. Wong

Hi Linus,

Please pull this single fix to the iomap code for 5.12-rc4, which fixes
some drama when someone gives us a {de,ma}liciously fragmented swap
file.

The branch merges cleanly with upstream as of a few minutes ago and has
been soaking in for-next for a week without complaints.  Please let me
know if there are any strange problems.

--D

The following changes since commit a38fd8748464831584a19438cbb3082b5a2dab15:

  Linux 5.12-rc2 (2021-03-05 17:33:41 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git iomap-5.12-fixes

for you to fetch changes up to 5808fecc572391867fcd929662b29c12e6d08d81:

  iomap: Fix negative assignment to unsigned sis->pages in iomap_swapfile_activate (2021-03-09 09:29:11 -0800)


Ritesh Harjani (1):
  iomap: Fix negative assignment to unsigned sis->pages in iomap_swapfile_activate

 fs/iomap/swapfile.c | 10 ++
 1 file changed, 10 insertions(+)

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-04 Thread Darrick J. Wong

On Tue, Mar 02, 2021 at 09:49:30AM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner  wrote:
> >
> > On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong  wrote:
> > > > > > > I really don't see why you seem to be telling us that invalidation is an
> > > > > > either/or choice. There's more ways to convert physical block
> > > > > > address -> inode file offset and mapping index than brute force
> > > > > > inode cache walks
> > > > >
> > > > > Yes, but I was trying to map it to an existing mechanism and the
> > > > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > > > needs to happen here.
> > > >
> > > > Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> > > > that function except for the invalidate_mapping_* call itself.  The goal
> > > > of this series is first to wire up a callback within both the block and
> > > > pmem subsystems so that they can take notifications and reverse-map them
> > > > through the storage stack until they reach an fs superblock.
> > >
> > > I'm chuckling because this "reverse map all the way up the block
> > > layer" is the opposite of what Dave said at the first reaction to my
> > > proposal, "can't the mm map pfns to fs inode  address_spaces?".
> >
> > Ah, no, I never said that the filesystem can't do reverse maps. I
> > was asking if the mm could directly (brute-force) invalidate PTEs
> > pointing at physical pmem ranges without needing to walk the inode
> > mappings. That would be far more efficient if it could be done

So, uh, /can/ the kernel brute-force invalidate PTEs when the pmem
driver says that something died?  Part of what's keeping me from putting
together a coherent vision for how this would work is my relative
unfamiliarity with all things mm/.

> > > Today whenever the pmem driver receives new corrupted range
> > > notification from the lower level nvdimm
> > > infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> > > associated with the pmem gendisk and then notifies userspace that
> > > there are new badblocks. This seems a perfect place to signal an upper
> > > level stacked block device that may also be watching disk->bb. Then
> > > each gendisk in a stacked topology is responsible for watching the
> > > badblock notifications of the next level and storing a remapped
> > > instance of those blocks until ultimately the filesystem mounted on
> > > the top-level block device is responsible for registering for those
> > > top-level disk->bb events.
> > >
> > > The device gone notification does not map cleanly onto 'struct badblocks'.
> >
> > Filesystems are not allowed to interact with the gendisk
> > infrastructure - that's for supporting the device side of a block
> > device. It's a layering violation, and many a filesystem developer
> > has been shouted at for trying to do this. At most we can peek
> > through it to query functionality support from the request queue,
> > but otherwise filesystems do not interact with anything under
> > bdev->bd_disk.
> 
> So lets add an api that allows the querying of badblocks by bdev and
> let the block core handle the bd_disk interaction. I see other block
> functionality like blk-integrity reaching through gendisk. The fs need
> not interact with the gendisk directly.

(I thought it was ok for block code to fiddle with other block
internals, and it's filesystems messing with block internals that was
prohibited?)

> > As it is, badblocks are used by devices to manage internal state.
> > e.g. md for recording stripes that need recovery if the system
> > crashes while they are being written out.
> 
> I know, I was there when it was invented which is why it was
> top-of-mind when pmem had a need to communicate badblocks. Other block
> drivers have threatened to use it for badblocks tracking, but none of
> those have carried through on that initial interest.

I hadn't realized that badblocks was bolted onto gendisk nowadays, I
mistakenly thought it was still something internal to md.

Looking over badblocks, I see a major drawback in that it can only
remember a single page's worth of badblocks records.

> > > If an upper level agent really cared about knowing about ->remove()
> > > events before they happened it could maybe do something like:

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-02 Thread Darrick J. Wong

On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
> 
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
> 
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
> 
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

I can't tell if we're conflating the "a bunch of your pmem went bad"
case with the "all your dimms fell out of the machine" case.

If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
of pmem, I certainly want that level of specificity.  Just notify the
users of the dead piece, don't flush the whole machine down the drain.

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see why you seem to be telling us that invalidation is an
> > either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks
> 
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

Yes.  XFS (with rmap enabled) can do all the iteration and walking in
that function except for the invalidate_mapping_* call itself.  The goal
of this series is first to wire up a callback within both the block and
pmem subsystems so that they can take notifications and reverse-map them
through the storage stack until they reach an fs superblock.

Once the information has reached XFS, it can use its own reverse
mappings to figure out which pages of which inodes are now targeted.
The future of DAX hw error handling can be that you throw the spitwad at
us, and it's our problem to distill that into mm invalidation calls.
XFS' reverse mapping data is indexed by storage location and isn't
sharded by address_space, so (except for the DIMMs falling out), we
don't need to walk the entire inode list or scan the entire mapping.
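
A minimal sketch of why a location-indexed reverse map avoids the full
inode walk, assuming a toy in-memory table sorted by physical address.
The names toy_rmap, toy_hit and toy_rmap_query are hypothetical and are
not XFS or kernel interfaces; the real rmap btree is considerably more
involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reverse-mapping record: one physical extent and its owner. */
struct toy_rmap {
	uint64_t phys;	/* first physical block of the extent */
	uint64_t len;	/* length in blocks */
	uint64_t ino;	/* owning inode number */
	uint64_t foff;	/* file offset (in blocks) that maps to phys */
};

/* One affected file range, reported back to the caller. */
struct toy_hit {
	uint64_t ino;
	uint64_t foff;
	uint64_t len;
};

/*
 * Find every record overlapping [bad_start, bad_start + bad_len).
 * recs[] must be sorted by phys and must not overlap.  Returns the
 * number of hits stored in out[], at most cap.
 */
static size_t toy_rmap_query(const struct toy_rmap *recs, size_t n,
			     uint64_t bad_start, uint64_t bad_len,
			     struct toy_hit *out, size_t cap)
{
	uint64_t bad_end = bad_start + bad_len;
	size_t lo = 0, hi = n, found = 0;

	assert(bad_len > 0);

	/* Binary search for the first record that ends after bad_start. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (recs[mid].phys + recs[mid].len <= bad_start)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* Walk forward only while records still start inside the bad range. */
	for (size_t i = lo; i < n && recs[i].phys < bad_end && found < cap; i++) {
		uint64_t s = recs[i].phys > bad_start ? recs[i].phys : bad_start;
		uint64_t end = recs[i].phys + recs[i].len;
		uint64_t e = end < bad_end ? end : bad_end;

		out[found].ino = recs[i].ino;
		out[found].foff = recs[i].foff + (s - recs[i].phys);
		out[found].len = e - s;
		found++;
	}
	return found;
}
```

The work done is proportional to the number of records touching the bad
range plus one binary search, not to the number of inodes in the cache.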

Between XFS and DAX and mm, the mm already has the invalidation calls,
xfs already has the distiller, and so all we need is that first bit.
The current mm code doesn't fully solve the problem, nor does it need
to, since it handles DRAM errors acceptably* already.

* Actually, the hwpoison code should _also_ be calling ->corrupted_range
when DRAM goes bad so that we can detect metadata failures and either
reload the buffer or (if it was dirty) shut down.

> >
> > .
> >
> > > > IOWs, what needs to happen at this point is very filesystem
> > > > specific. Assuming that "device unplug == filesystem dead" is not
> > > > correct, nor is specifying a generic action that assumes the
> > > > filesystem is dead because a device it is using went away.
> > >
> > > Ok, I think I set this discussion in the wrong direction implying any
> > > mapping of this action to a "filesystem dead" event. It's just a "zap
> > > all ptes" event and upper layers recover from there.
> >
> > Yes, that's exactly what ->corrupt_range() is intended for. It
> > allows the filesystem to lock out access to the bad range
> > and then recove

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Darrick J. Wong

On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> Hi, guys
> 
> Beside this patchset, I'd like to confirm something about the
> "EXPERIMENTAL" tag for dax in XFS.
> 
> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message
> when we mount a pmem device with the dax option, has existed for a
> while.  It's a bit annoying when using fsdax feature.  So, my initial
> intention was to remove this tag.  And I started to find out and solve
> the problems which prevent it from being removed.
> 
> As discussed before, there are 3 main problems.  The first one is "dax
> semantics", which has been resolved.  The remaining two are "RMAP for
> fsdax" and "support dax reflink for filesystem", which I have been
> working on.



> So, what I want to confirm is: does it mean that we can remove the
> "EXPERIMENTAL" tag when the remaining two problems are solved?

Yes.  I'd keep the experimental tag for a cycle or two to make sure that
nothing new pops up, but otherwise the two patchsets you've sent close
those two big remaining gaps.  Thank you for working on this!

> Or maybe there are other important problems need to be fixed before
> removing it?  If there are, could you please show me that?

That remains to be seen through QA/validation, but I think that's it.

Granted, I still have to read through the two patchsets...

--D

> 
> Thank you.
> 
> 
> --
> Ruan Shiyang.

Re: [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files

2021-02-25 Thread Darrick J. Wong

On Fri, Feb 26, 2021 at 08:20:27AM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() on two files.  Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/iomap/apply.c  | 51 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..fd2f8bde5791 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,54 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  
>   return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret;
> +
> + ret = ops->iomap_begin(ino1, pos1, length, 0, &smap, NULL);
> + if (ret)
> + goto out_src;
> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> +
> + ret = ops->iomap_begin(ino2, pos2, length, 0, &dmap, NULL);
> + if (ret)
> + goto out_dest;
> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> +
> + /* make sure extent length of two file is equal */
> + if (WARN_ON(smap.length != dmap.length)) {

Why not set smap.length and dmap.length to min(smap.length, dmap.length) ?

--D
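
A minimal sketch of the clamping suggested above, assuming a toy
stand-in for the two struct iomap fields involved.  toy_map and
toy_clamp_pair are hypothetical names, not part of the patch or of the
iomap API.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the struct iomap fields used here (hypothetical). */
struct toy_map {
	int64_t offset;	/* file offset where the mapping starts */
	int64_t length;	/* number of bytes the mapping covers */
};

/*
 * Trim both mappings to the shorter of the two lengths and return that
 * length, instead of failing when the lengths differ.
 */
static int64_t toy_clamp_pair(struct toy_map *smap, struct toy_map *dmap)
{
	int64_t len;

	/* Mirrors the zero-length WARN_ON checks in the patch. */
	assert(smap->length > 0 && dmap->length > 0);

	len = smap->length < dmap->length ? smap->length : dmap->length;
	smap->length = len;
	dmap->length = len;
	return len;
}
```

The caller would then hand the clamped length to the actor and loop for
whatever part of the range is left.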

> + written = -EIO;
> + goto out_dest;
> + }
> +
> + written = actor(ino1, pos1, ino2, pos2, length, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino2, pos2, length, 0, 0, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, length, 0, 0, &smap);
> +
> + return ret;
> +}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5bd3cac4df9c..913f98897a77 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -148,10 +148,15 @@ struct iomap_ops {
>   */
>  typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
>   void *data, struct iomap *iomap, struct iomap *srcmap);
> -
> +typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap);
>  loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
>   unsigned flags, const struct iomap_ops *ops, void *data,
>   iomap_actor_t actor);
> +loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
> + loff_t pos2, loff_t length, unsigned int flags,
> + const struct iomap_ops *ops, void *data, iomap_actor2_t actor);
>  
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>   const struct iomap_ops *ops);
> -- 
> 2.30.1
> 
> 
>

[GIT PULL] iomap: new code for 5.12-rc1

2021-02-18 Thread Darrick J. Wong

Hi Linus,

Please pull these new changes to the iomap code for 5.12.  The big
change in this cycle is some new code to make it possible for XFS to try
unaligned directio overwrites without taking locks.  If the block is
fully written and within EOF (i.e. doesn't require any further fs
intervention) then we can let the unlocked write proceed.  If not, we
fall back to synchronizing direct writes.
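
The decision described above can be sketched roughly as follows,
assuming a toy model in which the whole write range is covered by a
single mapping state.  toy_file, toy_overwrite_only and
toy_unaligned_dio_write are hypothetical names for illustration, not
the XFS or iomap functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

enum toy_state { TOY_HOLE, TOY_UNWRITTEN, TOY_WRITTEN };

/* Hypothetical file model: size, state of the target block, lock counter. */
struct toy_file {
	int64_t isize;		/* current end of file */
	enum toy_state state;	/* state of the block(s) being written */
	int excl_locks;		/* how often the exclusive path was taken */
};

/* Return 0 if the write is a pure overwrite within EOF, else -EAGAIN. */
static int toy_overwrite_only(const struct toy_file *f, int64_t pos, int64_t len)
{
	assert(pos >= 0 && len > 0);

	if (pos + len > f->isize)
		return -EAGAIN;	/* would extend the file */
	if (f->state != TOY_WRITTEN)
		return -EAGAIN;	/* needs allocation or unwritten conversion */
	return 0;
}

/* Try the unlocked path first; retry under the exclusive lock on -EAGAIN. */
static int toy_unaligned_dio_write(struct toy_file *f, int64_t pos, int64_t len)
{
	int ret = toy_overwrite_only(f, pos, len);

	if (ret != -EAGAIN)
		return ret;

	f->excl_locks++;	/* stand-in for the serialized fallback */
	return 0;
}
```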

Note that the btrfs developers have been working on supporting zoned
block devices, and their 5.12 pull request has a single iomap patch to
adjust directio writes to support REQ_OP_APPEND.

The branch merges cleanly with 5.11 and has been soaking in for-next for
quite a while now.  Please let me know if there are any strange
problems.  It's been a pretty quiet cycle, so I don't anticipate any
more iomap pulls other than whatever new bug fixes show up.

--D (whose pull requests are delayed by last weekend's wild ride :( )

The following changes since commit 19c329f6808995b142b3966301f217c831e7cf31:

  Linux 5.11-rc4 (2021-01-17 16:37:05 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git tags/iomap-5.12-merge-2

for you to fetch changes up to ed1128c2d0c87e5ff49c40f5529f06bc35f4251b:

  xfs: reduce exclusive locking on unaligned dio (2021-02-01 09:47:19 -0800)


New code for 5.12:
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN if
  the write is not a pure overwrite within EOF; this will be used to
  reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for unaligned
  direct writes if the circumstances permit


Christoph Hellwig (9):
  iomap: rename the flags variable in __iomap_dio_rw
  iomap: pass a flags argument to iomap_dio_rw
  iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
  xfs: factor out a xfs_ilock_iocb helper
  xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
  xfs: cleanup the read/write helper naming
  xfs: remove the buffered I/O fallback assert
  xfs: simplify the read/write tracepoints
  xfs: improve the reflink_bounce_dio_write tracepoint

Dave Chinner (2):
  xfs: split the unaligned DIO write code out
  xfs: reduce exclusive locking on unaligned dio

 fs/btrfs/file.c   |   7 +-
 fs/ext4/file.c|   5 +-
 fs/gfs2/file.c|   7 +-
 fs/iomap/direct-io.c  |  26 ++--
 fs/xfs/xfs_file.c | 351 --
 fs/xfs/xfs_iomap.c|  29 +++--
 fs/xfs/xfs_trace.h|  22 ++--
 fs/zonefs/super.c |   4 +-
 include/linux/iomap.h |  18 ++-
 9 files changed, 269 insertions(+), 200 deletions(-)

Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-18 Thread Darrick J. Wong

On Wed, Feb 17, 2021 at 11:24:18AM +0800, Ruan Shiyang wrote:
> 
> 
> On 2021/2/10 下午9:19, Christoph Hellwig wrote:
> > On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:
> > > 
> > > 
> > > On 2021/2/9 下午5:34, Christoph Hellwig wrote:
> > > > On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
> > > > > > The dax dedupe comparison needs the iomap_ops pointer as an argument,
> > > > > > so my understanding is that we don't modify the argument list of
> > > > > > generic_remap_file_range_prep(), but move its code into
> > > > > > __generic_remap_file_range_prep(), whose argument list can be modified
> > > > > > to accept the iomap_ops pointer.  Then it looks like this:
> > > > 
> > > > I'd say just add the iomap_ops pointer to
> > > > generic_remap_file_range_prep and do away with the extra wrappers.  We
> > > > only have three callers anyway.
> > > 
> > > OK.
> > 
> > So looking at this again I think your proposal actually is better,
> > given that the iomap variant is still DAX specific.  Sorry for
> > the noise.
> > 
> > Also I think dax_file_range_compare should use iomap_apply instead
> > of open coding it.
> > 
> 
> There are two files here, which are not reflinked, and both need
> direct_access().  iomap_apply() can handle only one file at a time, so it
> seems that iomap_apply() is not suitable for this case...
> 
> 
> The pseudo code of this process is as follows:
> 
>   srclen = ops->begin(&srcmap)
>   destlen = ops->begin(&destmap)
> 
>   direct_access(&srcmap, &saddr)
>   direct_access(&destmap, &daddr)
> 
>   same = !memcmp(saddr, daddr, min(srclen,destlen))
> 
>   ops->end(&destmap)
>   ops->end(&srcmap)
> 
> I think a nested call like this is necessary.  That's why I use the open
> code way.

This might be a good place to implement an iomap_apply2() loop that
actually /does/ walk all the extents of file1 and file2.  There's now
two users of this idiom.

(Possibly structured as a "get next mappings from both" generator
function like Matthew Wilcox keeps asking for. :))

--D
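
A rough sketch of what such a "get next mappings from both" generator
could look like, assuming each file's extents are given as a sorted,
hole-free array.  toy_extent, toy_pair_iter and toy_pair_next are
hypothetical names; nothing here is an existing iomap interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_extent {
	int64_t offset;	/* file offset where the extent starts */
	int64_t length;	/* bytes covered by the extent */
};

/* Cursor over two files; a/b are sorted, contiguous extent arrays. */
struct toy_pair_iter {
	const struct toy_extent *a, *b;
	size_t na, nb;		/* number of extents in a and b */
	size_t ia, ib;		/* current extent index in a and b */
	int64_t pos_a, pos_b;	/* current file positions */
	int64_t remaining;	/* bytes left to walk */
};

struct toy_chunk {
	int64_t pos_a, pos_b, len;
};

/* Bytes left in the extent containing pos, or 0 if we ran off the end. */
static int64_t toy_left(const struct toy_extent *e, size_t n, size_t *i,
			int64_t pos)
{
	while (*i < n && e[*i].offset + e[*i].length <= pos)
		(*i)++;
	if (*i == n)
		return 0;
	assert(e[*i].offset <= pos);	/* no holes in this toy model */
	return e[*i].offset + e[*i].length - pos;
}

/*
 * Produce the next chunk over which both files have a single mapping.
 * Returns 1 and fills *out, or 0 when the walk is finished.
 */
static int toy_pair_next(struct toy_pair_iter *it, struct toy_chunk *out)
{
	int64_t la, lb, len;

	if (it->remaining <= 0)
		return 0;

	la = toy_left(it->a, it->na, &it->ia, it->pos_a);
	lb = toy_left(it->b, it->nb, &it->ib, it->pos_b);
	if (la == 0 || lb == 0)
		return 0;

	/* The chunk ends where the first of the two extents ends. */
	len = la < lb ? la : lb;
	if (len > it->remaining)
		len = it->remaining;

	out->pos_a = it->pos_a;
	out->pos_b = it->pos_b;
	out->len = len;

	it->pos_a += len;
	it->pos_b += len;
	it->remaining -= len;
	return 1;
}
```

A comparison actor would be called once per chunk, so differing extent
boundaries in the two files split the work rather than aborting it.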

> 
> --
> Thanks,
> Ruan Shiyang.
> > 
> 
>

Re: Unexpected reflink/subvol snapshot behaviour

2021-02-01 Thread Darrick J. Wong

On Fri, Jan 22, 2021 at 09:20:51AM +1100, Dave Chinner wrote:
> Hi btrfs-gurus,
> 
> I'm running a simple reflink/snapshot/COW scalability test at the
> moment. It is just a loop that does "fio overwrite of 10,000 4kB
> random direct IOs in a 4GB file; snapshot" and I want to check a
> couple of things I'm seeing with btrfs. fio config file is appended
> to the email.
> 
> Firstly, what is the expected "space amplification" of such a
> workload over 1000 iterations on btrfs? This will write 40GB of user
> data, and I'm seeing btrfs consume ~220GB of space for the workload
> regardless of whether I use subvol snapshot or file clones
> (reflink).  That's a space amplification of ~5.5x (a lot!) so I'm
> wondering if this is expected or whether there's something else
> going on. XFS amplification for 1000 iterations using reflink is
> only 1.4x, so 5.5x seems somewhat excessive to me.
> 
> On a similar note, the IO bandwidth consumed by btrfs is way out of
> proportion with the amount of user data being written. I'm seeing
> multiple GBs being written by btrfs on every iteration - easily
> exceeding 5GB of writes per cycle in the later iterations of the
> test. Given that only 40MB of user data is being written per cycle,
> there's a write amplification factor of well over 100x occurring
> here. In comparison, XFS is writing roughly consistently at 80MB/s
> to disk over the course of the entire workload, largely because of
> journal traffic for the transactions run during COW and clone
> operations.  Is such a huge amount of IO expected for btrfs in
> this situation?



> FYI, I've compared btrfs reflink to XFS reflink, too, and XFS fio
> performance stays largely consistent across all 1000 iterations at
> around 13-14k +/-2k IOPS. The reflink time also scales linearly with
> the number of extents in the source file and levels off at about
> 10-11s per cycle as the extent count in the source file levels off
> at ~850,000 extents. XFS completes the 1000 iterations of
> write/clone in about 4 hours, btrfs completes the same part of the
> workload in about 9 hours.

Just out of curiosity, do any of the patches in [1] improve those
numbers for xfs?  As you noted a long time ago, the transaction
reservations are kind of huge, so I fixed those and shook out a few
other warts while I was at it.

--D

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=reflink-speedups
> 
> Oh, I almost forget - FIEMAP performance. After the reflink test, I
> map all the extents in all the cloned files to a) count the extents
> and b) confirm that the difference between clones is correct (~1
> extents not shared with the previous iteration). Pulling the extent
> maps out of XFS takes about 3s a clone (~850,000 extents), or 30
> minutes for the whole set when run serialised. btrfs takes 90-100s
> per clone - after 8 hours it had only managed to map 380 files and
> was running at 6-7000 read IOPS the entire time. IOWs, it was taking
> _half a million_ read IOs to map the extents of a single clone that
> only had a million extents in it. Is it expected that FIEMAP is so
> slow and IO intensive on cloned files?
> 
> As there are no performance anomalies or memory reclaim issues with
> XFS running this workload, I suspect the issues I note above are
> btrfs issues, not expected behaviour.  I'm not sure what the
> expected scalability of btrfs file clones and snapshots are though,
> so I'm interested to hear if these results are expected or not.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> JOBS=4
> IODEPTH=4
> IOCOUNT=$((10000 / $JOBS))
> FILESIZE=4g
> 
> cat >$fio_config <<EOF
> [global]
> name=${DST}.name
> directory=${DST}
> size=${FILESIZE}
> randrepeat=0
> bs=4k
> ioengine=libaio
> iodepth=${IODEPTH}
> iodepth_low=2
> direct=1
> end_fsync=1
> fallocate=none
> overwrite=1
> number_ios=${IOCOUNT}
> runtime=30s
> group_reporting=1
> disable_lat=1
> lat_percentiles=0
> clat_percentiles=0
> slat_percentiles=0
> disk_util=0
> 
> [j1]
> filename=testfile
> rw=randwrite
> 
> [j2]
> filename=testfile
> rw=randwrite
> 
> [j3]
> filename=testfile
> rw=randwrite
> 
> [j4]
> filename=testfile
> rw=randwrite
> EOF
>

Re: [PATCH v11 02/40] iomap: support REQ_OP_ZONE_APPEND

2021-01-04 Thread Darrick J. Wong

On Tue, Dec 22, 2020 at 12:48:55PM +0900, Naohiro Aota wrote:
> A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
> max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
> such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
> REQ_OP_ZONE_APPEND.
> 
> To utilize it, we need to set the bio_op before calling
> bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
> that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
> and restricted bio.
> 
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Naohiro Aota 

Looks fine to me too,
Reviewed-by: Darrick J. Wong 

Do the authors intend this one patch to go into 5.12 via the iomap tree?

--D

> ---
>  fs/iomap/direct-io.c  | 43 +--
>  include/linux/iomap.h |  1 +
>  2 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 933f234d5bec..2273120d8ed7 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -201,6 +201,34 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap *iomap, loff_t pos,
>   iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>  
> +/*
> + * Figure out the bio's operation flags from the dio request, the
> + * mapping, and whether or not we want FUA.  Note that we can end up
> + * clearing the WRITE_FUA flag in the dio request.
> + */
> +static inline unsigned int
> +iomap_dio_bio_opflags(struct iomap_dio *dio, struct iomap *iomap, bool use_fua)
> +{
> + unsigned int opflags = REQ_SYNC | REQ_IDLE;
> +
> + if (!(dio->flags & IOMAP_DIO_WRITE)) {
> + WARN_ON_ONCE(iomap->flags & IOMAP_F_ZONE_APPEND);
> + return REQ_OP_READ;
> + }
> +
> + if (iomap->flags & IOMAP_F_ZONE_APPEND)
> + opflags |= REQ_OP_ZONE_APPEND;
> + else
> + opflags |= REQ_OP_WRITE;
> +
> + if (use_fua)
> + opflags |= REQ_FUA;
> + else
> + dio->flags &= ~IOMAP_DIO_WRITE_FUA;
> +
> + return opflags;
> +}
> +
>  static loff_t
>  iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   struct iomap_dio *dio, struct iomap *iomap)
> @@ -208,6 +236,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   unsigned int blkbits = blksize_bits(bdev_logical_block_size(iomap->bdev));
>   unsigned int fs_block_size = i_blocksize(inode), pad;
>   unsigned int align = iov_iter_alignment(dio->submit.iter);
> + unsigned int bio_opf;
>   struct bio *bio;
>   bool need_zeroout = false;
>   bool use_fua = false;
> @@ -263,6 +292,13 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   iomap_dio_zero(dio, iomap, pos - pad, pad);
>   }
>  
> + /*
> +  * Set the operation flags early so that bio_iov_iter_get_pages
> +  * can set up the page vector appropriately for a ZONE_APPEND
> +  * operation.
> +  */
> + bio_opf = iomap_dio_bio_opflags(dio, iomap, use_fua);
> +
>   do {
>   size_t n;
>   if (dio->error) {
> @@ -278,6 +314,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   bio->bi_ioprio = dio->iocb->ki_ioprio;
>   bio->bi_private = dio;
>   bio->bi_end_io = iomap_dio_bio_end_io;
> + bio->bi_opf = bio_opf;
>  
>   ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
>   if (unlikely(ret)) {
> @@ -293,14 +330,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  
>   n = bio->bi_iter.bi_size;
>   if (dio->flags & IOMAP_DIO_WRITE) {
> - bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
> - if (use_fua)
> - bio->bi_opf |= REQ_FUA;
> - else
> - dio->flags &= ~IOMAP_DIO_WRITE_FUA;
>   task_io_account_write(n);
>   } else {
> - bio->bi_opf = REQ_OP_READ;
>   if (dio->flags & IOMAP_DIO_DIRTY)
>   bio_set_pages_dirty(bio);
>   }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5bd3cac4df9c..8ebb1fa6f3b7 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -55,6 +55,7 @@ struct vm_fault;
>  #define IOMAP_F_SHARED   0x04
>  #define IOMAP_F_MERGED   0x08
>  #define IOMAP_F_BUFFER_HEAD  0x10
> +#define IOMAP_F_ZONE_APPEND  0x20
>  
>  /*
>   * Flags set by the core iomap code during operations:
> -- 
> 2.27.0
>

Re: [PATCH 2/2] btrfs: Make btrfs_direct_write atomic with respect to inode_lock

2020-12-15 Thread Darrick J. Wong

On Tue, Dec 15, 2020 at 12:06:36PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> btrfs_direct_write() falls back to buffered write in case btrfs is not
> able to perform or complete a direct I/O. During the fallback
> inode lock is unlocked and relocked. This does not guarantee the
> atomicity of the entire write since the lock can be acquired by another
> write between unlock and relock.
> 
> __btrfs_buffered_write() is used to perform the direct fallback write,
> which performs the write without acquiring the lock or checks.

Er... can you grab the inode lock before deciding which of the IO
path(s) you're going to take?  Then you'd always have an atomic write
even if fallback happens.

(Also vaguely wondering why this needs even more slicing and dicing of
the iomap directio functions...)

--D
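
A minimal sketch of the ordering suggested above: take the lock once,
pick the I/O path, and drop the lock only after any fallback has
finished.  The toy_* names are hypothetical, and the "lock" is a plain
counter so the example stays self-contained; it is not the btrfs
locking code.

```c
#include <assert.h>

/* Hypothetical inode with a toy lock that records how often it cycled. */
struct toy_inode {
	int locked;
	int lock_cycles;
};

static void toy_lock(struct toy_inode *i)
{
	assert(!i->locked);
	i->locked = 1;
	i->lock_cycles++;
}

static void toy_unlock(struct toy_inode *i)
{
	assert(i->locked);
	i->locked = 0;
}

/* Direct path: returns bytes written, or -1 to request a fallback. */
static long toy_direct(struct toy_inode *i, long len, int can_dio)
{
	assert(i->locked);
	return can_dio ? len : -1;
}

/* Buffered path: must run with the lock already held. */
static long toy_buffered(struct toy_inode *i, long len)
{
	assert(i->locked);
	return len;
}

/* One lock/unlock pair covers both the attempt and the fallback. */
static long toy_write(struct toy_inode *i, long len, int can_dio)
{
	long ret;

	toy_lock(i);
	ret = toy_direct(i, len, can_dio);
	if (ret < 0)
		ret = toy_buffered(i, len);
	toy_unlock(i);
	return ret;
}
```

Because the lock is never dropped between the two paths, no other
writer can slip in between the failed direct attempt and the fallback.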

> 
> fa54fc76db94 ("btrfs: push inode locking and unlocking into buffered/direct write")
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/file.c | 69 -
>  1 file changed, 40 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0e41459b8de6..9fc768b951f1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1638,11 +1638,11 @@ static int btrfs_write_check(struct kiocb *iocb, struct iov_iter *from,
>   return 0;
>  }
>  
> -static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> +static noinline ssize_t __btrfs_buffered_write(struct kiocb *iocb,
>  struct iov_iter *i)
>  {
>   struct file *file = iocb->ki_filp;
> - loff_t pos;
> + loff_t pos = iocb->ki_pos;
>   struct inode *inode = file_inode(file);
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
>   struct page **pages = NULL;
> @@ -1656,24 +1656,9 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>   bool only_release_metadata = false;
>   bool force_page_uptodate = false;
>   loff_t old_isize = i_size_read(inode);
> - unsigned int ilock_flags = 0;
> -
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - ilock_flags |= BTRFS_ILOCK_TRY;
> -
> - ret = btrfs_inode_lock(inode, ilock_flags);
> - if (ret < 0)
> - return ret;
> -
> - ret = generic_write_checks(iocb, i);
> - if (ret <= 0)
> - goto out;
>  
> - ret = btrfs_write_check(iocb, i, ret);
> - if (ret < 0)
> - goto out;
> + lockdep_assert_held(&inode->i_rwsem);
>  
> - pos = iocb->ki_pos;
>   nrptrs = min(DIV_ROUND_UP(iov_iter_count(i), PAGE_SIZE),
>   PAGE_SIZE / (sizeof(struct page *)));
>   nrptrs = min(nrptrs, current->nr_dirtied_pause - current->nr_dirtied);
> @@ -1877,10 +1862,37 @@ static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
>   iocb->ki_pos += num_written;
>   }
>  out:
> - btrfs_inode_unlock(inode, ilock_flags);
>   return num_written ? num_written : ret;
>  }
>  
> +static noinline ssize_t btrfs_buffered_write(struct kiocb *iocb,
> +struct iov_iter *i)
> +{
> + struct inode *inode = file_inode(iocb->ki_filp);
> + unsigned int ilock_flags = 0;
> + ssize_t ret;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + ilock_flags |= BTRFS_ILOCK_TRY;
> +
> + ret = btrfs_inode_lock(inode, ilock_flags);
> + if (ret < 0)
> + return ret;
> +
> + ret = generic_write_checks(iocb, i);
> + if (ret <= 0)
> + goto out;
> +
> + ret = btrfs_write_check(iocb, i, ret);
> + if (ret < 0)
> + goto out;
> +
> + ret = __btrfs_buffered_write(iocb, i);
> +out:
> + btrfs_inode_unlock(inode, ilock_flags);
> + return ret;
> +}
> +
>  static ssize_t check_direct_IO(struct btrfs_fs_info *fs_info,
>  const struct iov_iter *iter, loff_t offset)
>  {
> @@ -1927,10 +1939,8 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   }
>  
>   err = btrfs_write_check(iocb, from, err);
> - if (err < 0) {
> - btrfs_inode_unlock(inode, ilock_flags);
> + if (err < 0)
>   goto out;
> - }
>  
>   pos = iocb->ki_pos;
>   /*
> @@ -1944,22 +1954,19 @@ static ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>   goto relock;
>   }
>  
> - if (check_direct_IO(fs_info, from, pos)) {
> - btrfs_inode_unlock(inode, ilock_flags);
> + if (check_direct_IO(fs_info, from, pos))
>   goto buffered;
> - }
>  
>   dio = __iomap_dio_rw(iocb, from, &btrfs_dio_iomap_ops,
>&btrfs_dio_ops, is_sync_kiocb(iocb));
>  
> - btrfs_inode_unlock(inode, ilock_flags);
> -
>   if (IS_ERR_OR_NULL(dio)) {
>   err = PTR_ERR_OR_ZERO(dio);
>   if (err < 0 && err != -ENOTBLK)
>   goto out;
>   }

Re: [PATCH 3/4] btrfs: exclude mmaps while doing remap

2020-12-15 Thread Darrick J. Wong

On Mon, Dec 14, 2020 at 01:19:40PM -0500, Josef Bacik wrote:
> Darrick reported a potential issue to me where we could allow mmap
> writes after validating a page range matched in the case of dedupe.
> Generally we rely on lock page -> lock extent with the ordered flush to
> protect us, but this is done after we check the pages because we use the
> generic helpers, so we could modify the page in between doing the check
> and locking the range.

FWIW I only found that via code inspection because Matthew Wilcox asked
me about whether or not filesystems did the right thing.  I wrote the
attached fstest to try to demonstrate the problem on btrfs but it
actually passes on xfs and btrfs.  This means either that (a) btrfs is
doing it right through some other means because I don't understand its
locking or (b) the test is wrong.

The test /does/ explode as expected on ocfs2 because mmap io lol there.

--D

From: Darrick J. Wong 
Subject: [PATCH] generic: test file writers racing with FIDEDUPERANGE

Create a test to make sure that dedupe actually locks the file ranges
correctly before starting the content comparison and keeps them locked
until the operation completes.

Signed-off-by: Darrick J. Wong 
---
 src/Makefile  |2 
 src/deduperace.c  |  370 +
 tests/generic/949 |   51 +++
 tests/generic/949.out |2 
 tests/generic/group   |1 
 5 files changed, 425 insertions(+), 1 deletion(-)
 create mode 100644 src/deduperace.c
 create mode 100755 tests/generic/949
 create mode 100644 tests/generic/949.out

diff --git a/src/Makefile b/src/Makefile
index 5b9781c5..2875e8ef 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -21,7 +21,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
-   locktest unwritten_mmap bulkstat_unlink_test \
+   locktest unwritten_mmap bulkstat_unlink_test deduperace \
bulkstat_unlink_test_modified t_dir_offset t_futimens t_immutable \
stale_handle pwrite_mmap_blocked t_dir_offset2 seek_sanity_test \
seek_copy_test t_readdir_1 t_readdir_2 fsync-tester nsexec cloner \
diff --git a/src/deduperace.c b/src/deduperace.c
new file mode 100644
index ..88c0b6b9
--- /dev/null
+++ b/src/deduperace.c
@@ -0,0 +1,370 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020 Oracle.
+ * Author: Darrick J. Wong 
+ *
+ * Race pwrite/mwrite with dedupe to see if we got the locking right.
+ *
+ * File writes and mmap writes should not be able to change the src_fd's
+ * contents after dedupe prep has verified that the file contents are the same.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define GOOD_BYTE  0x58
+#define BAD_BYTE   0x66
+
+#ifndef FIDEDUPERANGE
+/* extent-same (dedupe) ioctls; these MUST match the btrfs ioctl definitions */
+#define FILE_DEDUPE_RANGE_SAME 0
+#define FILE_DEDUPE_RANGE_DIFFERS  1
+
+/* from struct btrfs_ioctl_file_extent_same_info */
+struct file_dedupe_range_info {
+   __s64 dest_fd;  /* in - destination file */
+   __u64 dest_offset;  /* in - start of extent in destination */
+   __u64 bytes_deduped;/* out - total # of bytes we were able
+* to dedupe from this file. */
+   /* status of this dedupe operation:
+* < 0 for error
+* == FILE_DEDUPE_RANGE_SAME if dedupe succeeds
+* == FILE_DEDUPE_RANGE_DIFFERS if data differs
+*/
+   __s32 status;   /* out - see above description */
+   __u32 reserved; /* must be zero */
+};
+
+/* from struct btrfs_ioctl_file_extent_same_args */
+struct file_dedupe_range {
+   __u64 src_offset;   /* in - start of extent in source */
+   __u64 src_length;   /* in - length of extent */
+   __u16 dest_count;   /* in - total elements in info array */
+   __u16 reserved1;/* must be zero */
+   __u32 reserved2;/* must be zero */
+   struct file_dedupe_range_info info[0];
+};
+#define FIDEDUPERANGE  _IOWR(0x94, 54, struct file_dedupe_range)
+#endif /* FIDEDUPERANGE */
+
+static int fd1, fd2;
+static loff_t offset = 37; /* Nice low offset to trick the compare */
+static loff_t blksz;
+
+/* Continuously dirty the pagecache for the region being dupe-tested. */
+void *
+mwriter(
+   void*data)
+{
+   volatile char   *p;
+
+   p = mmap(NULL, blksz, PROT_WRITE, MAP_SHARED, fd1, 0);
+   if (p == MAP_FAILED) {
+   perror("mmap");
+   exit(2);
+   }
+
+   while (1) {
+   *(p + offset) = BAD_BYTE;
+   *(p + offset) = GOOD_BYTE;
+   }
+}
+
+/* Contin

Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Darrick J. Wong

On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> On 2019-10-21, Darrick J. Wong  wrote:
> > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval 
> > > 
> > > Btrfs supports transparent compression: data written by the user can be
> > > compressed when written to disk and decompressed when read back.
> > > However, we'd like to add an interface to write pre-compressed data
> > > directly to the filesystem, and the matching interface to read
> > > compressed data without decompressing it. This adds support for
> > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > 
> > > > A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > is used for metadata: namely, the compression algorithm, unencoded
> > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > should be used (needed for truncated or hole-punched extents and when
> > > reading in the middle of an extent). For reads, the filesystem returns
> > > this information; for writes, the caller provides it to the filesystem.
> > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > used to extend the interface in the future. The remaining iovecs contain
> > > the encoded extent.
> > > 
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > > 
> > > Signed-off-by: Omar Sandoval 
> > > ---
> > >  include/linux/fs.h  | 14 +++
> > >  include/uapi/linux/fs.h | 26 -
> > >  mm/filemap.c| 82 ++---
> > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index e0d909d35763..54681f21e05e 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> > > offset,
> > >  /* File does not contribute to nr_files count */
> > >  #define FMODE_NOACCOUNT  ((__force fmode_t)0x2000)
> > >  
> > > +/* File supports encoded IO */
> > > +#define FMODE_ENCODED_IO ((__force fmode_t)0x4000)
> > > +
> > >  /*
> > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > >   * that indicates that they should check the contents of the iovec are
> > > @@ -314,6 +317,7 @@ enum rw_hint {
> > >  #define IOCB_SYNC(1 << 5)
> > >  #define IOCB_WRITE   (1 << 6)
> > >  #define IOCB_NOWAIT  (1 << 7)
> > > +#define IOCB_ENCODED (1 << 8)
> > >  
> > >  struct kiocb {
> > >   struct file *ki_filp;
> > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, 
> > > int);
> > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
> > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > + struct iov_iter *);
> > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > >   struct file *file_out, loff_t pos_out,
> > >   loff_t *count, unsigned int remap_flags);
> > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb 
> > > *ki, rwf_t flags)
> > >   return -EOPNOTSUPP;
> > >   ki->ki_flags |= IOCB_NOWAIT;
> > >   }
> > > + if (flags & RWF_ENCODED) {
> > > + if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > + return -EOPNOTSUPP;
> > > + ki->ki_flags |= IOCB_ENCODED;
> > > + }
> > >   if (flags & RWF_HIPRI)
> > >   ki->ki_flags |= IOCB_HIPRI;
> > >   if (flags & RWF_DSYNC)
> > > diff --git a/include/uapi/

Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Darrick J. Wong

On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flag indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval 
> ---
>  include/linux/fs.h  | 14 +++
>  include/uapi/linux/fs.h | 26 -
>  mm/filemap.c| 82 ++---
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT  ((__force fmode_t)0x2000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO ((__force fmode_t)0x4000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC(1 << 5)
>  #define IOCB_WRITE   (1 << 6)
>  #define IOCB_NOWAIT  (1 << 7)
> +#define IOCB_ENCODED (1 << 8)
>  
>  struct kiocb {
>   struct file *ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> + struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>   struct file *file_out, loff_t pos_out,
>   loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags)
>   return -EOPNOTSUPP;
>   ki->ki_flags |= IOCB_NOWAIT;
>   }
> + if (flags & RWF_ENCODED) {
> + if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> + return -EOPNOTSUPP;
> + ki->ki_flags |= IOCB_ENCODED;
> + }
>   if (flags & RWF_HIPRI)
>   ki->ki_flags |= IOCB_HIPRI;
>   if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> + ENCODED_IOV_COMPRESSION_NONE,
> + ENCODED_IOV_COMPRESSION_ZLIB,
> + ENCODED_IOV_COMPRESSION_LZO,
> + ENCODED_IOV_COMPRESSION_ZSTD,
> + ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> + ENCODED_IOV_ENCRYPTION_NONE,
> + ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> + __u64 len;
> + __u64 unencoded_len;
> + __u64 unencoded_offset;
> + __u32 compression;
> + __u32 encryption;

Can we add some must-be-zero padding space at the end here for whomever
comes along next wanting to add more encoding info?

(And maybe a manpage and some basic testing, to reiterate Dave...)
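
Something along these lines is what I have in mind, as a rough userspace
sketch only.  The reserved[] array, its size, and the toy_* names are
assumptions for illustration and are not part of the posted patch; the
idea is that today's kernel rejects nonzero reserved space so that a
later kernel can assign meaning to it:

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: the patch's fields plus illustrative reserved space. */
struct toy_encoded_iov {
	uint64_t len;
	uint64_t unencoded_len;
	uint64_t unencoded_offset;
	uint32_t compression;
	uint32_t encryption;
	uint64_t reserved[4];	/* assumption: must be zero for now */
};

/* Reject any request that sets bits this version does not understand. */
static int toy_check_encoded_iov(const struct toy_encoded_iov *e)
{
	static const uint64_t zeroes[4];

	if (memcmp(e->reserved, zeroes, sizeof(zeroes)) != 0)
		return -EINVAL;
	return 0;
}
```

A caller that zero-fills the struct before setting the fields it knows
about keeps working when the reserved space is later given a meaning.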

--D

> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI((__force __kernel_rwf_t)0x0001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND   ((__force __kernel_rwf_t)0x

Re: [PATCH 01/15] iomap: Use a srcmap for a read-modify-write I/O

2019-09-05 Thread Darrick J. Wong

On Thu, Sep 05, 2019 at 10:06:36AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> A preparation patch for copy-on-write (CoW).
> The srcmap is used to identify where the read is to be performed
> from. This is passed to iomap->begin() of the respective
> filesystem, which is supposed to put in the details for
> reading before performing the copy for CoW.
> 
> Signed-off-by: Goldwyn Rodrigues 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c   |  8 +---
>  fs/ext2/inode.c|  2 +-
>  fs/ext4/inode.c|  2 +-
>  fs/gfs2/bmap.c |  3 ++-
>  fs/iomap/apply.c   |  5 +++--
>  fs/iomap/buffered-io.c | 14 +++---
>  fs/iomap/direct-io.c   |  2 +-
>  fs/iomap/fiemap.c  |  4 ++--
>  fs/iomap/seek.c|  4 ++--
>  fs/iomap/swapfile.c|  3 ++-
>  fs/xfs/xfs_iomap.c |  9 ++---
>  include/linux/iomap.h  |  5 +++--
>  12 files changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 6bf81f931de3..e961d8dc23ef 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1090,7 +1090,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> - struct iomap *iomap)
> + struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct block_device *bdev = iomap->bdev;
>   struct dax_device *dax_dev = iomap->dax_dev;
> @@ -1248,6 +1248,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>   unsigned long vaddr = vmf->address;
>   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   unsigned flags = IOMAP_FAULT;
>   int error, major = 0;
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -1292,7 +1293,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>* the file system block size to be equal the page size, which means
>* that we never have to deal with more than a single extent here.
>*/
> - error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> + error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
>   if (iomap_errp)
>   *iomap_errp = error;
>   if (error) {
> @@ -1472,6 +1473,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>   struct inode *inode = mapping->host;
>   vm_fault_t result = VM_FAULT_FALLBACK;
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   pgoff_t max_pgoff;
>   void *entry;
>   loff_t pos;
> @@ -1546,7 +1548,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>* to look up our filesystem block.
>*/
>   pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> + error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);
>   if (error)
>   goto unlock_entry;
>  
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 7004ce581a32..467c13ff6b40 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -801,7 +801,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
>  
>  #ifdef CONFIG_FS_DAX
>  static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> - unsigned flags, struct iomap *iomap)
> + unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
>   unsigned int blkbits = inode->i_blkbits;
>   unsigned long first_block = offset >> blkbits;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 420fe3deed39..918f94eff799 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3453,7 +3453,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  }
>  
>  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> - unsigned flags, struct iomap *iomap)
> + unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>   unsigned int blkbits = inode->i_blkbits;
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 4f8b5fd6c81f..0189262989f2 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -1164,7 +1164,8 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
>  }
>  
>  static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t

Re: [PATCH 11/15] iomap: use a function pointer for dio submits

2019-09-05 Thread Darrick J. Wong

On Thu, Sep 05, 2019 at 06:27:21PM +0200, Christoph Hellwig wrote:
> > +   if (dio->dops && dio->dops->submit_io) {
> > +   dio->dops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> > +   pos);
> 
> pos would still fit on the previous line as-is.
> 
> > +   dio->submit.cookie = BLK_QC_T_NONE;
> 
> But I think you should return the cookie from the submit function for
> completeness, even if btrfs would currently always return BLK_QC_T_NONE.

Yeah, that looked funny to me too.

--D

Re: [PATCH 01/15] iomap: Introduce CONFIG_FS_IOMAP_DEBUG

2019-09-03 Thread Darrick J. Wong

On Mon, Sep 02, 2019 at 07:18:00PM +0200, Christoph Hellwig wrote:
> On Mon, Sep 02, 2019 at 10:09:16AM -0700, Darrick J. Wong wrote:
> > On Mon, Sep 02, 2019 at 06:29:34PM +0200, Christoph Hellwig wrote:
> > > On Sun, Sep 01, 2019 at 03:08:22PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues 
> > > > 
> > > > To improve debugging abilities, especially invalid option
> > > > asserts.
> > > 
> > > Looking at the code I'd much rather have unconditional WARN_ON_ONCE
> > > statements in most places.  Including returning an error when we see
> > > something invalid in most cases.
> > 
> > Yeah, I was thinking something like this, which has the advantage that
> > the report format is familiar to XFS developers and will get picked up
> > by the automated error collection stuff I put in xfstests to complain
> > about any XFS assertion failures:
> > 
> > iomap: Introduce CONFIG_FS_IOMAP_DEBUG
> > 
> > To improve debugging abilities, especially invalid option
> > asserts.
> 
> I'd actually just rather have more unconditional WARN_ON_ONCE calls,
> including actually recovering from the situation by returning an actual
> error code.  That is more
> 
>   if (WARN_ON_ONCE(some_impossible_condition))
>   return -EIO;

Oh, right, WARNings actually do spit out the file and line number.
Let's do that. :)
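
For anyone following along outside the kernel tree, here is a userspace
imitation of that pattern.  toy_warn_on_once() and toy_validate_length()
are made-up names, not kernel API; the sketch only shows the shape: report
file and line the first time the impossible condition is seen, then
recover by returning an error rather than crashing.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Counts reports so a caller can observe the "once" behaviour. */
static unsigned int toy_warnings_printed;

/*
 * Hypothetical helper, not the kernel macro: print file and line the
 * first time cond is true for a given flag, and always return cond so
 * the caller can branch on it.
 */
static bool toy_warn_on_once(bool cond, bool *warned, const char *file,
			     int line)
{
	if (cond && !*warned) {
		*warned = true;
		toy_warnings_printed++;
		fprintf(stderr, "WARNING: at %s:%d\n", file, line);
	}
	return cond;
}

/* Recover from an "impossible" input with an error instead of a crash. */
static int toy_validate_length(long length)
{
	static bool warned;	/* one flag per call site */

	if (toy_warn_on_once(length <= 0, &warned, __FILE__, __LINE__))
		return -EIO;
	return 0;
}
```

Repeated bad inputs keep returning -EIO but only the first one produces a
log line, which keeps a misbehaving caller from flooding the log.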

--D

Re: [PATCH 02/15] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O

2019-09-03 Thread Darrick J. Wong

On Tue, Sep 03, 2019 at 09:05:36AM -0500, Goldwyn Rodrigues wrote:
> On 18:31 02/09, Christoph Hellwig wrote:
> > On Sun, Sep 01, 2019 at 03:08:23PM -0500, Goldwyn Rodrigues wrote:
> > > --- a/include/linux/iomap.h
> > > +++ b/include/linux/iomap.h
> > > @@ -37,6 +37,7 @@ struct vm_fault;
> > >  #define IOMAP_MAPPED 0x03/* blocks allocated at @addr */
> > > >  #define IOMAP_UNWRITTEN  0x04/* blocks allocated at @addr in unwritten state */
> > >  #define IOMAP_INLINE 0x05/* data inline in the inode */
> > > > +#define IOMAP_COW0x06/* copy data from srcmap before writing */
> > 
> > I don't think IOMAP_COW can be a type - it is a flag given that we
> > can do COW operations that allocate normal written extents (e.g. for
> > direct I/O or DAX) and for delayed allocations.
> > 
> 
> Ah.. we have come a full circle on this one. From going to a flag, to a type,
> and now back to flag. Personally, I like COW to be a flag, because we are
> doing a write, just doing extra steps which should be a flag.
> From previous objections, using two iomaps should help the cause and we
> can not worry about bloating.

Heh, ok, let's do a cow flag.  Thank you for driving the consensus. :)

(Sorry it took so long while we went around in circles.)

Also, I'm going on vacation starting Friday at noon PDT, so please have
the three patches that touch fs/iomap/ in before 7am Friday.  (Assuming
you're not making drastic changes to your iomap changes, they've tested
ok so far.)

--D

> 
> -- 
> Goldwyn

Re: [PATCH v3 0/15] Btrfs iomap

2019-09-02 Thread Darrick J. Wong

On Tue, Sep 03, 2019 at 11:51:24AM +0800, Shiyang Ruan wrote:
> 
> 
> On 9/3/19 12:43 AM, Christoph Hellwig wrote:
> > On Sun, Sep 01, 2019 at 03:08:21PM -0500, Goldwyn Rodrigues wrote:
> > > This is an effort to use iomap for btrfs. This would keep most
> > > responsibility of page handling during writes in iomap code, hence
> > > code reduction. For CoW support, changes are needed in iomap code
> > > to make sure we perform a copy before the write.
> > > This is in line with the discussion we had during adding dax support in
> > > btrfs.
> > 
> > This looks pretty good modulo a few comments.
> > 
> > Can you please convert the XFS code to use your two iomaps for COW
> > approach as well to validate it?
> 
> Hi,
> 
> The XFS part of dax CoW support has been implemented recently.  Please
> review this[1] if necessary.  It's based on this iomap patchset(the 1st
> version), and uses the new srcmap.
> 
> [1]: https://lkml.org/lkml/2019/7/31/449

It sure would be nice to have (a) this patchset of Goldwyn's cleaned up
a bit per the review comments and (b) the XFS DAX COW series rebased on
that (instead of a month-old submission). ;)

--D

> -- 
> Thanks,
> Shiyang Ruan.
> > 
> > Also the iomap_file_dirty helper would really benefit from using the
> > two iomaps, any chance you could look into improving it to use your
> > new infrastructure?
> > 
> > 
> 
> 
>

Re: [PATCH 02/15] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O

2019-09-02 Thread Darrick J. Wong

On Mon, Sep 02, 2019 at 06:31:04PM +0200, Christoph Hellwig wrote:
> On Sun, Sep 01, 2019 at 03:08:23PM -0500, Goldwyn Rodrigues wrote:
> > --- a/include/linux/iomap.h
> > +++ b/include/linux/iomap.h
> > @@ -37,6 +37,7 @@ struct vm_fault;
> >  #define IOMAP_MAPPED   0x03/* blocks allocated at @addr */
> >  #define IOMAP_UNWRITTEN0x04/* blocks allocated at @addr in unwritten state */
> >  #define IOMAP_INLINE   0x05/* data inline in the inode */
> > +#define IOMAP_COW  0x06/* copy data from srcmap before writing */
> 
> I don't think IOMAP_COW can be a type - it is a flag given that we
> can do COW operations that allocate normal written extents (e.g. for
> direct I/O or DAX) and for delayed allocations.

If iomap_apply always zeros out @srcmap before calling ->iomap_begin, do
we even need a flag/type code?  Or does it suffice to check that
srcmap.length > 0 and use it appropriately?
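
Roughly what I mean, as a userspace sketch.  struct toy_iomap, the
toy_begin_* callbacks and toy_apply() are all hypothetical and heavily
trimmed stand-ins, not the real iomap interfaces:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily trimmed stand-in for struct iomap. */
struct toy_iomap {
	uint64_t addr;
	uint64_t length;
};

/* Assumed callback shape: fills in *iomap, and *srcmap only for COW. */
typedef int (*toy_begin_fn)(struct toy_iomap *iomap,
			    struct toy_iomap *srcmap);

/* Ordinary overwrite: srcmap is left exactly as the caller zeroed it. */
static int toy_begin_plain(struct toy_iomap *iomap, struct toy_iomap *srcmap)
{
	(void)srcmap;
	iomap->addr = 8192;
	iomap->length = 4096;
	return 0;
}

/* COW: report both the new destination and the extent to read from. */
static int toy_begin_cow(struct toy_iomap *iomap, struct toy_iomap *srcmap)
{
	iomap->addr = 65536;
	iomap->length = 4096;
	srcmap->addr = 8192;
	srcmap->length = 4096;
	return 0;
}

/* Zero srcmap first so "untouched" is distinguishable from "filled in". */
static int toy_apply(toy_begin_fn begin, bool *needs_copy)
{
	struct toy_iomap iomap, srcmap;
	int ret;

	memset(&iomap, 0, sizeof(iomap));
	memset(&srcmap, 0, sizeof(srcmap));

	ret = begin(&iomap, &srcmap);
	if (ret)
		return ret;

	/* No flag or type code: a non-empty srcmap means read from it. */
	*needs_copy = srcmap.length > 0;
	return 0;
}
```

This only works if every caller zeroes srcmap before the callback runs,
which is the assumption stated in the question above.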

--D

Re: [PATCH 01/15] iomap: Introduce CONFIG_FS_IOMAP_DEBUG

2019-09-02 Thread Darrick J. Wong

On Mon, Sep 02, 2019 at 06:29:34PM +0200, Christoph Hellwig wrote:
> On Sun, Sep 01, 2019 at 03:08:22PM -0500, Goldwyn Rodrigues wrote:
> > From: Goldwyn Rodrigues 
> > 
> > To improve debugging abilities, especially invalid option
> > asserts.
> 
> Looking at the code I'd much rather have unconditional WARN_ON_ONCE
> statements in most places.  Including returning an error when we see
> something invalid in most cases.

Yeah, I was thinking something like this, which has the advantage that
the report format is familiar to XFS developers and will get picked up
by the automated error collection stuff I put in xfstests to complain
about any XFS assertion failures:

iomap: Introduce CONFIG_FS_IOMAP_DEBUG

To improve debugging abilities, especially invalid option
asserts.

Signed-off-by: Goldwyn Rodrigues 
[darrick: restructure it to follow what xfs does]
Reviewed-by: Darrick J. Wong 
Signed-off-by: Darrick J. Wong 
---
 fs/Kconfig|3 +++
 fs/iomap/apply.c  |   10 ++
 include/linux/iomap.h |   14 ++
 3 files changed, 27 insertions(+)

diff --git a/fs/Kconfig b/fs/Kconfig
index bfb1c6095c7a..4bed5df9b55f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -19,6 +19,9 @@ if BLOCK
 
 config FS_IOMAP
bool
+config FS_IOMAP_DEBUG
+   bool "Debugging for the iomap code"
+   depends on FS_IOMAP
 
 source "fs/ext2/Kconfig"
 source "fs/ext4/Kconfig"
diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
index 54c02aecf3cd..95cc3a2cadd5 100644
--- a/fs/iomap/apply.c
+++ b/fs/iomap/apply.c
@@ -8,6 +8,16 @@
 #include 
 #include 
 
+#ifdef CONFIG_FS_IOMAP_DEBUG
+void
+iomap_assertion_failed(const char *expr, const char *file, int line)
+{
+   printk(KERN_EMERG "IOMAP: Assertion failed: %s, file: %s, line: %d",
+  expr, file, line);
+   WARN_ON_ONCE(1);
+}
+#endif
+
 /*
  * Execute a iomap write on a segment of the mapping that spans a
  * contiguous range of pages that have identical block mapping state.
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 834d3923e2f2..b3d5d6f323cf 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -20,6 +20,20 @@ struct page;
 struct vm_area_struct;
 struct vm_fault;
 
+#ifdef CONFIG_FS_IOMAP_DEBUG
+
+extern void iomap_assertion_failed(const char *expr, const char *f, int l);
+
+#define IOMAP_ASSERT(expr) \
+   do { \
+   if (unlikely(!(expr))) \
+   iomap_assertion_failed(#expr, __FILE__, __LINE__); \
+   } while(0)
+
+#else
+#define IOMAP_ASSERT(expr) ((void)0)
+#endif
+
 /*
  * Types of block ranges for iomap mappings:
  */

Re: [PATCH 03/15] iomap: Read page from srcmap for IOMAP_COW

2019-09-02 Thread Darrick J. Wong

On Mon, Sep 02, 2019 at 06:31:24PM +0200, Christoph Hellwig wrote:
> On Sun, Sep 01, 2019 at 03:08:24PM -0500, Goldwyn Rodrigues wrote:
> 
> > +   iomap_assert(!(iomap->flags & IOMAP_F_BUFFER_HEAD));
> > +   iomap_assert(srcmap->type == IOMAP_HOLE || srcmap->addr > 0);
> 
> 0 can be a valid address in various file systems, so I don't think we
> can just exclude it.  Then again COWing from a hole seems pointless,
> doesn't it?

XFS does that if you set a cowextsize hint and a speculative cow
preallocation ends up covering a hole.  Granted I don't think there's
much point in reading from a COW fork extent to fill in an unaligned
buffered write since it /should/ just end up zero-filling the pagecache
regardless of fork... but I don't see much harm in doing that.

--D

> So just check for addr != IOMAP_NULL_ADDR here?
> 
> >  
> > @@ -961,7 +966,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> > if (IS_DAX(inode))
> > status = iomap_dax_zero(pos, offset, bytes, iomap);
> > else
> > -   status = iomap_zero(inode, pos, offset, bytes, iomap);
> > +   status = iomap_zero(inode, pos, offset, bytes, iomap, 
> > srcmap);
> 
> This introduces an > 80 character line.

Re: [PATCH 01/15] iomap: Introduce CONFIG_FS_IOMAP_DEBUG

2019-09-01 Thread Darrick J. Wong

On Sun, Sep 01, 2019 at 03:08:22PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> To improve debugging abilities, especially invalid option
> asserts.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/Kconfig|  3 +++
>  include/linux/iomap.h | 11 +++
>  2 files changed, 14 insertions(+)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index bfb1c6095c7a..4bed5df9b55f 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -19,6 +19,9 @@ if BLOCK
>  
>  config FS_IOMAP
>   bool
> +config FS_IOMAP_DEBUG
> + bool "Debugging for the iomap code"
> + depends on FS_IOMAP
>  
>  source "fs/ext2/Kconfig"
>  source "fs/ext4/Kconfig"
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index bc499ceae392..209b6c93674e 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -18,6 +18,17 @@ struct page;
>  struct vm_area_struct;
>  struct vm_fault;
>  
> +#ifdef CONFIG_FS_IOMAP_DEBUG
> +#define iomap_assert(expr) \
> + if(!(expr)) { \
> + printk( "Assertion failed! %s,%s,%s,line=%d\n",\
> +#expr,__FILE__,__func__,__LINE__); \
> + BUG(); \

Hmm, this log message ought to have a priority level, right?

#define IOMAP_ASSERT(expr) do { WARN_ON(!(expr)); } while(0)

(or crib ASSERT/ass{fail,warn} from XFS? :D)

> + }
> +#else
> +#define iomap_assert(expr)

((void)0)

So we don't just have stray semicolons in the code stream?
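
To spell out why the macro shape matters, here is a small userspace
sketch.  TOY_DEBUG, TOY_ASSERT and toy_classify() are invented names for
illustration.  A macro that expands to a bare "if (...) { ... }" block
stops compiling as soon as it is used as the body of an if that has an
else, because the semicolon after the closing brace ends the statement
early; wrapping it in do { } while (0) makes it a single statement, and
((void)0) keeps the disabled variant a real expression statement:

```c
#include <stdio.h>

/* Remove this define to compile the checks away. */
#define TOY_DEBUG 1

static unsigned int toy_assert_failures;

#ifdef TOY_DEBUG
#define TOY_ASSERT(expr) \
	do { \
		if (!(expr)) { \
			fprintf(stderr, \
				"assertion failed: %s, file: %s, line: %d\n", \
				#expr, __FILE__, __LINE__); \
			toy_assert_failures++; \
		} \
	} while (0)
#else
#define TOY_ASSERT(expr) ((void)0)
#endif

/* The macro is the whole body of an if that also has an else branch. */
static int toy_classify(int value)
{
	if (value < 0)
		TOY_ASSERT(value != -1);
	else
		return 1;
	return 0;
}
```

Both variants of TOY_ASSERT leave toy_classify() syntactically identical,
so toggling the debug option cannot change how the surrounding if/else
parses.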

--D

> +#endif
> +
>  /*
>   * Types of block ranges for iomap mappings:
>   */
> -- 
> 2.16.4
>

Re: [PATCH] fstests: generic/500 doesn't work for btrfs

2019-08-19 Thread Darrick J. Wong

On Sun, Aug 18, 2019 at 11:44:28PM +0800, Eryu Guan wrote:
> On Thu, Aug 15, 2019 at 02:26:59PM -0400, Josef Bacik wrote:
> > Btrfs does COW, so when we unlink the file we need to update metadata
> > and write it to a new location, which we can't do because the thinp is
> > full.  This results in an EIO during a metadata write, which makes us
> > flip read only, thus making it impossible to fstrim the fs.  Just make
> > it so we skip this test for btrfs.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  tests/generic/500 | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/tests/generic/500 b/tests/generic/500
> > index 201d8b9f..5cd7126f 100755
> > --- a/tests/generic/500
> > +++ b/tests/generic/500
> > @@ -49,6 +49,12 @@ _supported_os Linux
> >  _require_scratch_nocheck
> >  _require_dm_target thin-pool
> >  
> > > +# The unlink below will result in new metadata blocks for btrfs because of CoW,
> > > +# and since we've filled the thinp device it'll return EIO, which will make
> > > +# btrfs flip read only, making it fail this test when it just won't work right
> > > +# for us in the first place.
> > +test $FSTYP == "btrfs"  && _notrun "btrfs doesn't work that way lol"
> 
> I'm wondering if we could introduce a proper _require rule to cover this
> case? e.g. require the fs doesn't allocate new blocks on unlink? or
> something like that. But I'm not sure what's the proper fs feature to
> require here, any suggestions?

I'd be careful with this -- xfs can allocate new metadata blocks on
unlink too -- changes in the free space btrees, expansion of the free
inode btree, etc.  For the 20 inodes in play in g/500 this won't be the
case, but if you had a test that created 20,000 inodes, then that could
happen.

--D

> Thanks,
> Eryu
> 
> > +
> >  # Require underlying device support discard
> >  _scratch_mkfs >>$seqres.full 2>&1
> >  _scratch_mount
> > -- 
> > 2.21.0
> >

Re: [PATCH] fstests: generic/500 doesn't work for btrfs

2019-08-15 Thread Darrick J. Wong

On Thu, Aug 15, 2019 at 02:26:59PM -0400, Josef Bacik wrote:
> Btrfs does COW, so when we unlink the file we need to update metadata
> and write it to a new location, which we can't do because the thinp is
> full.  This results in an EIO during a metadata write, which makes us
> flip read only, thus making it impossible to fstrim the fs.  Just make
> it so we skip this test for btrfs.
> 
> Signed-off-by: Josef Bacik 
> ---
>  tests/generic/500 | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/tests/generic/500 b/tests/generic/500
> index 201d8b9f..5cd7126f 100755
> --- a/tests/generic/500
> +++ b/tests/generic/500
> @@ -49,6 +49,12 @@ _supported_os Linux
>  _require_scratch_nocheck
>  _require_dm_target thin-pool
>  
> +# The unlink below will result in new metadata blocks for btrfs because of CoW,
> +# and since we've filled the thinp device it'll return EIO, which will make
> +# btrfs flip read only, making it fail this test when it just won't work right
> +# for us in the first place.
> +test $FSTYP == "btrfs"  && _notrun "btrfs doesn't work that way lol"

I did it for the lulz,
Reviewed-by: Darrick J. Wong 

--D

> +
>  # Require underlying device support discard
>  _scratch_mkfs >>$seqres.full 2>&1
>  _scratch_mount
> -- 
> 2.21.0
>

Re: [PATCH 1/2] fstests: make generic/500 xfs+ext4 only

2019-08-15 Thread Darrick J. Wong

On Thu, Aug 15, 2019 at 11:00:31AM -0400, Josef Bacik wrote:
> I recently fixed some bugs in btrfs's enospc handling that made it start
> failing generic/500.
> 
> The point of this test is to make the thin provisioned device run out of
> space, which results in an EIO being seen on a device from the file
> system's perspective.  This is fine for xfs and ext4, whose metadata is
> being overwritten and already allocated on the thin provisioned device.
> They get an EIO on data writes, fstrim to free up the space, and keep it
> going.
> 
> Btrfs however has dynamic metadata, so the rm -rf could result in
> metadata IO being done on the file system.  Since the thin provisioned
> device is out of space this gives us an EIO, and we flip read only.  We
> didn't remove the file, so the fstrim doesn't recover space anyway, so
> we can't even fstrim and remount.
> 
> Make this test for ext4/xfs only, it just simply won't work right for
> btrfs in its current form.

How about:

test $FSTYP = "btrfs" && _notrun "btrfs doesn't work that way lol"

since afaik btrfs is the only fs that shouldn't run this test?
Also, I think Ted was trying to kill off tests/shared/...

--D

> 
> Signed-off-by: Josef Bacik 
> ---
>  tests/generic/500 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tests/generic/500 b/tests/generic/500
> index 201d8b9f..1cbd9d65 100755
> --- a/tests/generic/500
> +++ b/tests/generic/500
> @@ -44,7 +44,7 @@ _cleanup()
>  rm -f $seqres.full
>  
>  # real QA test starts here
> -_supported_fs generic
> +_supported_fs xfs ext4
>  _supported_os Linux
>  _require_scratch_nocheck
>  _require_dm_target thin-pool
> -- 
> 2.21.0
>

Re: mke2fs accepts block size not mentioned in its man page

2019-08-08 Thread Darrick J. Wong

On Fri, Jul 05, 2019 at 01:35:02PM +0800, Qu Wenruo wrote:
> Hi,
> 
> Just doing some tests on aarch64 with 64K page size.
> 
> Man page of mke2fs only mentions 3 valid block sizes: 1k, 2k, 4k.
> But in the real world, we can pass 64K as block size for it without any problem:
> 
>   $mke2fs -F -t ext3 -b 65536 /dev/loop1
>   Warning: blocksize 65536 not usable on most systems.
>   mke2fs 1.45.2 (27-May-2019)
>   /dev/loop1 contains a btrfs file system
>   Discarding device blocks: done
>   Creating filesystem with 81920 64k blocks and 81920 inodes
>   Filesystem UUID: 311bb224-6d2d-44a7-9790-92c4878d6549
>   [...]
> 
> It's great to see mke2fs accepts 64K as nodesize, which allows
> btrfs-convert to work.
> (If the blocksize defaulted to 4K, or mke2fs rejected 64K, btrfs-convert
> would still work, but the result couldn't be mounted on a system with a
> 64K page size)
> 
> Shouldn't the man page mention all valid values?

You'd think so, but 64k blocks only work on machines with 64k pages,
so that's why it doesn't mention anything beyond the lowest common
denominator. :/

--D

> Thanks,
> Qu
>

Re: [PATCH 01/13] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O

2019-08-02 Thread Darrick J. Wong

On Fri, Aug 02, 2019 at 05:00:36PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Introduces a new type IOMAP_COW, which means the data at offset
> must be read from a srcmap and copied before performing the
> write on the offset.
> 
> The srcmap is used to identify where the read is to be performed
> from. This is passed to iomap->begin() of the respective
> filesystem, which is supposed to put in the details for
> reading before performing the copy for CoW.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c   |  8 +---
>  fs/ext2/inode.c|  2 +-
>  fs/ext4/inode.c|  2 +-
>  fs/gfs2/bmap.c |  3 ++-
>  fs/iomap/apply.c   |  5 +++--
>  fs/iomap/buffered-io.c | 14 +++---
>  fs/iomap/direct-io.c   |  2 +-
>  fs/iomap/fiemap.c  |  4 ++--
>  fs/iomap/seek.c|  4 ++--
>  fs/iomap/swapfile.c|  3 ++-
>  fs/xfs/xfs_iomap.c |  9 ++---
>  include/linux/iomap.h  |  6 --
>  12 files changed, 36 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index a237141d8787..b21d9a9cde2b 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1090,7 +1090,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> - struct iomap *iomap)
> + struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct block_device *bdev = iomap->bdev;
>   struct dax_device *dax_dev = iomap->dax_dev;
> @@ -1248,6 +1248,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   unsigned long vaddr = vmf->address;
>   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   unsigned flags = IOMAP_FAULT;
>   int error, major = 0;
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -1292,7 +1293,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>* the file system block size to be equal the page size, which means
>* that we never have to deal with more than a single extent here.
>*/
> - error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> + error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
>   if (iomap_errp)
>   *iomap_errp = error;
>   if (error) {
> @@ -1472,6 +1473,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   struct inode *inode = mapping->host;
>   vm_fault_t result = VM_FAULT_FALLBACK;
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   pgoff_t max_pgoff;
>   void *entry;
>   loff_t pos;
> @@ -1546,7 +1548,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>* to look up our filesystem block.
>*/
>   pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> + error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);

/me wonders aloud if he ought to add a helper function to standardize at
least some of the validation of the iomap that gets returned from
->iomap_begin invocations...

>   if (error)
>   goto unlock_entry;
>  



> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 54c02aecf3cd..6cdb362fff36 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -24,6 +24,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
> unsigned flags,
>   const struct iomap_ops *ops, void *data, iomap_actor_t actor)
>  {
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   loff_t written = 0, ret;
>  
>   /*
> @@ -38,7 +39,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
> unsigned flags,
>* expose transient stale data. If the reserve fails, we can safely
>* back out at this point as there is nothing to undo.
>*/
> - ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
> + ret = ops->iomap_begin(inode, pos, length, flags, &iomap, &srcmap);
>   if (ret)
>   return ret;
>   if (WARN_ON(iomap.offset > pos))

...because I wonder if we ought to have a debugging assert here just in
case an ->iomap_begin returns IOMAP_COW in response to an IOMAP_WRITE
request?  Basic sanity checks to catch accidental API misuse, etc.

Eh we probably ought to have a CONFIG_IOMAP_DEBUG so that non-developers
don't necessarily have to pay the assert costs or something like that.

> @@ -58,7 +59,7 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, 
> unsigned flags,
>* we can do the copy-in page by page without having to worry about
>* failures exposing transient data.
>*/
> - written = actor(inode, pos, length, data, &iomap);
> + written = actor(inode, pos, length, data, &iomap, &srcmap);
>  
>   /*
>*

Re: [PATCH 02/13] iomap: Read page from srcmap for IOMAP_COW

2019-08-02 Thread Darrick J. Wong

On Fri, Aug 02, 2019 at 05:00:37PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> In case of an IOMAP_COW, read a page from the srcmap before
> performing a write on the page.

Looks ok, I think...
Reviewed-by: Darrick J. Wong 

--D

> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/iomap/buffered-io.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f27756c0b31c..a96cc26eec92 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -581,7 +581,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len,
>  
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
> flags,
> - struct page **pagep, struct iomap *iomap)
> + struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
>  {
>   const struct iomap_page_ops *page_ops = iomap->page_ops;
>   pgoff_t index = pos >> PAGE_SHIFT;
> @@ -607,6 +607,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>  
>   if (iomap->type == IOMAP_INLINE)
>   iomap_read_inline_data(inode, page, iomap);
> + else if (iomap->type == IOMAP_COW)
> + status = __iomap_write_begin(inode, pos, len, page, srcmap);
>   else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>   status = __block_write_begin_int(page, pos, len, NULL, iomap);
>   else
> @@ -772,7 +774,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>   }
>  
>   status = iomap_write_begin(inode, pos, bytes, flags, &page,
> - iomap);
> + iomap, srcmap);
>   if (unlikely(status))
>   break;
>  
> @@ -871,7 +873,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>   return PTR_ERR(rpage);
>  
>   status = iomap_write_begin(inode, pos, bytes,
> -AOP_FLAG_NOFS, &page, iomap);
> +AOP_FLAG_NOFS, &page, iomap, srcmap);
>   put_page(rpage);
>   if (unlikely(status))
>   return status;
> @@ -917,13 +919,13 @@ iomap_file_dirty(struct inode *inode, loff_t pos, 
> loff_t len,
>  EXPORT_SYMBOL_GPL(iomap_file_dirty);
>  
>  static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> - unsigned bytes, struct iomap *iomap)
> + unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct page *page;
>   int status;
>  
>   status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
> -iomap);
> +iomap, srcmap);
>   if (status)
>   return status;
>  
> @@ -961,7 +963,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, 
> loff_t count,
>   if (IS_DAX(inode))
>   status = iomap_dax_zero(pos, offset, bytes, iomap);
>   else
> - status = iomap_zero(inode, pos, offset, bytes, iomap);
> + status = iomap_zero(inode, pos, offset, bytes, iomap, srcmap);
>   if (status < 0)
>   return status;
>  
> -- 
> 2.16.4
>

Re: [PATCH 10/13] iomap: use a function pointer for dio submits

2019-08-02 Thread Darrick J. Wong

On Fri, Aug 02, 2019 at 05:00:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> This helps filesystems to perform tasks on the bio while
> submitting for I/O. Since btrfs requires the position
> we are working on, pass pos to iomap_dio_submit_bio()

What /does/ btrfs_submit_direct do, anyway?  Looks like it's a custom
submission function that ... does something related to setting
checksums?  And, uh, RAID?

> The correct place for submit_io() is not page_ops. Would it be
> better to rename the structure to something like iomap_io_ops
> or put it directly under struct iomap?

Seeing as the ->iomap_begin handler knows if the requested op is a
buffered write or a direct write, what if we just declare a union of
ops?

e.g.

struct iomap_page_ops;
struct iomap_directio_ops;

struct iomap {

union {
const struct iomap_page_ops *page_ops;
const struct iomap_directio_ops *directio_ops;
};
};

--D
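
To make the union idea concrete, here is a small self-contained sketch with mock types. The assumption being illustrated is that the request flags already tell both sides which union member is live, so no extra tag field is needed. `DEMO_IOMAP_DIRECT`, the `demo_` structs and the callbacks are all hypothetical, not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical request flag: set for direct I/O, clear for buffered. */
#define DEMO_IOMAP_DIRECT	(1u << 0)

struct demo_page_ops {
	int (*page_prepare)(long pos);
};

struct demo_directio_ops {
	int (*submit_io)(long pos);
};

/* Mock mapping: one pointer slot, interpreted according to the request. */
struct demo_iomap {
	union {
		const struct demo_page_ops	*page_ops;
		const struct demo_directio_ops	*directio_ops;
	};
};

static int demo_page_prepare(long pos) { (void)pos; return 1; }
static int demo_submit_io(long pos) { (void)pos; return 2; }

static const struct demo_page_ops demo_buffered_ops = {
	.page_prepare = demo_page_prepare,
};
static const struct demo_directio_ops demo_dio_ops = {
	.submit_io = demo_submit_io,
};

/* Stand-in for ->iomap_begin: it knows which kind of I/O was requested. */
static void demo_iomap_begin(unsigned int flags, struct demo_iomap *iomap)
{
	if (flags & DEMO_IOMAP_DIRECT)
		iomap->directio_ops = &demo_dio_ops;
	else
		iomap->page_ops = &demo_buffered_ops;
}

/* The caller reads back only the member that matches the flags it passed. */
static int demo_do_io(unsigned int flags, long pos)
{
	struct demo_iomap iomap = { .page_ops = NULL };

	demo_iomap_begin(flags, &iomap);
	if (flags & DEMO_IOMAP_DIRECT) {
		if (iomap.directio_ops && iomap.directio_ops->submit_io)
			return iomap.directio_ops->submit_io(pos);
		return 0;
	}
	if (iomap.page_ops && iomap.page_ops->page_prepare)
		return iomap.page_ops->page_prepare(pos);
	return 0;
}
```

The cost of this layout is that nothing stops a buggy caller from reading the wrong member, which is one more place a debug assert could help.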

> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/iomap/direct-io.c  | 16 +++-
>  include/linux/iomap.h |  1 +
>  2 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5279029c7a3c..a802e66bf11f 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -59,7 +59,7 @@ int iomap_dio_iopoll(struct kiocb *kiocb, bool spin)
>  EXPORT_SYMBOL_GPL(iomap_dio_iopoll);
>  
>  static void iomap_dio_submit_bio(struct iomap_dio *dio, struct iomap *iomap,
> - struct bio *bio)
> + struct bio *bio, loff_t pos)
>  {
>   atomic_inc(&dio->ref);
>  
> @@ -67,7 +67,13 @@ static void iomap_dio_submit_bio(struct iomap_dio *dio, 
> struct iomap *iomap,
>   bio_set_polled(bio, dio->iocb);
>  
>   dio->submit.last_queue = bdev_get_queue(iomap->bdev);
> - dio->submit.cookie = submit_bio(bio);
> + if (iomap->page_ops && iomap->page_ops->submit_io) {
> + iomap->page_ops->submit_io(bio, file_inode(dio->iocb->ki_filp),
> + pos);
> + dio->submit.cookie = BLK_QC_T_NONE;
> + } else {
> + dio->submit.cookie = submit_bio(bio);
> + }
>  }
>  
>  static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> @@ -195,7 +201,7 @@ iomap_dio_zero(struct iomap_dio *dio, struct iomap 
> *iomap, loff_t pos,
>   get_page(page);
>   __bio_add_page(bio, page, len, 0);
>   bio_set_op_attrs(bio, REQ_OP_WRITE, flags);
> - iomap_dio_submit_bio(dio, iomap, bio);
> + iomap_dio_submit_bio(dio, iomap, bio, pos);
>  }
>  
>  static loff_t
> @@ -301,11 +307,11 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, 
> loff_t length,
>   iov_iter_advance(dio->submit.iter, n);
>  
>   dio->size += n;
> - pos += n;
>   copied += n;
>  
>   nr_pages = iov_iter_npages(&iter, BIO_MAX_PAGES);
> - iomap_dio_submit_bio(dio, iomap, bio);
> + iomap_dio_submit_bio(dio, iomap, bio, pos);
> + pos += n;
>   } while (nr_pages);
>  
>   /*
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5b2055e8ca8a..6617e4b6fb6d 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -92,6 +92,7 @@ struct iomap_page_ops {
>   struct iomap *iomap);
>   void (*page_done)(struct inode *inode, loff_t pos, unsigned copied,
>   struct page *page, struct iomap *iomap);
> + dio_submit_t*submit_io;
>  };
>  
>  /*
> -- 
> 2.16.4
>

[PATCH v2 4/4] vfs: don't allow most setxattr to immutable files

2019-07-01 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

However, we don't actually check the immutable flag in the setattr code,
which means that we can update inode flags and project ids and extent
size hints on supposedly immutable files.  Therefore, reject setflags
and fssetxattr calls that would change anything on a file that is
already immutable and will remain that way.

Signed-off-by: Darrick J. Wong 
---
v2: use memcmp instead of open coding a bunch of checks
---
 fs/inode.c |   17 +
 1 file changed, 17 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index cf07378e5731..31f694e405fe 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2214,6 +2214,14 @@ int vfs_ioc_setflags_prepare(struct inode *inode, 
unsigned int oldflags,
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
+   /*
+* We aren't allowed to change any other flags if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((oldflags & FS_IMMUTABLE_FL) && (flags & FS_IMMUTABLE_FL) &&
+   oldflags != flags)
+   return -EPERM;
+
/*
 * Now that we're done checking the new flags, flush all pending IO and
 * dirty mappings before setting S_IMMUTABLE on an inode via
@@ -2284,6 +2292,15 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
 
+   /*
+* We aren't allowed to change any fields if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
+   (fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
+   memcmp(fa, old_fa, offsetof(struct fsxattr, fsx_pad)))
+   return -EPERM;
+
/* Extent size hints of zero turn off the flags. */
if (fa->fsx_extsize == 0)
fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
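
The v2 change ("use memcmp instead of open coding a bunch of checks") can be tried out in userspace with a mock structure. The sketch assumes a layout of five 32-bit fields followed by a pad array, so that comparing everything before `fsx_pad` covers every meaningful field. `struct demo_fsxattr`, `DEMO_XFLAG_IMMUTABLE` and `demo_check_immutable()` are illustrative stand-ins rather than the kernel's definitions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock flag bit standing in for FS_XFLAG_IMMUTABLE. */
#define DEMO_XFLAG_IMMUTABLE	0x00000008u

/* Mock of struct fsxattr: five 32-bit fields, then padding. */
struct demo_fsxattr {
	uint32_t	fsx_xflags;
	uint32_t	fsx_extsize;
	uint32_t	fsx_nextents;
	uint32_t	fsx_projid;
	uint32_t	fsx_cowextsize;
	unsigned char	fsx_pad[8];
};

/*
 * Reject any change if the file is immutable now and the new values
 * keep it immutable.  Clearing the flag is allowed to carry other
 * changes along with it.
 */
static int demo_check_immutable(const struct demo_fsxattr *old_fa,
				const struct demo_fsxattr *fa)
{
	if ((old_fa->fsx_xflags & DEMO_XFLAG_IMMUTABLE) &&
	    (fa->fsx_xflags & DEMO_XFLAG_IMMUTABLE) &&
	    memcmp(fa, old_fa, offsetof(struct demo_fsxattr, fsx_pad)))
		return -EPERM;
	return 0;
}
```

Stopping the comparison at `fsx_pad` means stray bytes in the padding do not count as a change, while any new field added before the pad would be picked up without touching the check.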

[PATCH 3/4] vfs: flush and wait for io when setting the immutable flag via FSSETXATTR

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_FSSETXATTR to set the immutable flag on a file,
we need to ensure that userspace can't continue to write the file after
the file becomes immutable.  To make that happen, we have to flush all
the dirty pagecache pages to disk to ensure that we can fail a page
fault on a mmap'd region, wait for pending directio to complete, and
hope the caller locked out any new writes by holding the inode lock.

XFS has more complex locking than other FSSETXATTR implementations so we
have to keep the checking and preparation code in different functions.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c   |2 +
 fs/ext4/ioctl.c|2 +
 fs/f2fs/file.c |2 +
 fs/inode.c |   31 +++
 fs/xfs/xfs_ioctl.c |   71 +++-
 include/linux/fs.h |3 ++
 6 files changed, 90 insertions(+), 21 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3cd66efdb99d..aeffe3fd99c4 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -420,7 +420,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void 
__user *arg)
 
simple_fill_fsxattr(&old_fa,
btrfs_inode_flags_to_xflags(binode->flags));
-   ret = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   ret = vfs_ioc_fssetxattr_prepare(inode, &old_fa, &fa);
if (ret)
goto out_unlock;
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 566dfac28b3f..69810e59f89a 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1109,7 +1109,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
 
inode_lock(inode);
ext4_fill_fsxattr(inode, &old_fa);
-   err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   err = vfs_ioc_fssetxattr_prepare(inode, &old_fa, &fa);
if (err)
goto out;
flags = (ei->i_flags & ~EXT4_FL_XFLAG_VISIBLE) |
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 8799468724f9..b47f22eb483e 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2825,7 +2825,7 @@ static int f2fs_ioc_fssetxattr(struct file *filp, 
unsigned long arg)
inode_lock(inode);
 
f2fs_fill_fsxattr(inode, &old_fa);
-   err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   err = vfs_ioc_fssetxattr_prepare(inode, &old_fa, &fa);
if (err)
goto out;
flags = (fi->i_flags & ~F2FS_FL_XFLAG_VISIBLE) |
diff --git a/fs/inode.c b/fs/inode.c
index 65a412af3ffb..cf07378e5731 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2293,3 +2293,34 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
return 0;
 }
 EXPORT_SYMBOL(vfs_ioc_fssetxattr_check);
+
+/*
+ * Generic function to check FS_IOC_FSSETXATTR values and reject any invalid
+ * configurations.  If none are found, flush all pending IO and dirty mappings
+ * before setting S_IMMUTABLE on an inode.  If the flush fails we'll clear the
+ * flag before returning error.
+ *
+ * Note: the caller must hold whatever locks are necessary to block any other
+ * threads from starting a write to the file.
+ */
+int vfs_ioc_fssetxattr_prepare(struct inode *inode,
+  const struct fsxattr *old_fa,
+  struct fsxattr *fa)
+{
+   int ret;
+
+   ret = vfs_ioc_fssetxattr_check(inode, old_fa, fa);
+   if (ret)
+   return ret;
+
+   if (!S_ISREG(inode->i_mode) || IS_IMMUTABLE(inode) ||
+   !(fa->fsx_xflags & FS_XFLAG_IMMUTABLE))
+   return 0;
+
+   inode_set_flags(inode, S_IMMUTABLE, S_IMMUTABLE);
+   ret = inode_drain_writes(inode);
+   if (ret)
+   inode_set_flags(inode, 0, S_IMMUTABLE);
+   return ret;
+}
+EXPORT_SYMBOL(vfs_ioc_fssetxattr_prepare);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index fe29aa61293c..552f18554c48 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1057,6 +1057,30 @@ xfs_ioctl_setattr_xflags(
return 0;
 }
 
+/*
+ * If we're setting immutable on a regular file, we need to prevent new writes.
+ * Once we've done that, we must wait for all the other writes to complete.
+ *
+ * The caller must use @join_flags to release the locks which are held on @ip
+ * regardless of return value.
+ */
+static int
+xfs_ioctl_setattr_drain_writes(
+   struct xfs_inode*ip,
+   const struct fsxattr*fa,
+   int *join_flags)
+{
+   struct inode*inode = VFS_I(ip);
+
+   if (!S_ISREG(inode->i_mode) || !(fa->fsx_xflags & FS_XFLAG_IMMUTABLE))
+   return 0;
+
+   *join_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+   xfs_ilock(ip, *join_flags);
+
+   return inode_drain_writes(inode);
+}
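
The ordering used by vfs_ioc_fssetxattr_prepare() in this patch (set the flag, drain outstanding writes, clear the flag again if the drain fails) can be sketched with a mock inode. This is a simplified userspace model under stated assumptions: `struct demo_inode`, `demo_drain_writes()` and `demo_prepare_immutable()` are hypothetical, and the mock drain simply returns a preset value instead of flushing anything.

```c
#include <errno.h>
#include <stdbool.h>

/* Mock flag bit standing in for S_IMMUTABLE. */
#define DEMO_S_IMMUTABLE	(1u << 0)

struct demo_inode {
	bool		is_regular;	/* stands in for S_ISREG() */
	unsigned int	flags;
	int		drain_result;	/* what the mock drain returns */
	int		drains;		/* how many times we drained */
};

/* Mock of inode_drain_writes(): no real I/O, just a canned result. */
static int demo_drain_writes(struct demo_inode *inode)
{
	inode->drains++;
	return inode->drain_result;
}

/*
 * Set the flag first so new writers start failing, then wait for the
 * writes already in flight.  Roll the flag back if the drain fails.
 */
static int demo_prepare_immutable(struct demo_inode *inode,
				  bool want_immutable)
{
	int ret;

	/* Nothing to do unless a regular file is becoming immutable. */
	if (!inode->is_regular || (inode->flags & DEMO_S_IMMUTABLE) ||
	    !want_immutable)
		return 0;

	inode->flags |= DEMO_S_IMMUTABLE;
	ret = demo_drain_writes(inode);
	if (ret)
		inode->flags &= ~DEMO_S_IMMUTABLE;
	return ret;
}
```

As the commit message says, this only works if the caller already holds whatever locks keep new writes from starting; the sketch does not model locking at all.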

[PATCH 2/5] vfs: create a generic checking function for FS_IOC_FSSETXATTR

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

Create a generic checking function for the incoming FS_IOC_FSSETXATTR
fsxattr values so that we can standardize some of the implementation
behaviors.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara 
---
 fs/btrfs/ioctl.c   |   17 +
 fs/ext4/ioctl.c|   25 +--
 fs/f2fs/file.c |   27 ++--
 fs/inode.c |   23 +
 fs/xfs/xfs_ioctl.c |   69 ++--
 include/linux/fs.h |9 +++
 6 files changed, 113 insertions(+), 57 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d3d9b4abb09b..3cd66efdb99d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -375,9 +375,7 @@ static int btrfs_ioctl_fsgetxattr(struct file *file, void 
__user *arg)
struct btrfs_inode *binode = BTRFS_I(file_inode(file));
struct fsxattr fa;
 
-   memset(&fa, 0, sizeof(fa));
-   fa.fsx_xflags = btrfs_inode_flags_to_xflags(binode->flags);
-
+   simple_fill_fsxattr(&fa, btrfs_inode_flags_to_xflags(binode->flags));
if (copy_to_user(arg, &fa, sizeof(fa)))
return -EFAULT;
 
@@ -390,7 +388,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void 
__user *arg)
struct btrfs_inode *binode = BTRFS_I(inode);
struct btrfs_root *root = binode->root;
struct btrfs_trans_handle *trans;
-   struct fsxattr fa;
+   struct fsxattr fa, old_fa;
unsigned old_flags;
unsigned old_i_flags;
int ret = 0;
@@ -401,7 +399,6 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void 
__user *arg)
if (btrfs_root_readonly(root))
return -EROFS;
 
-   memset(&fa, 0, sizeof(fa));
if (copy_from_user(&fa, arg, sizeof(fa)))
return -EFAULT;
 
@@ -421,13 +418,11 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void 
__user *arg)
old_flags = binode->flags;
old_i_flags = inode->i_flags;
 
-   /* We need the capabilities to change append-only or immutable inode */
-   if (((old_flags & (BTRFS_INODE_APPEND | BTRFS_INODE_IMMUTABLE)) ||
-(fa.fsx_xflags & (FS_XFLAG_APPEND | FS_XFLAG_IMMUTABLE))) &&
-   !capable(CAP_LINUX_IMMUTABLE)) {
-   ret = -EPERM;
+   simple_fill_fsxattr(&old_fa,
+   btrfs_inode_flags_to_xflags(binode->flags));
+   ret = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (ret)
goto out_unlock;
-   }
 
if (fa.fsx_xflags & FS_XFLAG_SYNC)
binode->flags |= BTRFS_INODE_SYNC;
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 272b6e44191b..1974cb755d09 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -721,6 +721,17 @@ static int ext4_ioctl_check_project(struct inode *inode, 
struct fsxattr *fa)
return 0;
 }
 
+static void ext4_fill_fsxattr(struct inode *inode, struct fsxattr *fa)
+{
+   struct ext4_inode_info *ei = EXT4_I(inode);
+
+   simple_fill_fsxattr(fa, ext4_iflags_to_xflags(ei->i_flags &
+ EXT4_FL_USER_VISIBLE));
+
+   if (ext4_has_feature_project(inode->i_sb))
+   fa->fsx_projid = from_kprojid(&init_user_ns, ei->i_projid);
+}
+
 long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
struct inode *inode = file_inode(filp);
@@ -1089,13 +1100,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
{
struct fsxattr fa;
 
-   memset(&fa, 0, sizeof(struct fsxattr));
-   fa.fsx_xflags = ext4_iflags_to_xflags(ei->i_flags & 
EXT4_FL_USER_VISIBLE);
-
-   if (ext4_has_feature_project(inode->i_sb)) {
-   fa.fsx_projid = (__u32)from_kprojid(&init_user_ns,
-   EXT4_I(inode)->i_projid);
-   }
+   ext4_fill_fsxattr(inode, &fa);
 
if (copy_to_user((struct fsxattr __user *)arg,
 &fa, sizeof(fa)))
@@ -1104,7 +1109,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
}
case EXT4_IOC_FSSETXATTR:
{
-   struct fsxattr fa;
+   struct fsxattr fa, old_fa;
int err;
 
if (copy_from_user(&fa, (struct fsxattr __user *)arg,
@@ -1127,7 +1132,11 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
return err;
 
inode_lock(inode);
+   ext4_fill_fsxattr(inode, &old_fa);
err = ext4_ioctl_check_project(inode, &fa);
+   if (err)
+   goto out;
+   err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
if (err)

[PATCH 3/5] vfs: teach vfs_ioc_fssetxattr_check to check project id info

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

Standardize the project id checks for FSSETXATTR.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara 
---
 fs/ext4/ioctl.c|   27 ---
 fs/f2fs/file.c |   27 ---
 fs/inode.c |   13 +
 fs/xfs/xfs_ioctl.c |   15 ---
 4 files changed, 13 insertions(+), 69 deletions(-)


diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 1974cb755d09..566dfac28b3f 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -697,30 +697,6 @@ static long ext4_ioctl_group_add(struct file *file,
return err;
 }
 
-static int ext4_ioctl_check_project(struct inode *inode, struct fsxattr *fa)
-{
-   /*
-* Project Quota ID state is only allowed to change from within the init
-* namespace. Enforce that restriction only if we are trying to change
-* the quota ID state. Everything else is allowed in user namespaces.
-*/
-   if (current_user_ns() == &init_user_ns)
-   return 0;
-
-   if (__kprojid_val(EXT4_I(inode)->i_projid) != fa->fsx_projid)
-   return -EINVAL;
-
-   if (ext4_test_inode_flag(inode, EXT4_INODE_PROJINHERIT)) {
-   if (!(fa->fsx_xflags & FS_XFLAG_PROJINHERIT))
-   return -EINVAL;
-   } else {
-   if (fa->fsx_xflags & FS_XFLAG_PROJINHERIT)
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
 static void ext4_fill_fsxattr(struct inode *inode, struct fsxattr *fa)
 {
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -1133,9 +1109,6 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
 
inode_lock(inode);
ext4_fill_fsxattr(inode, &old_fa);
-   err = ext4_ioctl_check_project(inode, &fa);
-   if (err)
-   goto out;
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
if (err)
goto out;
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 8da95b84520c..8799468724f9 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2796,30 +2796,6 @@ static int f2fs_ioc_fsgetxattr(struct file *filp, 
unsigned long arg)
return 0;
 }
 
-static int f2fs_ioctl_check_project(struct inode *inode, struct fsxattr *fa)
-{
-   /*
-* Project Quota ID state is only allowed to change from within the init
-* namespace. Enforce that restriction only if we are trying to change
-* the quota ID state. Everything else is allowed in user namespaces.
-*/
-   if (current_user_ns() == &init_user_ns)
-   return 0;
-
-   if (__kprojid_val(F2FS_I(inode)->i_projid) != fa->fsx_projid)
-   return -EINVAL;
-
-   if (F2FS_I(inode)->i_flags & F2FS_PROJINHERIT_FL) {
-   if (!(fa->fsx_xflags & FS_XFLAG_PROJINHERIT))
-   return -EINVAL;
-   } else {
-   if (fa->fsx_xflags & FS_XFLAG_PROJINHERIT)
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
 static int f2fs_ioc_fssetxattr(struct file *filp, unsigned long arg)
 {
struct inode *inode = file_inode(filp);
@@ -2847,9 +2823,6 @@ static int f2fs_ioc_fssetxattr(struct file *filp, 
unsigned long arg)
return err;
 
inode_lock(inode);
-   err = f2fs_ioctl_check_project(inode, &fa);
-   if (err)
-   goto out;
 
f2fs_fill_fsxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
diff --git a/fs/inode.c b/fs/inode.c
index fdd6c5d3e48d..c4f8fb16f633 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2234,6 +2234,19 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
+   /*
+* Project Quota ID state is only allowed to change from within the init
+* namespace. Enforce that restriction only if we are trying to change
+* the quota ID state. Everything else is allowed in user namespaces.
+*/
+   if (current_user_ns() != &init_user_ns) {
+   if (old_fa->fsx_projid != fa->fsx_projid)
+   return -EINVAL;
+   if ((old_fa->fsx_xflags ^ fa->fsx_xflags) &
+   FS_XFLAG_PROJINHERIT)
+   return -EINVAL;
+   }
+
return 0;
 }
 EXPORT_SYMBOL(vfs_ioc_fssetxattr_check);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 458a7043b4d2..f494c01342c6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1298,21 +1298,6 @@ xfs_ioctl_setattr_check_projid(
if (fa->fsx_projid > (uint16_t)-1 &&
!xfs_sb_version_hasprojid32bit(&ip->i_mount->m_sb))
return -EINVAL;
-
-   /*
-* Project Qu

[PATCH 2/4] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
need to ensure that userspace can't continue to write the file after the
file becomes immutable.  To make that happen, we have to flush all the
dirty pagecache pages to disk to ensure that we can fail a page fault on
a mmap'd region, wait for pending directio to complete, and hope the
caller locked out any new writes by holding the inode lock.

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   21 +++--
 include/linux/fs.h |   11 +++
 2 files changed, 30 insertions(+), 2 deletions(-)


diff --git a/fs/inode.c b/fs/inode.c
index f08711b34341..65a412af3ffb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2193,7 +2193,8 @@ EXPORT_SYMBOL(current_time);
 
 /*
  * Generic function to check FS_IOC_SETFLAGS values and reject any invalid
- * configurations.
+ * configurations.  Once we're done, prepare the inode for whatever changes
+ * are coming down the pipeline.
  *
  * Note: the caller should be holding i_mutex, or else be sure that they have
  * exclusive access to the inode structure.
@@ -2201,6 +2202,8 @@ EXPORT_SYMBOL(current_time);
 int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags,
 unsigned int flags)
 {
+   int ret;
+
/*
 * The IMMUTABLE and APPEND_ONLY flags can only be changed by
 * the relevant capability.
@@ -2211,7 +2214,21 @@ int vfs_ioc_setflags_prepare(struct inode *inode, 
unsigned int oldflags,
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
-   return 0;
+   /*
+* Now that we're done checking the new flags, flush all pending IO and
+* dirty mappings before setting S_IMMUTABLE on an inode via
+* FS_IOC_SETFLAGS.  If the flush fails we'll clear the flag before
+* returning error.
+*/
+   if (!S_ISREG(inode->i_mode) || IS_IMMUTABLE(inode) ||
+   !(flags & FS_IMMUTABLE_FL))
+   return 0;
+
+   inode_set_flags(inode, S_IMMUTABLE, S_IMMUTABLE);
+   ret = inode_drain_writes(inode);
+   if (ret)
+   inode_set_flags(inode, 0, S_IMMUTABLE);
+   return ret;
 }
 EXPORT_SYMBOL(vfs_ioc_setflags_prepare);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 91482ab4556a..0efe749de577 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3567,4 +3567,15 @@ static inline void simple_fill_fsxattr(struct fsxattr 
*fa, __u32 xflags)
fa->fsx_xflags = xflags;
 }
 
+/*
+ * Flush file data before changing attributes.  Caller must hold any locks
+ * required to prevent further writes to this file until we're done setting
+ * flags.
+ */
+static inline int inode_drain_writes(struct inode *inode)
+{
+   inode_dio_wait(inode);
+   return filemap_write_and_wait(inode->i_mapping);
+}
+
 #endif /* _LINUX_FS_H */

[PATCH 4/4] vfs: don't allow most setxattr to immutable files

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

However, we don't actually check the immutable flag in the setattr code,
which means that we can update inode flags and project ids and extent
size hints on supposedly immutable files.  Therefore, reject setflags
and fssetxattr calls on an immutable file if the file is immutable and
will remain that way.

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   27 +++
 1 file changed, 27 insertions(+)


diff --git a/fs/inode.c b/fs/inode.c
index cf07378e5731..4261c709e50e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2214,6 +2214,14 @@ int vfs_ioc_setflags_prepare(struct inode *inode, 
unsigned int oldflags,
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
+   /*
+* We aren't allowed to change any other flags if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((oldflags & FS_IMMUTABLE_FL) && (flags & FS_IMMUTABLE_FL) &&
+   oldflags != flags)
+   return -EPERM;
+
/*
 * Now that we're done checking the new flags, flush all pending IO and
 * dirty mappings before setting S_IMMUTABLE on an inode via
@@ -2284,6 +2292,25 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
 
+   /*
+* We aren't allowed to change any fields if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
+   (fa->fsx_xflags & FS_XFLAG_IMMUTABLE)) {
+   if (old_fa->fsx_xflags != fa->fsx_xflags)
+   return -EPERM;
+   if (old_fa->fsx_projid != fa->fsx_projid)
+   return -EPERM;
+   if ((fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+  FS_XFLAG_EXTSZINHERIT)) &&
+   old_fa->fsx_extsize != fa->fsx_extsize)
+   return -EPERM;
+   if ((old_fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
+   old_fa->fsx_cowextsize != fa->fsx_cowextsize)
+   return -EPERM;
+   }
+
/* Extent size hints of zero turn off the flags. */
if (fa->fsx_extsize == 0)
fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
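
For readers who want to poke at the SETFLAGS rule outside the kernel, here is a minimal userspace sketch of the "immutable and staying immutable" check from the first hunk. It is only a model: MODEL_IMMUTABLE_FL and model_setflags_check() are names made up for this illustration, the capability checks are left out, and it is not the code in fs/inode.c.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for FS_IMMUTABLE_FL; the value only matters inside this model. */
#define MODEL_IMMUTABLE_FL 0x10u

/*
 * Reject any flag change when the immutable flag is already set and the
 * caller is not clearing it.  Returns 0 if the change is allowed.
 */
int model_setflags_check(unsigned int oldflags, unsigned int flags)
{
	if ((oldflags & MODEL_IMMUTABLE_FL) && (flags & MODEL_IMMUTABLE_FL) &&
	    oldflags != flags)
		return -EPERM;
	return 0;
}
```

Note that a request which clears the immutable flag is still allowed to change other flags in the same call, because the second condition no longer holds.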

[PATCH 1/4] mm/fs: don't allow writes to immutable files

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Once the flag is set, it is enforced for quite a few file operations,
such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
don't check for immutability when doing a write(), a PROT_WRITE mmap(),
a truncate(), or a write to a previously established mmap.

If a program has an open write fd to a file that the administrator
subsequently marks immutable, the program still can change the file
contents.  Weird!

The ability to write to an immutable file does not follow the manpage
promise that immutable files cannot be modified.  Worse yet it's
inconsistent with the behavior of other syscalls which don't allow
modifications of immutable files.

Therefore, add the necessary checks to make the write, mmap, and
truncate behavior consistent with what the manpage says and consistent
with other syscalls on filesystems which support IMMUTABLE.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara 
---
 fs/attr.c|   13 ++---
 mm/filemap.c |3 +++
 mm/memory.c  |4 
 mm/mmap.c|8 ++--
 4 files changed, 19 insertions(+), 9 deletions(-)


diff --git a/fs/attr.c b/fs/attr.c
index d22e8187477f..1fcfdcc5b367 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -233,19 +233,18 @@ int notify_change(struct dentry * dentry, struct iattr * 
attr, struct inode **de
 
WARN_ON_ONCE(!inode_is_locked(inode));
 
-   if (ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) {
-   if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   return -EPERM;
-   }
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
+   if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
+   IS_APPEND(inode))
+   return -EPERM;
 
/*
 * If utimes(2) and friends are called with times == NULL (or both
 * times are UTIME_NOW), then we need to check for write permission
 */
if (ia_valid & ATTR_TOUCH) {
-   if (IS_IMMUTABLE(inode))
-   return -EPERM;
-
if (!inode_owner_or_capable(inode)) {
error = inode_permission(inode, MAY_WRITE);
if (error)
diff --git a/mm/filemap.c b/mm/filemap.c
index aac71aef4c61..dad85e10f5f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2935,6 +2935,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
loff_t count;
int ret;
 
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
if (!iov_iter_count(from))
return 0;
 
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..abf795277f36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2235,6 +2235,10 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
+   if (vmf->vma->vm_file &&
+   IS_IMMUTABLE(vmf->vma->vm_file->f_mapping->host))
+   return VM_FAULT_SIGBUS;
+
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
/* Restore original flags so that caller is not surprised */
vmf->flags = old_flags;
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..b3ebca2702bf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1483,8 +1483,12 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
-   if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
-   return -EACCES;
+   if (prot & PROT_WRITE) {
+   if (!(file->f_mode & FMODE_WRITE))
+   return -EACCES;
+   if (IS_IMMUTABLE(file->f_mapping->host))
+   return -EPERM;
+   }
 
/*
 * Make sure we don't allow writing to an append-only

[PATCH 1/5] vfs: create a generic checking and prep function for FS_IOC_SETFLAGS

2019-06-28 Thread Darrick J. Wong

From: Darrick J. Wong 

Create a generic function to check incoming FS_IOC_SETFLAGS flag values
and later prepare the inode for updates so that we can standardize the
implementations that follow ext4's flag values.

Note that the efivarfs implementation no longer fails a no-op SETFLAGS
without CAP_LINUX_IMMUTABLE since that's the behavior in ext*.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Acked-by: David Sterba 
Reviewed-by: Bob Peterson 
---
 fs/btrfs/ioctl.c|   13 +
 fs/efivarfs/file.c  |   26 +-
 fs/ext2/ioctl.c |   16 
 fs/ext4/ioctl.c |   13 +++--
 fs/f2fs/file.c  |7 ---
 fs/gfs2/file.c  |   42 +-
 fs/hfsplus/ioctl.c  |   21 -
 fs/inode.c  |   24 
 fs/jfs/ioctl.c  |   22 +++---
 fs/nilfs2/ioctl.c   |9 ++---
 fs/ocfs2/ioctl.c|   13 +++--
 fs/orangefs/file.c  |   35 ++-
 fs/reiserfs/ioctl.c |   10 --
 fs/ubifs/ioctl.c|   13 +++--
 include/linux/fs.h  |3 +++
 15 files changed, 146 insertions(+), 121 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6dafa857bbb9..d3d9b4abb09b 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -187,7 +187,7 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
struct btrfs_inode *binode = BTRFS_I(inode);
struct btrfs_root *root = binode->root;
struct btrfs_trans_handle *trans;
-   unsigned int fsflags;
+   unsigned int fsflags, old_fsflags;
int ret;
const char *comp = NULL;
u32 binode_flags = binode->flags;
@@ -212,13 +212,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode_lock(inode);
 
fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
-   if ((fsflags ^ btrfs_inode_flags_to_fsflags(binode->flags)) &
-   (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
-   if (!capable(CAP_LINUX_IMMUTABLE)) {
-   ret = -EPERM;
-   goto out_unlock;
-   }
-   }
+   old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
+   ret = vfs_ioc_setflags_prepare(inode, old_fsflags, fsflags);
+   if (ret)
+   goto out_unlock;
 
if (fsflags & FS_SYNC_FL)
binode_flags |= BTRFS_INODE_SYNC;
diff --git a/fs/efivarfs/file.c b/fs/efivarfs/file.c
index 8e568428c88b..a3cc10b1bfe1 100644
--- a/fs/efivarfs/file.c
+++ b/fs/efivarfs/file.c
@@ -110,16 +110,22 @@ static ssize_t efivarfs_file_read(struct file *file, char 
__user *userbuf,
return size;
 }
 
-static int
-efivarfs_ioc_getxflags(struct file *file, void __user *arg)
+static inline unsigned int efivarfs_getflags(struct inode *inode)
 {
-   struct inode *inode = file->f_mapping->host;
unsigned int i_flags;
unsigned int flags = 0;
 
i_flags = inode->i_flags;
if (i_flags & S_IMMUTABLE)
flags |= FS_IMMUTABLE_FL;
+   return flags;
+}
+
+static int
+efivarfs_ioc_getxflags(struct file *file, void __user *arg)
+{
+   struct inode *inode = file->f_mapping->host;
+   unsigned int flags = efivarfs_getflags(inode);
 
if (copy_to_user(arg, &flags, sizeof(flags)))
return -EFAULT;
@@ -132,6 +138,7 @@ efivarfs_ioc_setxflags(struct file *file, void __user *arg)
struct inode *inode = file->f_mapping->host;
unsigned int flags;
unsigned int i_flags = 0;
+   unsigned int oldflags = efivarfs_getflags(inode);
int error;
 
if (!inode_owner_or_capable(inode))
@@ -143,9 +150,6 @@ efivarfs_ioc_setxflags(struct file *file, void __user *arg)
if (flags & ~FS_IMMUTABLE_FL)
return -EOPNOTSUPP;
 
-   if (!capable(CAP_LINUX_IMMUTABLE))
-   return -EPERM;
-
if (flags & FS_IMMUTABLE_FL)
i_flags |= S_IMMUTABLE;
 
@@ -155,12 +159,16 @@ efivarfs_ioc_setxflags(struct file *file, void __user 
*arg)
return error;
 
inode_lock(inode);
+
+   error = vfs_ioc_setflags_prepare(inode, oldflags, flags);
+   if (error)
+   goto out;
+
inode_set_flags(inode, i_flags, S_IMMUTABLE);
+out:
inode_unlock(inode);
-
mnt_drop_write_file(file);
-
-   return 0;
+   return error;
 }
 
 static long
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 0367c0039e68..1b853fb0b163 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -60,18 +60,10 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
}
oldflags = ei->i_flags;
 
-   /*
-* The IMMUTABLE and APPEND_ONLY flags can only be changed by
-

[PATCH v6 0/4] vfs: make immutable files actually immutable

2019-06-28 Thread Darrick J. Wong

Hi all,

The chattr(1) manpage has this to say about the immutable bit that
system administrators can set on files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Given the clause about how the file 'cannot be modified', it is
surprising that programs holding writable file descriptors can continue
to write to and truncate files after the immutable flag has been set,
but they cannot call other things such as utimes, fallocate, unlink,
link, setxattr, or reflink.

Since the immutable flag is only settable by administrators, resolve
this inconsistent behavior in favor of the documented behavior -- once
the flag is set, the file cannot be modified, period. We presume that
administrators must be trusted to know what they're doing, and that
cutting off programs with writable fds will probably break them.

Therefore, add immutability checks to the relevant VFS functions, then
refactor the SETFLAGS and FSSETXATTR implementations to use common
argument checking functions so that we can then force pagefaults on all
the file data when setting immutability.

Note that various distro manpages point out the inconsistent behavior
of the various Linux filesystems w.r.t. immutable. This fixes all that.

I also discovered that userspace programs can write and create writable
memory mappings to active swap files. This is extremely bad because
this allows anyone with write privileges to corrupt system memory. The
final patch in this series closes off that hole, at least for swap
files.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=immutable-files

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=immutable-files

Re: [PATCH] generic: test cloning large exents to a file with many small extents

2019-06-27 Thread Darrick J. Wong

On Thu, Jun 27, 2019 at 10:47:31PM +0100, Filipe Manana wrote:
> On Thu, Jun 27, 2019 at 9:28 PM Darrick J. Wong  
> wrote:
> >
> > On Thu, Jun 27, 2019 at 06:00:30PM +0100, fdman...@kernel.org wrote:
> > > From: Filipe Manana 
> > >
> > > Test that if we clone a file with some large extents into a file that has
> > > many small extents, when the fs is nearly full, the clone operation does
> > > not fail and produces the correct result.
> > >
> > > > This is motivated by a bug found in btrfs which is fixed by the following
> > > patches for the linux kernel:
> > >
> > >  [PATCH 1/2] Btrfs: factor out extent dropping code from hole punch 
> > > handler
> > >  [PATCH 2/2] Btrfs: fix ENOSPC errors, leading to transaction aborts, when
> > >  cloning extents
> > >
> > > The test currently passes on xfs.
> > >
> > > Signed-off-by: Filipe Manana 
> > > ---
> > >  tests/generic/558 | 75 
> > > +++
> > >  tests/generic/558.out |  5 
> > >  tests/generic/group   |  1 +
> > >  3 files changed, 81 insertions(+)
> > >  create mode 100755 tests/generic/558
> > >  create mode 100644 tests/generic/558.out
> > >
> > > diff --git a/tests/generic/558 b/tests/generic/558
> > > new file mode 100755
> > > index ..ee16cdf7
> > > --- /dev/null
> > > +++ b/tests/generic/558
> > > @@ -0,0 +1,75 @@
> > > +#! /bin/bash
> > > +# SPDX-License-Identifier: GPL-2.0
> > > +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> > > +#
> > > +# FSQA Test No. 558
> > > +#
> > > +# Test that if we clone a file with some large extents into a file that 
> > > has
> > > +# many small extents, when the fs is nearly full, the clone operation 
> > > does
> > > +# not fail and produces the correct result.
> > > +#
> > > +seq=`basename $0`
> > > +seqres=$RESULT_DIR/$seq
> > > +echo "QA output created by $seq"
> > > +tmp=/tmp/$$
> > > +status=1 # failure is the default!
> > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > +
> > > +_cleanup()
> > > +{
> > > + cd /
> > > + rm -f $tmp.*
> > > +}
> > > +
> > > +# get standard environment, filters and checks
> > > +. ./common/rc
> > > +. ./common/filter
> > > +. ./common/reflink
> > > +
> > > +# real QA test starts here
> > > +_supported_fs generic
> > > +_supported_os Linux
> > > +_require_scratch_reflink
> > > +
> > > +rm -f $seqres.full
> > > +
> > > +_scratch_mkfs_sized $((512 * 1024 * 1024)) >>$seqres.full 2>&1
> > > +_scratch_mount
> > > +
> > > +file_size=$(( 128 * 1024 * 1024 )) # 128Mb
> > > +extent_size=4096
> >
> > What if the fs block size is 64k?
> 
> Then we get extents of 64Kb instead of 4Kb. Works on btrfs. Is it a
> problem for xfs (or any other fs)?

It shouldn't be; I was merely wondering if 2048 extents was enough to
trigger the enospc on btrfs.
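
For reference, the 2048 figure is just the test's own arithmetic with a 64k extent size plugged in. A small sketch, assuming the file size stays at 128MiB as in the test (num_extents() is a hypothetical helper, not something in fstests):

```c
#include <assert.h>
#include <stdint.h>

/* Number of extents the test's write loop creates for a given extent size. */
uint64_t num_extents(uint64_t file_size, uint64_t extent_size)
{
	assert(extent_size > 0);
	return file_size / extent_size;
}
```

With 4k extents this gives 32768 extents; with 64k extents it drops to 2048.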

> >
> > > +num_extents=$(( $file_size / $extent_size ))
> > > +
> > > +# Create a file with many small extents.
> > > +for ((i = 0; i < $num_extents; i++)); do
> > > + offset=$(( $i * $extent_size ))
> > > + $XFS_IO_PROG -f -s -c "pwrite -S 0xe5 $offset $extent_size" \
> > > + $SCRATCH_MNT/foo >>/dev/null
> > > +done
> >
> > I wouldn't have thought that this would actually succeed on xfs because
> > you could lay extents down one after the other, but then started seeing
> > this in the filefrag output:
> >
> > File size of /opt/foo is 528384 (129 blocks of 4096 bytes)
> >  ext: logical_offset:physical_offset: length:   expected: flags:
> >3:   17..  17: 52..52:  1: 37:
> >4:   18..  18: 67..67:  1: 53:
> >5:   19..  19: 81..81:  1: 68:
> >6:   20..  20: 94..94:  1: 82:
> >7:   21..  21:106..   106:  1: 95:
> >8:   22..  22:117..   117:  1:107:
> >9:   23..  23:127..   127:  1:118:
> >   10:   24..  24:136..

Re: [PATCH] generic: test cloning large exents to a file with many small extents

2019-06-27 Thread Darrick J. Wong

On Thu, Jun 27, 2019 at 06:00:30PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Test that if we clone a file with some large extents into a file that has
> many small extents, when the fs is nearly full, the clone operation does
> not fail and produces the correct result.
> 
> This is motivated by a bug found in btrfs which is fixed by the following
> patches for the linux kernel:
> 
>  [PATCH 1/2] Btrfs: factor out extent dropping code from hole punch handler
>  [PATCH 2/2] Btrfs: fix ENOSPC errors, leading to transaction aborts, when
>  cloning extents
> 
> The test currently passes on xfs.
> 
> Signed-off-by: Filipe Manana 
> ---
>  tests/generic/558 | 75 
> +++
>  tests/generic/558.out |  5 
>  tests/generic/group   |  1 +
>  3 files changed, 81 insertions(+)
>  create mode 100755 tests/generic/558
>  create mode 100644 tests/generic/558.out
> 
> diff --git a/tests/generic/558 b/tests/generic/558
> new file mode 100755
> index ..ee16cdf7
> --- /dev/null
> +++ b/tests/generic/558
> @@ -0,0 +1,75 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (C) 2019 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# FSQA Test No. 558
> +#
> +# Test that if we clone a file with some large extents into a file that has
> +# many small extents, when the fs is nearly full, the clone operation does
> +# not fail and produces the correct result.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch_reflink
> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs_sized $((512 * 1024 * 1024)) >>$seqres.full 2>&1
> +_scratch_mount
> +
> +file_size=$(( 128 * 1024 * 1024 )) # 128Mb
> +extent_size=4096

What if the fs block size is 64k?

> +num_extents=$(( $file_size / $extent_size ))
> +
> +# Create a file with many small extents.
> +for ((i = 0; i < $num_extents; i++)); do
> + offset=$(( $i * $extent_size ))
> + $XFS_IO_PROG -f -s -c "pwrite -S 0xe5 $offset $extent_size" \
> + $SCRATCH_MNT/foo >>/dev/null
> +done

I wouldn't have thought that this would actually succeed on xfs because
you could lay extents down one after the other, but then started seeing
this in the filefrag output:

File size of /opt/foo is 528384 (129 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   3:   17..  17: 52..52:  1: 37:
   4:   18..  18: 67..67:  1: 53:
   5:   19..  19: 81..81:  1: 68:
   6:   20..  20: 94..94:  1: 82:
   7:   21..  21:106..   106:  1: 95:
   8:   22..  22:117..   117:  1:107:
   9:   23..  23:127..   127:  1:118:
  10:   24..  24:136..   136:  1:128:
  11:   25..  25:144..   144:  1:137:
  12:   26..  26:151..   151:  1:145:
  13:   27..  27:157..   157:  1:152:
  14:   28..  28:162..   162:  1:158:
  15:   29..  29:166..   166:  1:163:
  16:   30..  30:169..   169:  1:167:
  17:   31..  32:171..   172:  2:170:
  18:   33..  33:188..   188:  1:173:

52, 67, 81, 94, 106, 117, 127, 136, 144, 151, 157, 162, 166, 169, 171...

 +15 +14 +13 +12  +11  +10   +9   +8  +7   +6   +5   +4   +3   +2...

Hm, I wonder what quirk of the xfs allocator this might be?
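
A quick way to check that the pattern really holds for the physical offsets in the filefrag output above is something like the following sketch. gaps_shrink_by_one() is a throwaway helper invented for this illustration; it says nothing about why the allocator behaves this way.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if each gap between consecutive physical block numbers is exactly
 * one block smaller than the previous gap, 0 otherwise.
 */
int gaps_shrink_by_one(const unsigned long *phys, size_t n)
{
	size_t i;

	assert(phys != NULL);
	for (i = 2; i < n; i++) {
		unsigned long prev_gap = phys[i - 1] - phys[i - 2];
		unsigned long gap = phys[i] - phys[i - 1];

		if (gap + 1 != prev_gap)
			return 0;
	}
	return 1;
}
```

Feeding it the start blocks of extents 3 through 17 from the listing should return 1.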

> +
> +# Create file bar with the same size that file foo has but with large 
> extents.
> +$XFS_IO_PROG -f -c "pwrite -S 0xc7 -b $file_size 0 $file_size" \
> + $SCRATCH_MNT/bar >>/dev/null
> +
> +# Fill the fs (for btrfs we are interested in filling all unallocated space
> +# and most of the existing metadata block group(s), so that after this there
> +# will be no unallocated space and metadata space will be mostly full but 
> with
> +# more than enough free space for the clone operation below to succeed).
> +i=1
> +while true; do
> + $XFS_IO_PROG -f -c "pwrite 0 2K" $SCRATCH_MNT/filler_$i &> /dev/null
> + [ $? -ne 0 ] && break
> + i=$(( i + 1 ))
> +done

_fill_fs?

> +
> +# Now clone file bar into file foo. This is supposed to succeed and not fail
> +# with ENOSPC for example.
> +$XFS_IO_PROG -c "reflink $SCRATCH_MNT/bar" $SCRATCH_MNT/foo >>/dev/null

_reflink

Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O

2019-06-26 Thread Darrick J. Wong

On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> On  9:07 24/06, Christoph Hellwig wrote:
> > xfs will need to be updated to fill in the additional iomap for the
> > COW case.  Has this series been tested on xfs?
> > 
> 
> No, I have not tested this, or made xfs set IOMAP_COW. I will try to do
> it in the next iteration.

AFAICT even if you did absolutely nothing XFS would continue to work
properly because iomap_write_begin doesn't actually care if it's going
to be a COW write because the only IO it does from the mapping is to
read in the non-uptodate parts of the page if the write offset/len
aren't page-aligned.

> > I can't say I'm a huge fan of this two iomaps in one method call
> > approach.  I always thought two separate iomap iterations would be nicer,
> > but compared to that even the older hack with just the additional
> > src_addr seems a little better.
> 
> I am just expanding on your idea of using multiple iterations for the Cow case
> in the hope we can come out of a good design:
> 
> 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
>which calls iomap_begin() for the respective filesystem.
> 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
>struct with read addr information.
> 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
>and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).

Unless I'm misreading this, you don't need a do_cow() or
IOMAP_COW_READ_DONE because the page state tracks that for you:

iomap_write_begin calls ->iomap_begin to learn from where it should read
data if the write is not aligned to a page and the page isn't uptodate.
If it's IOMAP_COW then we learn from *srcmap instead of *iomap.

(The write actor then dirties the page)

fsync() or whatever

The mm calls ->writepage.  The filesystem grabs the new COW mapping,
constructs a bio with the new mapping and dirty pages, and submits the
bio.  pagesize >= blocksize so we're always writing full blocks.

The writeback bio completes and calls ->bio_endio, which is the
filesystem's trigger to make the mapping changes permanent, update
ondisk file size, etc.

For direct writes that are not block-aligned, we just bounce the write
to the page cache...

...so it's only dax_iomap_rw where we're going to have to do the COW
ourselves.  That's simple -- map both addresses, copy the regions before
offset and after offset+len, then proceed with writing whatever
userspace sent us.  No need for the iomap code itself to get involved.
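
To make the dax case concrete, here is a rough userspace model of that copy step for a single block. It is a sketch only: cow_block_write() is a made-up name, plain memory buffers stand in for the two mapped addresses, and the real dax path would have to deal with mappings that span more than one block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of a copy-on-write block update: src is the existing
 * (shared) block, dst is the newly allocated block, and the caller wants to
 * write len bytes from buf at byte offset off within the block.
 */
void cow_block_write(unsigned char *dst, const unsigned char *src,
		     size_t blocksize, size_t off, size_t len,
		     const unsigned char *buf)
{
	assert(off <= blocksize && len <= blocksize - off);

	/* Preserve the old data before the write range. */
	memcpy(dst, src, off);
	/* Preserve the old data after the write range. */
	memcpy(dst + off + len, src + off + len, blocksize - off - len);
	/* Now lay down the new data. */
	memcpy(dst + off, buf, len);
}
```

If the write happens to cover the whole block, both preserving copies have length zero and only the new data is written.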

> 4. btrfs_iomap_begin() fills up iomap structure with write information.
> 
> Step 3 seems out of place because iomap_apply should be iomap.type agnostic.
> Right?
> Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> this is the "real" write for iomap_begin to fill iomap?
> 
> If this is not how you imagined, could you elaborate on the dual iteration
> sequence?

--D

> 
> 
> -- 
> Goldwyn

Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O

2019-06-26 Thread Darrick J. Wong

On Wed, Jun 26, 2019 at 11:10:17AM -0500, Goldwyn Rodrigues wrote:
> On  8:39 26/06, Christoph Hellwig wrote:
> > On Tue, Jun 25, 2019 at 02:14:42PM -0500, Goldwyn Rodrigues wrote:
> > > > I can't say I'm a huge fan of this two iomaps in one method call
> > > > approach.  I always thought two separate iomap iterations would be nicer,
> > > > but compared to that even the older hack with just the additional
> > > > src_addr seems a little better.
> > > 
> > > I am just expanding on your idea of using multiple iterations for the Cow 
> > > case
> > > in the hope we can come out of a good design:
> > > 
> > > 1. iomap_file_buffered_write calls iomap_apply with IOMAP_WRITE flag.
> > >which calls iomap_begin() for the respective filesystem.
> > > 2. btrfs_iomap_begin() sets up iomap->type as IOMAP_COW and fills iomap
> > >struct with read addr information.
> > > 3. iomap_apply() conditionally for IOMAP_COW calls do_cow(new function)
> > >and calls ops->iomap_begin() with flag IOMAP_COW_READ_DONE(new flag).
> > > 4. btrfs_iomap_begin() fills up iomap structure with write information.
> > > 
> > > Step 3 seems out of place because iomap_apply should be iomap.type 
> > > agnostic.
> > > Right?
> > > Should we be adding another flag IOMAP_COW_DONE, just to figure out that
> > > this is the "real" write for iomap_begin to fill iomap?
> > > 
> > > If this is not how you imagined, could you elaborate on the dual iteration
> > > sequence?
> > 
> > Here are my thoughts from dealing with this from a while ago, all
> > XFS based of course.
> > 
> > If iomap_file_buffered_write is called on a page that is inside a COW
> > extent we have the following options:
> > 
> >  a) the page is uptodate or entirely overwritten.  We can just allocate
> > new COW blocks and return them, and we are done
> >  b) the page is not/partially uptodate and not entirely overwritten.
> > 
> > The latter case is the interesting one.  My thought was that iff the
> > IOMAP_F_SHARED flag is set __iomap_write_begin / iomap_read_page_sync
> > will then have to retrieve the source information in some form.
> > 
> > My original plan was to just do a nested iomap_apply call, which would
> > need a special nested flag to not duplicate any locking the file
> > system might be holding between ->iomap_begin and ->iomap_end.
> > 
> > The upside here is that there is no additional overhead for the non-COW
> > path and the architecture looks relatively clean.  The downside is that
> > at least for XFS we usually have to look up the source information
> > anyway before allocating the COW destination extent, so we'd have to
> > cache that information somewhere or redo it, which would be rather
> > pointless.  At that point the idea of a srcaddr in the iomap becomes
> > interesting again - while it looks a little ugly from the architectural
> > POV it actually ends up having very practical benefits.

I think it's less complicated to pass both mappings out in a single
->iomap_begin call rather than have this dance where the fs tells iomap
to call back for the read mapping and then iomap calls back for the read
mapping with a special "don't take locks" flag.

For XFS specifically this means we can serve both mappings with a single
ILOCK cycle.

> So, do we move back to the design of adding an extra field of srcaddr?

TLDR: Please no.

> Honestly, I find the design of using an extra field srcaddr in iomap better
> and simpler versus passing additional iomap srcmap or multiple iterations.

Putting my long-range planning hat on, the usage story (from the fs'
perspective) here is:

"iomap wants to know how a file write should map to a disk write.  If
we're doing a straight overwrite of disk blocks then I should send back
the relevant mapping.  Sometimes I might need the write to go to a
totally different location than where the data is currently stored, so I
need to send back a second mapping."

Because iomap is now a general-purpose API, we need to think about the
read mapping for a moment:

 - For all disk-based filesystems we'll want the source address for the
   read mapping.

 - For filesystems that support "inline data" (which really just means
   the fs maintains its own buffers to file data) we'll also need the
   inline_data pointer.

 - For filesystems that support multiple devices (like btrfs) we'll also
   need a pointer to a block_device because we could be writing to a
   different device than the one that stores the data.  The prime
   example I can think of is reading data from disk A in some RAID
   stripe and writing to disk B in a different RAID stripe to solve the
   RAID5 hole... but you could just be lazy-migrating file data to less
   full or newer drives or whatever.

 - If we ever encounter a filesystem that supports multiple dax devices
   then we'll need a pointer to the dax_device too.  (That would be
   btrfs, since I thought your larger goal was to enable dax there...)

 - We're probably going to need the ability to pass fla

Re: [PATCH 5/5] vfs: don't allow writes to swap files

2019-06-26 Thread Darrick J. Wong

On Wed, Jun 26, 2019 at 04:51:51AM +0100, Al Viro wrote:
> On Tue, Jun 25, 2019 at 07:33:31PM -0700, Darrick J. Wong wrote:
> > --- a/fs/attr.c
> > +++ b/fs/attr.c
> > @@ -236,6 +236,9 @@ int notify_change(struct dentry * dentry, struct iattr 
> > * attr, struct inode **de
> > if (IS_IMMUTABLE(inode))
> > return -EPERM;
> >  
> > +   if (IS_SWAPFILE(inode))
> > +   return -ETXTBSY;
> > +
> > if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
> > IS_APPEND(inode))
> > return -EPERM;
> 
> Er...  So why exactly is e.g. chmod(2) forbidden for swapfiles?  Or touch(1),
> for that matter...

Oops, that check is overly broad; I think the only attribute change we
need to filter here is ATTR_SIZE which we could do unconditionally
in inode_newsize_ok.
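
Roughly, the narrowed check would behave like this userspace model. This is a sketch under assumptions, not the eventual patch: the MODEL_ATTR_* bits and model_swapfile_attr_check() are invented for illustration, and where the check ends up living in the kernel is still open.

```c
#include <assert.h>
#include <errno.h>

/* Stand-ins for the kernel's ATTR_* bits; values only matter in this model. */
#define MODEL_ATTR_MODE (1u << 0)
#define MODEL_ATTR_SIZE (1u << 3)

/*
 * Only a size change is refused on an active swapfile; mode, ownership and
 * timestamp updates pass through.  Returns 0 if the change is allowed.
 */
int model_swapfile_attr_check(int is_swapfile, unsigned int ia_valid)
{
	if (is_swapfile && (ia_valid & MODEL_ATTR_SIZE))
		return -ETXTBSY;
	return 0;
}
```

With that shape, chmod and touch on a swapfile go through again while truncate still gets ETXTBSY.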

What's the use case for allowing userspace to increase the size of an
active swapfile?  I don't see any; the kernel has a permanent lease on
the file space mapping (at least until swapoff)...

> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 596ac98051c5..1ca4ee8c2d60 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3165,6 +3165,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> > if (error)
> > goto bad_swap;
> >  
> > +   /*
> > +* Flush any pending IO and dirty mappings before we start using this
> > +* swap file.
> > +*/
> > +   if (S_ISREG(inode->i_mode)) {
> > +   inode->i_flags |= S_SWAPFILE;
> > +   error = inode_drain_writes(inode);
> > +   if (error) {
> > +   inode->i_flags &= ~S_SWAPFILE;
> > +   goto bad_swap;
> > +   }
> > +   }
> 
> Why are swap partitions any less worthy of protection?

Hmm, yeah, S_SWAPFILE should apply to block devices too.  I figured that
the mantra of "sane tools will open block devices with O_EXCL" should
have sufficed, but there's really no reason to allow that either.

--D
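
To make the narrower check from the reply above concrete, here is a minimal
userspace sketch, assuming the only attribute change that needs rejecting on
an active swapfile is a size change.  The function name swapfile_attr_check
and the MODEL_ATTR_* bits are hypothetical and exist only for this model; the
real check would live in kernel code such as inode_newsize_ok.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's ATTR_* bits; values are illustrative. */
#define MODEL_ATTR_MODE (1u << 0)
#define MODEL_ATTR_SIZE (1u << 3)

/*
 * Return 0 if the attribute change may proceed, or -ETXTBSY if it would
 * resize an active swapfile.  chmod(2)/touch(1) style changes pass through.
 */
static int swapfile_attr_check(bool is_swapfile, unsigned int ia_valid)
{
	if (is_swapfile && (ia_valid & MODEL_ATTR_SIZE))
		return -ETXTBSY;
	return 0;
}
```

With this shape, a chmod on a swapfile is allowed while a truncate is not,
which addresses the objection about chmod(2) and touch(1).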

[PATCH 5/5] vfs: don't allow writes to swap files

2019-06-25 Thread Darrick J. Wong

From: Darrick J. Wong 

Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong 
---
 fs/attr.c |3 +++
 mm/filemap.c  |3 +++
 mm/memory.c   |4 +++-
 mm/mmap.c |2 ++
 mm/swapfile.c |   15 +--
 5 files changed, 24 insertions(+), 3 deletions(-)


diff --git a/fs/attr.c b/fs/attr.c
index 1fcfdcc5b367..42f4d4fb0631 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -236,6 +236,9 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
if (IS_IMMUTABLE(inode))
return -EPERM;
 
+   if (IS_SWAPFILE(inode))
+   return -ETXTBSY;
+
if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
IS_APPEND(inode))
return -EPERM;
diff --git a/mm/filemap.c b/mm/filemap.c
index dad85e10f5f8..fd80bc20e30a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2938,6 +2938,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (IS_IMMUTABLE(inode))
return -EPERM;
 
+   if (IS_SWAPFILE(inode))
+   return -ETXTBSY;
+
if (!iov_iter_count(from))
return 0;
 
diff --git a/mm/memory.c b/mm/memory.c
index 4311cfdade90..c04c6a689995 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2235,7 +2235,9 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
-   if (vmf->vma->vm_file && IS_IMMUTABLE(file_inode(vmf->vma->vm_file)))
+   if (vmf->vma->vm_file &&
+   (IS_IMMUTABLE(file_inode(vmf->vma->vm_file)) ||
+IS_SWAPFILE(file_inode(vmf->vma->vm_file))))
return VM_FAULT_SIGBUS;
 
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index ac1e32205237..031807339869 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1488,6 +1488,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return -EACCES;
if (IS_IMMUTABLE(file_inode(file)))
return -EPERM;
+   if (IS_SWAPFILE(file_inode(file)))
+   return -ETXTBSY;
}
 
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596ac98051c5..1ca4ee8c2d60 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3165,6 +3165,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap;
 
+   /*
+* Flush any pending IO and dirty mappings before we start using this
+* swap file.
+*/
+   if (S_ISREG(inode->i_mode)) {
+   inode->i_flags |= S_SWAPFILE;
+   error = inode_drain_writes(inode);
+   if (error) {
+   inode->i_flags &= ~S_SWAPFILE;
+   goto bad_swap;
+   }
+   }
+
mutex_lock(&swapon_mutex);
prio = -1;
if (swap_flags & SWAP_FLAG_PREFER)
@@ -3185,8 +3198,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
atomic_inc(&proc_poll_event);
wake_up_interruptible(&proc_poll_wait);
 
-   if (S_ISREG(inode->i_mode))
-   inode->i_flags |= S_SWAPFILE;
error = 0;
goto out;
 bad_swap:

[PATCH 3/5] vfs: flush and wait for io when setting the immutable flag via FSSETXATTR

2019-06-25 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_FSSETXATTR to set the immutable flag on a file,
we need to ensure that userspace can't continue to write the file after
the file becomes immutable.  To make that happen, we have to flush all
the dirty pagecache pages to disk to ensure that we can fail a page
fault on a mmap'd region, wait for pending directio to complete, and
hope the caller locked out any new writes by holding the inode lock.

XFS has more complex locking than other FSSETXATTR implementations so we
have to keep the checking and preparation code in different functions.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c   |2 +
 fs/ext4/ioctl.c|2 +
 fs/f2fs/file.c |2 +
 fs/inode.c |   31 +++
 fs/xfs/xfs_ioctl.c |   71 +++-
 include/linux/fs.h |3 ++
 6 files changed, 90 insertions(+), 21 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0f5af7c5f66b..bbd6d908900e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -423,7 +423,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg)
old_flags = binode->flags;
old_i_flags = inode->i_flags;
 
-   ret = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   ret = vfs_ioc_fssetxattr_prepare(inode, &old_fa, &fa);
if (ret)
goto out_unlock;
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 1e88c3af9a8d..146587c3fe8e 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1109,7 +1109,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
inode_lock(inode);
ext4_fill_fsxattr(inode, &old_fa);
-   err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   err = vfs_ioc_fssetxattr_prepare(inode, &old_fa, &fa);
if (err)
goto out;
flags = (ei->i_flags & ~EXT4_FL_XFLAG_VISIBLE) |
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index d6ed319388d6..af0fc040a15c 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2826,7 +2826,7 @@ static int f2fs_ioc_fssetxattr(struct file *filp, unsigned long arg)
inode_lock(inode);
 
f2fs_fill_fsxattr(inode, &old_fa);
-   err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   err = vfs_ioc_fssetxattr_prepare(inode, &old_fa, &fa);
if (err)
goto out;
flags = (fi->i_flags & ~F2FS_FL_XFLAG_VISIBLE) |
diff --git a/fs/inode.c b/fs/inode.c
index 65a412af3ffb..cf07378e5731 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2293,3 +2293,34 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const struct fsxattr *old_fa,
return 0;
 }
 EXPORT_SYMBOL(vfs_ioc_fssetxattr_check);
+
+/*
+ * Generic function to check FS_IOC_FSSETXATTR values and reject any invalid
+ * configurations.  If none are found, flush all pending IO and dirty mappings
+ * before setting S_IMMUTABLE on an inode.  If the flush fails we'll clear the
+ * flag before returning error.
+ *
+ * Note: the caller must hold whatever locks are necessary to block any other
+ * threads from starting a write to the file.
+ */
+int vfs_ioc_fssetxattr_prepare(struct inode *inode,
+  const struct fsxattr *old_fa,
+  struct fsxattr *fa)
+{
+   int ret;
+
+   ret = vfs_ioc_fssetxattr_check(inode, old_fa, fa);
+   if (ret)
+   return ret;
+
+   if (!S_ISREG(inode->i_mode) || IS_IMMUTABLE(inode) ||
+   !(fa->fsx_xflags & FS_XFLAG_IMMUTABLE))
+   return 0;
+
+   inode_set_flags(inode, S_IMMUTABLE, S_IMMUTABLE);
+   ret = inode_drain_writes(inode);
+   if (ret)
+   inode_set_flags(inode, 0, S_IMMUTABLE);
+   return ret;
+}
+EXPORT_SYMBOL(vfs_ioc_fssetxattr_prepare);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 011657bd50ca..723550c8a2e4 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1058,6 +1058,30 @@ xfs_ioctl_setattr_xflags(
return 0;
 }
 
+/*
+ * If we're setting immutable on a regular file, we need to prevent new writes.
+ * Once we've done that, we must wait for all the other writes to complete.
+ *
+ * The caller must use @join_flags to release the locks which are held on @ip
+ * regardless of return value.
+ */
+static int
+xfs_ioctl_setattr_drain_writes(
+   struct xfs_inode*ip,
+   const struct fsxattr*fa,
+   int *join_flags)
+{
+   struct inode*inode = VFS_I(ip);
+
+   if (!S_ISREG(inode->i_mode) || !(fa->fsx_xflags & FS_XFLAG_IMMUTABLE))
+   return 0;
+
+   *join_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+   xfs_ilock(ip, *join_flags);
+
+   return inode_drain_writes(inode);
+}
+
 /*
  * If we are changing DA

[PATCH 2/5] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-25 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
need to ensure that userspace can't continue to write the file after the
file becomes immutable.  To make that happen, we have to flush all the
dirty pagecache pages to disk to ensure that we can fail a page fault on
a mmap'd region, wait for pending directio to complete, and hope the
caller locked out any new writes by holding the inode lock.

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   21 +++--
 include/linux/fs.h |   11 +++
 2 files changed, 30 insertions(+), 2 deletions(-)


diff --git a/fs/inode.c b/fs/inode.c
index f08711b34341..65a412af3ffb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2193,7 +2193,8 @@ EXPORT_SYMBOL(current_time);
 
 /*
  * Generic function to check FS_IOC_SETFLAGS values and reject any invalid
- * configurations.
+ * configurations.  Once we're done, prepare the inode for whatever changes
+ * are coming down the pipeline.
  *
  * Note: the caller should be holding i_mutex, or else be sure that they have
  * exclusive access to the inode structure.
@@ -2201,6 +2202,8 @@ EXPORT_SYMBOL(current_time);
 int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags,
 unsigned int flags)
 {
+   int ret;
+
/*
 * The IMMUTABLE and APPEND_ONLY flags can only be changed by
 * the relevant capability.
@@ -2211,7 +2214,21 @@ int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags,
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
-   return 0;
+   /*
+* Now that we're done checking the new flags, flush all pending IO and
+* dirty mappings before setting S_IMMUTABLE on an inode via
+* FS_IOC_SETFLAGS.  If the flush fails we'll clear the flag before
+* returning error.
+*/
+   if (!S_ISREG(inode->i_mode) || IS_IMMUTABLE(inode) ||
+   !(flags & FS_IMMUTABLE_FL))
+   return 0;
+
+   inode_set_flags(inode, S_IMMUTABLE, S_IMMUTABLE);
+   ret = inode_drain_writes(inode);
+   if (ret)
+   inode_set_flags(inode, 0, S_IMMUTABLE);
+   return ret;
 }
 EXPORT_SYMBOL(vfs_ioc_setflags_prepare);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 48322bfd7299..51266c9dbadc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3561,4 +3561,15 @@ int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags,
 int vfs_ioc_fssetxattr_check(struct inode *inode, const struct fsxattr *old_fa,
 struct fsxattr *fa);
 
+/*
+ * Flush file data before changing attributes.  Caller must hold any locks
+ * required to prevent further writes to this file until we're done setting
+ * flags.
+ */
+static inline int inode_drain_writes(struct inode *inode)
+{
+   inode_dio_wait(inode);
+   return filemap_write_and_wait(inode->i_mapping);
+}
+
 #endif /* _LINUX_FS_H */
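
The set-flag, drain, roll-back sequence in vfs_ioc_setflags_prepare can be
modelled in userspace roughly as follows.  This is only a sketch: struct
model_inode, model_drain_writes and model_setflags_prepare are hypothetical
names invented for the model, and the drain step just returns a preset result
instead of waiting for real I/O.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of an inode; not the kernel's struct inode. */
struct model_inode {
	bool is_regular;	/* models S_ISREG(inode->i_mode) */
	bool immutable;		/* models the S_IMMUTABLE bit in i_flags */
	int drain_result;	/* what the modelled drain step returns */
	int drain_calls;	/* how often the drain step ran */
};

/* Models inode_drain_writes(): here it only reports a preset result. */
static int model_drain_writes(struct model_inode *inode)
{
	inode->drain_calls++;
	return inode->drain_result;
}

/* Models the tail of vfs_ioc_setflags_prepare() after the permission checks. */
static int model_setflags_prepare(struct model_inode *inode, bool want_immutable)
{
	int ret;

	assert(inode != NULL);

	/* Nothing to drain unless a regular file is newly becoming immutable. */
	if (!inode->is_regular || inode->immutable || !want_immutable)
		return 0;

	/* Set the flag first so new writers are refused while we drain. */
	inode->immutable = true;
	ret = model_drain_writes(inode);
	if (ret)
		inode->immutable = false;	/* roll back on failure */
	return ret;
}
```

Setting the flag before draining matters: if the order were reversed, a
writer could dirty a page between the flush and the flag update.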

[PATCH 4/5] vfs: don't allow most setxattr to immutable files

2019-06-25 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

However, we don't actually check the immutable flag in the setattr code,
which means that we can update inode flags and project ids and extent
size hints on supposedly immutable files.  Therefore, reject setflags
and fssetxattr calls if the file is already immutable and will remain
that way.

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   27 +++
 1 file changed, 27 insertions(+)


diff --git a/fs/inode.c b/fs/inode.c
index cf07378e5731..4261c709e50e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2214,6 +2214,14 @@ int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags,
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
+   /*
+* We aren't allowed to change any other flags if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((oldflags & FS_IMMUTABLE_FL) && (flags & FS_IMMUTABLE_FL) &&
+   oldflags != flags)
+   return -EPERM;
+
/*
 * Now that we're done checking the new flags, flush all pending IO and
 * dirty mappings before setting S_IMMUTABLE on an inode via
@@ -2284,6 +2292,25 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const struct fsxattr *old_fa,
!(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
return -EINVAL;
 
+   /*
+* We aren't allowed to change any fields if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
+   (fa->fsx_xflags & FS_XFLAG_IMMUTABLE)) {
+   if (old_fa->fsx_xflags != fa->fsx_xflags)
+   return -EPERM;
+   if (old_fa->fsx_projid != fa->fsx_projid)
+   return -EPERM;
+   if ((fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+  FS_XFLAG_EXTSZINHERIT)) &&
+   old_fa->fsx_extsize != fa->fsx_extsize)
+   return -EPERM;
+   if ((old_fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
+   old_fa->fsx_cowextsize != fa->fsx_cowextsize)
+   return -EPERM;
+   }
+
/* Extent size hints of zero turn off the flags. */
if (fa->fsx_extsize == 0)
fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
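
The setflags half of the rule above ("no other changes while immutable stays
set") can be exercised in isolation with a small userspace sketch.  The
MODEL_FS_*_FL constants are local copies that are meant to mirror the values
in the Linux UAPI header <linux/fs.h>; the logic does not depend on the exact
values, and model_setflags_immutable_check is a hypothetical name.

```c
#include <errno.h>

/* Local copies, intended to mirror <linux/fs.h>; treat the values as assumptions. */
#define MODEL_FS_IMMUTABLE_FL	0x00000010u
#define MODEL_FS_APPEND_FL	0x00000020u
#define MODEL_FS_NOATIME_FL	0x00000080u

/*
 * Reject any flag change when the immutable flag is already set and is not
 * being cleared by this call.
 */
static int model_setflags_immutable_check(unsigned int oldflags,
					  unsigned int flags)
{
	if ((oldflags & MODEL_FS_IMMUTABLE_FL) &&
	    (flags & MODEL_FS_IMMUTABLE_FL) &&
	    oldflags != flags)
		return -EPERM;
	return 0;
}
```

Note that a call which clears the immutable flag may change other flags in
the same operation, because the second condition is then false.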

[PATCH 1/5] mm/fs: don't allow writes to immutable files

2019-06-25 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Once the flag is set, it is enforced for quite a few file operations,
such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
don't check for immutability when doing a write(), a PROT_WRITE mmap(),
a truncate(), or a write to a previously established mmap.

If a program has an open write fd to a file that the administrator
subsequently marks immutable, the program still can change the file
contents.  Weird!

The ability to write to an immutable file does not follow the manpage
promise that immutable files cannot be modified.  Worse yet it's
inconsistent with the behavior of other syscalls which don't allow
modifications of immutable files.

Therefore, add the necessary checks to make the write, mmap, and
truncate behavior consistent with what the manpage says and consistent
with other syscalls on filesystems which support IMMUTABLE.

Signed-off-by: Darrick J. Wong 
Reviewed-by: Jan Kara 
---
 fs/attr.c|   13 ++---
 mm/filemap.c |3 +++
 mm/memory.c  |3 +++
 mm/mmap.c|8 ++--
 4 files changed, 18 insertions(+), 9 deletions(-)


diff --git a/fs/attr.c b/fs/attr.c
index d22e8187477f..1fcfdcc5b367 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -233,19 +233,18 @@ int notify_change(struct dentry * dentry, struct iattr * attr, struct inode **de
 
WARN_ON_ONCE(!inode_is_locked(inode));
 
-   if (ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) {
-   if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   return -EPERM;
-   }
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
+   if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
+   IS_APPEND(inode))
+   return -EPERM;
 
/*
 * If utimes(2) and friends are called with times == NULL (or both
 * times are UTIME_NOW), then we need to check for write permission
 */
if (ia_valid & ATTR_TOUCH) {
-   if (IS_IMMUTABLE(inode))
-   return -EPERM;
-
if (!inode_owner_or_capable(inode)) {
error = inode_permission(inode, MAY_WRITE);
if (error)
diff --git a/mm/filemap.c b/mm/filemap.c
index aac71aef4c61..dad85e10f5f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2935,6 +2935,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, struct iov_iter *from)
loff_t count;
int ret;
 
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
if (!iov_iter_count(from))
return 0;
 
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..4311cfdade90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2235,6 +2235,9 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
+   if (vmf->vma->vm_file && IS_IMMUTABLE(file_inode(vmf->vma->vm_file)))
+   return VM_FAULT_SIGBUS;
+
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
/* Restore original flags so that caller is not surprised */
vmf->flags = old_flags;
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..ac1e32205237 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1483,8 +1483,12 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
-   if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
-   return -EACCES;
+   if (prot & PROT_WRITE) {
+   if (!(file->f_mode & FMODE_WRITE))
+   return -EACCES;
+   if (IS_IMMUTABLE(file_inode(file)))
+   return -EPERM;
+   }
 
/*
 * Make sure we don't allow writing to an append-only
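
The reworked shared-mapping check in do_mmap boils down to the decision
below.  This is a userspace sketch only: model_shared_mmap_check is a
hypothetical name, and the two booleans stand in for the FMODE_WRITE test on
the file and IS_IMMUTABLE() on its inode.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/mman.h>	/* PROT_READ, PROT_WRITE */

/*
 * Decide whether a shared mapping with the given protection is allowed.
 * Read-only mappings are never affected by the immutable flag.
 */
static int model_shared_mmap_check(int prot, bool file_writable, bool immutable)
{
	if (prot & PROT_WRITE) {
		if (!file_writable)
			return -EACCES;
		if (immutable)
			return -EPERM;
	}
	return 0;
}
```

The -EACCES case is checked first, so a file that was not opened for writing
still gets the old error code even if it is also immutable.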

[PATCH v5 0/5] vfs: make immutable files actually immutable

2019-06-25 Thread Darrick J. Wong

Hi all,

The chattr(1) manpage has this to say about the immutable bit that
system administrators can set on files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Note that various distro manpages point out the inconsistent behavior
of the various Linux filesystems w.r.t. immutable. This fixes all that.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=immutable-files

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=immutable-files

Re: [PATCH v4 0/7] vfs: make immutable files actually immutable

2019-06-25 Thread Darrick J. Wong

On Tue, Jun 25, 2019 at 03:36:31AM -0700, Christoph Hellwig wrote:
> On Fri, Jun 21, 2019 at 04:56:50PM -0700, Darrick J. Wong wrote:
> > Hi all,
> > 
> > The chattr(1) manpage has this to say about the immutable bit that
> > system administrators can set on files:
> > 
> > "A file with the 'i' attribute cannot be modified: it cannot be deleted
> > or renamed, no link can be created to this file, most of the file's
> > metadata can not be modified, and the file can not be opened in write
> > mode."
> > 
> > Given the clause about how the file 'cannot be modified', it is
> > surprising that programs holding writable file descriptors can continue
> > to write to and truncate files after the immutable flag has been set,
> > but they cannot call other things such as utimes, fallocate, unlink,
> > link, setxattr, or reflink.
> 
> I still think living code beats documentation.  And as far as I can
> tell the immutable bit never behaved as documented or implemented
> in this series on Linux, and it originated on Linux.

The behavior has never been consistent -- since the beginning you can
keep write()ing to a fd after the file becomes immutable, but you can't
ftruncate() it.  I would really like to make the behavior consistent.
Since the authors of nearly every new system call and ioctl since the
late 1990s have interpreted S_IMMUTABLE to mean "immutable takes effect
everywhere immediately" I resolved the inconsistency in favor of that
interpretation.

I asked Ted what he thought about userspace having the ability to
continue writing to an immutable file, and he thought it was an
implementation bug that had been there for 25 years.  Even he thought
that immutable should take effect immediately everywhere.

> If you want a hard cut off style immutable flag it should really be a
> new API, but I don't really see the point.  It isn't like the usual
> workload is to set the flag on a file actively in use.

FWIW Ted also thought that since it's rare for admins to set +i on a
file actively in use we could just change it without forcing everyone
onto a new api.

--D

Re: [Ocfs2-devel] [PATCH 2/7] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-24 Thread Darrick J. Wong

On Mon, Jun 24, 2019 at 02:58:17PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 24, 2019 at 01:37:37PM +0200, Jan Kara wrote:
> > On Fri 21-06-19 16:57:07, Darrick J. Wong wrote:
> > > From: Darrick J. Wong 
> > > 
> > > When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
> > > need to ensure that userspace can't continue to write the file after the
> > > file becomes immutable.  To make that happen, we have to flush all the
> > > dirty pagecache pages to disk to ensure that we can fail a page fault on
> > > a mmap'd region, wait for pending directio to complete, and hope the
> > > caller locked out any new writes by holding the inode lock.
> > > 
> > > Signed-off-by: Darrick J. Wong 
> > 
> > Seeing the way this worked out, is there a reason to have separate
> > vfs_ioc_setflags_flush_data() instead of folding the functionality in
> > vfs_ioc_setflags_check() (possibly renaming it to
> > vfs_ioc_setflags_prepare() to indicate it does already some changes)? I
> > don't see any place that would need these two separated...
> 
> XFS needs them to be separated.
> 
> If we even /think/ that we're going to be setting the immutable flag
> then we need to grab the IOLOCK and the MMAPLOCK to prevent further
> writes while we drain all the directio writes and dirty data.  IO
> completions for the write draining can take the ILOCK, which means that
> we can't have grabbed it yet.
> 
> Next, we grab the ILOCK so we can check the new flags against the inode
> and then update the inode core.
> 
> For most filesystems I think it suffices to inode_lock and then do both,
> though.

Heh, lol, that applies to fssetxattr, not to setflags, because the xfs
setflags implementation open-codes the relevant fssetxattr pieces.
So for setflags we can combine both parts into a single _prepare
function.

--D

> > > +/*
> > > + * Flush all pending IO and dirty mappings before setting S_IMMUTABLE on an
> > > + * inode via FS_IOC_SETFLAGS.  If the flush fails we'll clear the flag before
> > > + * returning error.
> > > + *
> > > + * Note: the caller should be holding i_mutex, or else be sure that
> > > + * they have exclusive access to the inode structure.
> > > + */
> > > +static inline int vfs_ioc_setflags_flush_data(struct inode *inode, int flags)
> > > +{
> > > + int ret;
> > > +
> > > + if (!vfs_ioc_setflags_need_flush(inode, flags))
> > > + return 0;
> > > +
> > > + inode_set_flags(inode, S_IMMUTABLE, S_IMMUTABLE);
> > > + ret = inode_flush_data(inode);
> > > + if (ret)
> > > + inode_set_flags(inode, 0, S_IMMUTABLE);
> > > + return ret;
> > > +}
> > 
> > Also this sets S_IMMUTABLE whenever vfs_ioc_setflags_need_flush() returns
> > true. That is currently the right thing but seems like a landmine waiting
> > to trip? So I'd just drop the vfs_ioc_setflags_need_flush() abstraction to
> > make it clear what's going on.
> 
> Ok.
> 
> --D
> 
> > 
> > Honza
> > -- 
> > Jan Kara 
> > SUSE Labs, CR

Re: [PATCH 2/7] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-24 Thread Darrick J. Wong

On Mon, Jun 24, 2019 at 01:37:37PM +0200, Jan Kara wrote:
> On Fri 21-06-19 16:57:07, Darrick J. Wong wrote:
> > From: Darrick J. Wong 
> > 
> > When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
> > need to ensure that userspace can't continue to write the file after the
> > file becomes immutable.  To make that happen, we have to flush all the
> > dirty pagecache pages to disk to ensure that we can fail a page fault on
> > a mmap'd region, wait for pending directio to complete, and hope the
> > caller locked out any new writes by holding the inode lock.
> > 
> > Signed-off-by: Darrick J. Wong 
> 
> Seeing the way this worked out, is there a reason to have separate
> vfs_ioc_setflags_flush_data() instead of folding the functionality in
> vfs_ioc_setflags_check() (possibly renaming it to
> vfs_ioc_setflags_prepare() to indicate it does already some changes)? I
> don't see any place that would need these two separated...

XFS needs them to be separated.

If we even /think/ that we're going to be setting the immutable flag
then we need to grab the IOLOCK and the MMAPLOCK to prevent further
writes while we drain all the directio writes and dirty data.  IO
completions for the write draining can take the ILOCK, which means that
we can't have grabbed it yet.

Next, we grab the ILOCK so we can check the new flags against the inode
and then update the inode core.

For most filesystems I think it suffices to inode_lock and then do both,
though.

> > +/*
> > + * Flush all pending IO and dirty mappings before setting S_IMMUTABLE on an
> > + * inode via FS_IOC_SETFLAGS.  If the flush fails we'll clear the flag before
> > + * returning error.
> > + *
> > + * Note: the caller should be holding i_mutex, or else be sure that
> > + * they have exclusive access to the inode structure.
> > + */
> > +static inline int vfs_ioc_setflags_flush_data(struct inode *inode, int flags)
> > +{
> > +   int ret;
> > +
> > +   if (!vfs_ioc_setflags_need_flush(inode, flags))
> > +   return 0;
> > +
> > +   inode_set_flags(inode, S_IMMUTABLE, S_IMMUTABLE);
> > +   ret = inode_flush_data(inode);
> > +   if (ret)
> > +   inode_set_flags(inode, 0, S_IMMUTABLE);
> > +   return ret;
> > +}
> 
> Also this sets S_IMMUTABLE whenever vfs_ioc_setflags_need_flush() returns
> true. That is currently the right thing but seems like a landmine waiting
> to trip? So I'd just drop the vfs_ioc_setflags_need_flush() abstraction to
> make it clear what's going on.

Ok.

--D

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
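
The XFS ordering constraint described in this reply (block writers, drain,
and only then take the ILOCK) can be captured in a toy lock-state model.
Everything here is hypothetical: struct model_xfs_inode and the model_*
functions only track booleans and do not use real locks, and the field names
echo the XFS lock names loosely.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock-state model; no real locking is performed. */
struct model_xfs_inode {
	bool iolock_held;	/* models XFS_IOLOCK_EXCL */
	bool mmaplock_held;	/* models XFS_MMAPLOCK_EXCL */
	bool ilock_held;	/* models the ILOCK */
	bool drained;
	bool flags_updated;
};

/* Models the write drain, which must run without the ILOCK held. */
static void model_drain(struct model_xfs_inode *ip)
{
	/* New writers must already be blocked... */
	assert(ip->iolock_held && ip->mmaplock_held);
	/* ...and I/O completion may need the ILOCK, so we cannot hold it. */
	assert(!ip->ilock_held);
	ip->drained = true;
}

static void model_set_immutable(struct model_xfs_inode *ip)
{
	/* Step 1: block new buffered, direct and mmap writers. */
	ip->iolock_held = true;
	ip->mmaplock_held = true;

	/* Step 2: wait for writes that are already in flight. */
	model_drain(ip);

	/* Step 3: take the ILOCK, check the new flags, update the inode. */
	ip->ilock_held = true;
	ip->flags_updated = true;

	/* Release everything in reverse order. */
	ip->ilock_held = false;
	ip->mmaplock_held = false;
	ip->iolock_held = false;
}
```

Filesystems with a single inode lock do not need this split, which is why
the combined _prepare helper suffices for them.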

Re: [PATCH 2/7] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-24 Thread Darrick J. Wong

On Mon, Jun 24, 2019 at 05:33:58PM +0200, Jan Kara wrote:
> On Fri 21-06-19 16:57:07, Darrick J. Wong wrote:
> > +/*
> > + * Flush file data before changing attributes.  Caller must hold any locks
> > + * required to prevent further writes to this file until we're done setting
> > + * flags.
> > + */
> > +static inline int inode_flush_data(struct inode *inode)
> > +{
> > +   inode_dio_wait(inode);
> > +   return filemap_write_and_wait(inode->i_mapping);
> > +}
> 
> BTW, how about calling this function inode_drain_writes() instead? The
> 'flush_data' part is more a detail of implementation of write draining than
> what we need to do to set immutable flag.

Ok, that's a much better description of what the function does.

--D

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
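
For readers who want the two steps of the renamed helper spelled out, here is
a userspace model of what "draining writes" means.  It is a sketch under
assumptions: struct model_file and model_inode_drain_writes are hypothetical,
and counters stand in for in-flight direct I/O and dirty pagecache pages.

```c
/* Hypothetical model of per-file write state. */
struct model_file {
	int inflight_dio;	/* direct I/O requests still in flight */
	int dirty_pages;	/* pagecache pages awaiting writeback */
	int writeback_error;	/* 0, or a negative errno to simulate failure */
};

/* Models inode_drain_writes(): wait for direct I/O, then flush the pagecache. */
static int model_inode_drain_writes(struct model_file *f)
{
	/* Models inode_dio_wait(): returns once no direct I/O is in flight. */
	f->inflight_dio = 0;

	/* Models filemap_write_and_wait(): write back and report any error. */
	if (f->writeback_error)
		return f->writeback_error;
	f->dirty_pages = 0;
	return 0;
}
```

The caller is still responsible for keeping new writers out while this runs,
as the comment on the real helper says.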

Re: [PATCH 1/6] iomap: Use a IOMAP_COW/srcmap for a read-modify-write I/O

2019-06-21 Thread Darrick J. Wong

On Fri, Jun 21, 2019 at 02:28:23PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Introduces a new type IOMAP_COW, which means the data at offset
> must be read from a srcmap and copied before performing the
> write on the offset.
> 
> The srcmap is used to identify where the read is to be performed
> from. This is passed to iomap->begin(), which is supposed to
> put in the details for reading, typically set with type IOMAP_READ.

What is IOMAP_READ ?

> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c  |  8 +---
>  fs/ext2/inode.c   |  2 +-
>  fs/ext4/inode.c   |  2 +-
>  fs/gfs2/bmap.c|  3 ++-
>  fs/internal.h |  2 +-
>  fs/iomap.c| 31 ---
>  fs/xfs/xfs_iomap.c|  9 ++---
>  include/linux/iomap.h |  4 +++-
>  8 files changed, 35 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 2e48c7ebb973..80b9e2599223 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1078,7 +1078,7 @@ EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> - struct iomap *iomap)
> + struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct block_device *bdev = iomap->bdev;
>   struct dax_device *dax_dev = iomap->dax_dev;
> @@ -1236,6 +1236,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>   unsigned long vaddr = vmf->address;
>   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   unsigned flags = IOMAP_FAULT;
>   int error, major = 0;
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
> @@ -1280,7 +1281,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
>* the file system block size to be equal the page size, which means
>* that we never have to deal with more than a single extent here.
>*/
> - error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
> + error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap, &srcmap);
>   if (iomap_errp)
>   *iomap_errp = error;
>   if (error) {
> @@ -1460,6 +1461,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>   struct inode *inode = mapping->host;
>   vm_fault_t result = VM_FAULT_FALLBACK;
>   struct iomap iomap = { 0 };
> + struct iomap srcmap = { 0 };
>   pgoff_t max_pgoff;
>   void *entry;
>   loff_t pos;
> @@ -1534,7 +1536,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
>* to look up our filesystem block.
>*/
>   pos = (loff_t)xas.xa_index << PAGE_SHIFT;
> - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> + error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap, &srcmap);

Line too long?

Also, I guess the DAX and directio write paths will just WARN_ON_ONCE if
someone feeds them an IOMAP_COW type iomap?

Ah, right, I guess the only filesystems that use iomap directio and
iomap dax don't support COW. :)

--D

>   if (error)
>   goto unlock_entry;
>  
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index e474127dd255..f081f11980ad 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -801,7 +801,7 @@ int ext2_get_block(struct inode *inode, sector_t iblock,
>  
>  #ifdef CONFIG_FS_DAX
>  static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> - unsigned flags, struct iomap *iomap)
> + unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
>   unsigned int blkbits = inode->i_blkbits;
>   unsigned long first_block = offset >> blkbits;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7f77c643008..a8017e0c302b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3437,7 +3437,7 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
>  }
>  
>  static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> - unsigned flags, struct iomap *iomap)
> + unsigned flags, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>   unsigned int blkbits = inode->i_blkbits;
> diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
> index 93ea1d529aa3..affa0c4305b7 100644
> --- a/fs/gfs2/bmap.c
> +++ b/fs/gfs2/bmap.c
> @@ -1124,7 +1124,8 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
>  }
>  
>  static int gfs2_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
> - unsigned flags, struct iomap *iomap)
> + unsigned flags, struct iomap *iomap,
> + struct iomap *srcmap)
>  {
>   struct gfs2_inode *ip = GFS2_I(inode);
>   struct metapath mp = { .mp_ah

Re: [PATCH 2/6] iomap: Read page from srcmap for IOMAP_COW

2019-06-21 Thread Darrick J. Wong

On Fri, Jun 21, 2019 at 02:28:24PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 

Commit message needed here...

> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/iomap.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 6648957af268..8a7b20e432ef 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -655,7 +655,7 @@ __iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len,
>  
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
> flags,
> - struct page **pagep, struct iomap *iomap)
> + struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
>  {
>   const struct iomap_page_ops *page_ops = iomap->page_ops;
>   pgoff_t index = pos >> PAGE_SHIFT;
> @@ -681,6 +681,8 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>  
>   if (iomap->type == IOMAP_INLINE)
>   iomap_read_inline_data(inode, page, iomap);
> + else if (iomap->type == IOMAP_COW)
> + status = __iomap_write_begin(inode, pos, len, page, srcmap);

Pardon my stream of consciousness while I try to reason about this
change...

Hmm.  For writes to the page cache (which aren't necessarily aligned to
a page granularity), this part of the iomap code has used whatever iomap
the fs provides to read in whatever page contents are needed from disk
so that we can do a read-modify-write through the page cache.

For XFS this means that we (almost*) always report data fork extents in
response to a write query, even if the write would be COW, because we
know that we won't need the cow fork mapping until writeback.  This has
led to the sort of funny situation where an (IOMAP_WRITE|IOMAP_DIRECT)
request will return the COW fork extent, but an (IOMAP_WRITE) request
returns the data fork extent.

(* "almost", because we will sometimes return a cow fork extent if the
data fork is a hole and the file has extent size hints enabled.  We're
safe from reading in stale disk contents because cow fork extents do not
achieve written status until writeback completes, and the page stays
locked so we can't write and writeback it simultaneously)

I /think/ this finally enables us to fix that weird quirk of the xfs
iomap methods, because now we always report the write address and we
always report the read address if the actor is supposed to do a
read-modify-write.  It's the actor's responsibility to sort that out,
not the ->iomap_begin function's.
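
To make that split concrete, here is a toy userspace model in plain C.
It is a sketch under assumptions, not the kernel code: toy_iomap,
toy_pick_read_source and the TOY_* values are made-up names for
illustration only.  The idea is that the actor reads old contents
through srcmap when the write mapping is COW, and through the write
mapping itself otherwise.

```c
#include <stddef.h>

/* Toy stand-ins for iomap types; hypothetical names, not kernel API. */
enum toy_iomap_type { TOY_MAPPED, TOY_INLINE, TOY_COW };

struct toy_iomap {
	enum toy_iomap_type type;
	long long addr;		/* disk address of the mapping */
};

/*
 * Pick the mapping that a read-modify-write should read old data from.
 * For a COW write the old contents live at the source mapping; for
 * everything else the write mapping doubles as the read mapping.
 */
static const struct toy_iomap *
toy_pick_read_source(const struct toy_iomap *iomap,
		     const struct toy_iomap *srcmap)
{
	if (iomap->type == TOY_COW && srcmap != NULL)
		return srcmap;
	return iomap;
}
```

In this model ->iomap_begin only fills in both mappings; the choice of
which one to read from lives entirely in the actor-side helper.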

--D

>   else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>   status = __block_write_begin_int(page, pos, len, NULL, iomap);
>   else
> @@ -833,7 +835,7 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>   }
>  
>   status = iomap_write_begin(inode, pos, bytes, flags, &page,
> - iomap);
> + iomap, srcmap);
>   if (unlikely(status))
>   break;
>  
> @@ -932,7 +934,7 @@ iomap_dirty_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>   return PTR_ERR(rpage);
>  
>   status = iomap_write_begin(inode, pos, bytes,
> -AOP_FLAG_NOFS, &page, iomap);
> +AOP_FLAG_NOFS, &page, iomap, srcmap);
>   put_page(rpage);
>   if (unlikely(status))
>   return status;
> @@ -978,13 +980,13 @@ iomap_file_dirty(struct inode *inode, loff_t pos, 
> loff_t len,
>  EXPORT_SYMBOL_GPL(iomap_file_dirty);
>  
>  static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> - unsigned bytes, struct iomap *iomap)
> + unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct page *page;
>   int status;
>  
>   status = iomap_write_begin(inode, pos, bytes, AOP_FLAG_NOFS, &page,
> -iomap);
> +iomap, srcmap);
>   if (status)
>   return status;
>  
> @@ -1022,7 +1024,7 @@ iomap_zero_range_actor(struct inode *inode, loff_t pos, 
> loff_t count,
>   if (IS_DAX(inode))
>   status = iomap_dax_zero(pos, offset, bytes, iomap);
>   else
> - status = iomap_zero(inode, pos, offset, bytes, iomap);
> + status = iomap_zero(inode, pos, offset, bytes, iomap, 
> srcmap);
>   if (status < 0)
>   return status;
>  
> -- 
> 2.16.4
>

Re: [PATCH 3/6] iomap: Check iblocksize before transforming page->private

2019-06-21 Thread Darrick J. Wong

On Fri, Jun 21, 2019 at 02:28:25PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> btrfs uses page->private as well to store extent_buffer. Make
> the check stricter to make sure we are using page->private for iop by
> comparing iblocksize < PAGE_SIZE.
> 
> Signed-off-by: Goldwyn Rodrigues 

/me wonders what will happen when btrfs decides to support blocksize !=
pagesize... will we have to add a pointer to struct iomap_page so that
btrfs can continue to associate an extent_buffer with a page?

--D

> ---
>  include/linux/iomap.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index f49767c7fd83..6511124e58b6 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -128,7 +128,8 @@ struct iomap_page {
>  
>  static inline struct iomap_page *to_iomap_page(struct page *page)
>  {
> - if (page_has_private(page))
> + if (i_blocksize(page->mapping->host) < PAGE_SIZE &&
> + page_has_private(page))
>   return (struct iomap_page *)page_private(page);
>   return NULL;
>  }
> -- 
> 2.16.4
>

[PATCH 7/7] vfs: don't allow writes to swap files

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

Don't let userspace write to an active swap file because the kernel
effectively has a long term lease on the storage and things could get
seriously corrupted if we let this happen.

Signed-off-by: Darrick J. Wong 
---
 fs/attr.c |3 +++
 mm/filemap.c  |3 +++
 mm/memory.c   |4 +++-
 mm/mmap.c |2 ++
 mm/swapfile.c |   15 +--
 5 files changed, 24 insertions(+), 3 deletions(-)


diff --git a/fs/attr.c b/fs/attr.c
index 1fcfdcc5b367..42f4d4fb0631 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -236,6 +236,9 @@ int notify_change(struct dentry * dentry, struct iattr * 
attr, struct inode **de
if (IS_IMMUTABLE(inode))
return -EPERM;
 
+   if (IS_SWAPFILE(inode))
+   return -ETXTBSY;
+
if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
IS_APPEND(inode))
return -EPERM;
diff --git a/mm/filemap.c b/mm/filemap.c
index dad85e10f5f8..fd80bc20e30a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2938,6 +2938,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
if (IS_IMMUTABLE(inode))
return -EPERM;
 
+   if (IS_SWAPFILE(inode))
+   return -ETXTBSY;
+
if (!iov_iter_count(from))
return 0;
 
diff --git a/mm/memory.c b/mm/memory.c
index 4311cfdade90..c04c6a689995 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2235,7 +2235,9 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
-   if (vmf->vma->vm_file && IS_IMMUTABLE(file_inode(vmf->vma->vm_file)))
+   if (vmf->vma->vm_file &&
+   (IS_IMMUTABLE(file_inode(vmf->vma->vm_file)) ||
+IS_SWAPFILE(file_inode(vmf->vma->vm_file))))
return VM_FAULT_SIGBUS;
 
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
diff --git a/mm/mmap.c b/mm/mmap.c
index ac1e32205237..031807339869 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1488,6 +1488,8 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
return -EACCES;
if (IS_IMMUTABLE(file_inode(file)))
return -EPERM;
+   if (IS_SWAPFILE(file_inode(file)))
+   return -ETXTBSY;
}
 
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 596ac98051c5..390859785558 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3165,6 +3165,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, 
specialfile, int, swap_flags)
if (error)
goto bad_swap;
 
+   /*
+* Flush any pending IO and dirty mappings before we start using this
+* swap file.
+*/
+   if (S_ISREG(inode->i_mode)) {
+   inode->i_flags |= S_SWAPFILE;
+   error = inode_flush_data(inode);
+   if (error) {
+   inode->i_flags &= ~S_SWAPFILE;
+   goto bad_swap;
+   }
+   }
+
mutex_lock(&swapon_mutex);
prio = -1;
if (swap_flags & SWAP_FLAG_PREFER)
@@ -3185,8 +3198,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, 
int, swap_flags)
atomic_inc(&proc_poll_event);
wake_up_interruptible(&proc_poll_wait);
 
-   if (S_ISREG(inode->i_mode))
-   inode->i_flags |= S_SWAPFILE;
error = 0;
goto out;
 bad_swap:

[PATCH 5/7] xfs: refactor setflags to use setattr code directly

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

Refactor the SETFLAGS implementation to use the SETXATTR code directly
instead of partially constructing a struct fsxattr and calling bits and
pieces of the setxattr code.  This reduces code size and becomes
necessary in the next patch to maintain the behavior of allowing
userspace to set immutable on an immutable file so long as nothing
/else/ about the attributes change.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_ioctl.c |   40 +++-
 1 file changed, 3 insertions(+), 37 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 88583b3e1e76..7b19ba2956ad 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1491,11 +1491,8 @@ xfs_ioc_setxflags(
struct file *filp,
void __user *arg)
 {
-   struct xfs_trans *tp;
struct fsxattr  fa;
-   struct fsxattr  old_fa;
unsigned int flags;
-   int join_flags = 0;
int error;
 
if (copy_from_user(&flags, arg, sizeof(flags)))
@@ -1506,44 +1503,13 @@ xfs_ioc_setxflags(
  FS_SYNC_FL))
return -EOPNOTSUPP;
 
-   fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
+   __xfs_ioc_fsgetxattr(ip, false, &fa);
+   fa.fsx_xflags = xfs_merge_ioc_xflags(flags, fa.fsx_xflags);
 
error = mnt_want_write_file(filp);
if (error)
return error;
-
-   /*
-* Changing DAX config may require inode locking for mapping
-* invalidation. These need to be held all the way to transaction commit
-* or cancel time, so need to be passed through to
-* xfs_ioctl_setattr_get_trans() so it can apply them to the join call
-* appropriately.
-*/
-   error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
-   if (error)
-   goto out_drop_write;
-
-   tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
-   if (IS_ERR(tp)) {
-   error = PTR_ERR(tp);
-   goto out_drop_write;
-   }
-
-   __xfs_ioc_fsgetxattr(ip, false, &old_fa);
-   error = vfs_ioc_fssetxattr_check(VFS_I(ip), &old_fa, &fa);
-   if (error) {
-   xfs_trans_cancel(tp);
-   goto out_drop_write;
-   }
-
-   error = xfs_ioctl_setattr_xflags(tp, ip, &fa);
-   if (error) {
-   xfs_trans_cancel(tp);
-   goto out_drop_write;
-   }
-
-   error = xfs_trans_commit(tp);
-out_drop_write:
+   error = xfs_ioctl_setattr(ip, &fa);
mnt_drop_write_file(filp);
return error;
 }

[PATCH 1/7] mm/fs: don't allow writes to immutable files

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Once the flag is set, it is enforced for quite a few file operations,
such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
don't check for immutability when doing a write(), a PROT_WRITE mmap(),
a truncate(), or a write to a previously established mmap.

If a program has an open write fd to a file that the administrator
subsequently marks immutable, the program still can change the file
contents.  Weird!

The ability to write to an immutable file does not follow the manpage
promise that immutable files cannot be modified.  Worse yet it's
inconsistent with the behavior of other syscalls which don't allow
modifications of immutable files.

Therefore, add the necessary checks to make the write, mmap, and
truncate behavior consistent with what the manpage says and consistent
with other syscalls on filesystems which support IMMUTABLE.
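
As a rough illustration of the mmap half of that check, the following is
a userspace toy in plain C, not the kernel code.  toy_mmap_req and
toy_mmap_write_check are hypothetical names; the sketch only assumes the
ordering described above: a writable shared mapping needs a writable fd
first, and is then refused on an immutable inode.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy description of a shared mmap request; field names are hypothetical. */
struct toy_mmap_req {
	bool want_write;	/* PROT_WRITE was requested */
	bool fd_writable;	/* file was opened for writing */
	bool inode_immutable;	/* inode carries the immutable flag */
};

/* Return 0 if the mapping may proceed, or a negative errno if not. */
static int toy_mmap_write_check(const struct toy_mmap_req *req)
{
	if (!req->want_write)
		return 0;		/* read-only mappings are unaffected */
	if (!req->fd_writable)
		return -EACCES;		/* pre-existing rule */
	if (req->inode_immutable)
		return -EPERM;		/* rule modelled on this patch */
	return 0;
}
```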

Signed-off-by: Darrick J. Wong 
---
 fs/attr.c|   13 ++---
 mm/filemap.c |3 +++
 mm/memory.c  |3 +++
 mm/mmap.c|8 ++--
 4 files changed, 18 insertions(+), 9 deletions(-)


diff --git a/fs/attr.c b/fs/attr.c
index d22e8187477f..1fcfdcc5b367 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -233,19 +233,18 @@ int notify_change(struct dentry * dentry, struct iattr * 
attr, struct inode **de
 
WARN_ON_ONCE(!inode_is_locked(inode));
 
-   if (ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) {
-   if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   return -EPERM;
-   }
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
+   if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
+   IS_APPEND(inode))
+   return -EPERM;
 
/*
 * If utimes(2) and friends are called with times == NULL (or both
 * times are UTIME_NOW), then we need to check for write permission
 */
if (ia_valid & ATTR_TOUCH) {
-   if (IS_IMMUTABLE(inode))
-   return -EPERM;
-
if (!inode_owner_or_capable(inode)) {
error = inode_permission(inode, MAY_WRITE);
if (error)
diff --git a/mm/filemap.c b/mm/filemap.c
index aac71aef4c61..dad85e10f5f8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2935,6 +2935,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
loff_t count;
int ret;
 
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
if (!iov_iter_count(from))
return 0;
 
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..4311cfdade90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2235,6 +2235,9 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
+   if (vmf->vma->vm_file && IS_IMMUTABLE(file_inode(vmf->vma->vm_file)))
+   return VM_FAULT_SIGBUS;
+
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
/* Restore original flags so that caller is not surprised */
vmf->flags = old_flags;
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..ac1e32205237 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1483,8 +1483,12 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
-   if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
-   return -EACCES;
+   if (prot & PROT_WRITE) {
+   if (!(file->f_mode & FMODE_WRITE))
+   return -EACCES;
+   if (IS_IMMUTABLE(file_inode(file)))
+   return -EPERM;
+   }
 
/*
 * Make sure we don't allow writing to an append-only

[PATCH 6/7] xfs: clean up xfs_merge_ioc_xflags

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

Clean up the calling convention since we're editing the fsxattr struct
anyway.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_ioctl.c |   32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7b19ba2956ad..a67bc9afdd0b 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -829,35 +829,31 @@ xfs_ioc_ag_geometry(
  * Linux extended inode flags interface.
  */
 
-STATIC unsigned int
+static inline void
 xfs_merge_ioc_xflags(
-   unsigned int flags,
-   unsigned int start)
+   struct fsxattr  *fa,
+   unsigned int flags)
 {
-   unsigned int xflags = start;
-
if (flags & FS_IMMUTABLE_FL)
-   xflags |= FS_XFLAG_IMMUTABLE;
+   fa->fsx_xflags |= FS_XFLAG_IMMUTABLE;
else
-   xflags &= ~FS_XFLAG_IMMUTABLE;
+   fa->fsx_xflags &= ~FS_XFLAG_IMMUTABLE;
if (flags & FS_APPEND_FL)
-   xflags |= FS_XFLAG_APPEND;
+   fa->fsx_xflags |= FS_XFLAG_APPEND;
else
-   xflags &= ~FS_XFLAG_APPEND;
+   fa->fsx_xflags &= ~FS_XFLAG_APPEND;
if (flags & FS_SYNC_FL)
-   xflags |= FS_XFLAG_SYNC;
+   fa->fsx_xflags |= FS_XFLAG_SYNC;
else
-   xflags &= ~FS_XFLAG_SYNC;
+   fa->fsx_xflags &= ~FS_XFLAG_SYNC;
if (flags & FS_NOATIME_FL)
-   xflags |= FS_XFLAG_NOATIME;
+   fa->fsx_xflags |= FS_XFLAG_NOATIME;
else
-   xflags &= ~FS_XFLAG_NOATIME;
+   fa->fsx_xflags &= ~FS_XFLAG_NOATIME;
if (flags & FS_NODUMP_FL)
-   xflags |= FS_XFLAG_NODUMP;
+   fa->fsx_xflags |= FS_XFLAG_NODUMP;
else
-   xflags &= ~FS_XFLAG_NODUMP;
-
-   return xflags;
+   fa->fsx_xflags &= ~FS_XFLAG_NODUMP;
 }
 
 STATIC unsigned int
@@ -1504,7 +1500,7 @@ xfs_ioc_setxflags(
return -EOPNOTSUPP;
 
__xfs_ioc_fsgetxattr(ip, false, &fa);
-   fa.fsx_xflags = xfs_merge_ioc_xflags(flags, fa.fsx_xflags);
+   xfs_merge_ioc_xflags(&fa, flags);
 
error = mnt_want_write_file(filp);
if (error)

[PATCH v4 0/7] vfs: make immutable files actually immutable

2019-06-21 Thread Darrick J. Wong

Hi all,

The chattr(1) manpage has this to say about the immutable bit that
system administrators can set on files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Note that various distro manpages point out the inconsistent behavior
of the various Linux filesystems w.r.t. immutable. This fixes all that.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=immutable-files

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=immutable-files

[PATCH 2/7] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
need to ensure that userspace can't continue to write the file after the
file becomes immutable.  To make that happen, we have to flush all the
dirty pagecache pages to disk to ensure that we can fail a page fault on
a mmap'd region, wait for pending directio to complete, and hope the
caller locked out any new writes by holding the inode lock.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c   |3 +++
 fs/efivarfs/file.c |5 +
 fs/ext2/ioctl.c|5 +
 fs/ext4/ioctl.c|3 +++
 fs/f2fs/file.c |3 +++
 fs/hfsplus/ioctl.c |3 +++
 fs/nilfs2/ioctl.c  |3 +++
 fs/ocfs2/ioctl.c   |3 +++
 fs/orangefs/file.c |   11 ---
 fs/orangefs/protocol.h |3 +++
 fs/reiserfs/ioctl.c|3 +++
 fs/ubifs/ioctl.c   |3 +++
 include/linux/fs.h |   48 
 13 files changed, 93 insertions(+), 3 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7ddda5b4b6a6..f431813b2454 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -214,6 +214,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
ret = vfs_ioc_setflags_check(inode, old_fsflags, fsflags);
+   if (ret)
+   goto out_unlock;
+   ret = vfs_ioc_setflags_flush_data(inode, fsflags);
if (ret)
goto out_unlock;
 
diff --git a/fs/efivarfs/file.c b/fs/efivarfs/file.c
index f4f6c1bec132..845016a67724 100644
--- a/fs/efivarfs/file.c
+++ b/fs/efivarfs/file.c
@@ -163,6 +163,11 @@ efivarfs_ioc_setxflags(struct file *file, void __user *arg)
return error;
 
inode_lock(inode);
+   error = vfs_ioc_setflags_flush_data(inode, flags);
+   if (error) {
+   inode_unlock(inode);
+   return error;
+   }
inode_set_flags(inode, i_flags, S_IMMUTABLE);
inode_unlock(inode);
 
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 88b3b9720023..75f75619237c 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -65,6 +65,11 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
inode_unlock(inode);
goto setflags_out;
}
+   ret = vfs_ioc_setflags_flush_data(inode, flags);
+   if (ret) {
+   inode_unlock(inode);
+   goto setflags_out;
+   }
 
flags = flags & EXT2_FL_USER_MODIFIABLE;
flags |= oldflags & ~EXT2_FL_USER_MODIFIABLE;
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 6aa1df1918f7..a05341b94d98 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -290,6 +290,9 @@ static int ext4_ioctl_setflags(struct inode *inode,
jflag = flags & EXT4_JOURNAL_DATA_FL;
 
err = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (err)
+   goto flags_out;
+   err = vfs_ioc_setflags_flush_data(inode, flags);
if (err)
goto flags_out;
 
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 183ed1ac60e1..d3cf4bdb8738 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1681,6 +1681,9 @@ static int __f2fs_ioc_setflags(struct inode *inode, 
unsigned int flags)
oldflags = fi->i_flags;
 
err = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (err)
+   return err;
+   err = vfs_ioc_setflags_flush_data(inode, flags);
if (err)
return err;
 
diff --git a/fs/hfsplus/ioctl.c b/fs/hfsplus/ioctl.c
index 862a3c9481d7..f8295fa35237 100644
--- a/fs/hfsplus/ioctl.c
+++ b/fs/hfsplus/ioctl.c
@@ -104,6 +104,9 @@ static int hfsplus_ioctl_setflags(struct file *file, int 
__user *user_flags)
inode_lock(inode);
 
err = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (err)
+   goto out_unlock_inode;
+   err = vfs_ioc_setflags_flush_data(inode, flags);
if (err)
goto out_unlock_inode;
 
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 0632336d2515..a3c200ab9f60 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -149,6 +149,9 @@ static int nilfs_ioctl_setflags(struct inode *inode, struct 
file *filp,
oldflags = NILFS_I(inode)->i_flags;
 
ret = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (ret)
+   goto out;
+   ret = vfs_ioc_setflags_flush_data(inode, flags);
if (ret)
goto out;
 
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index 467a2faf0305..e91ca0dad3d7 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -107,6 +107,9 @@ static int ocfs2_set_inode_attr(struct inode *inode, 
unsigned flag

[PATCH 4/7] vfs: don't allow most setxattr to immutable files

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

However, we don't actually check the immutable flag in the setattr code,
which means that we can update inode flags and project ids and extent
size hints on supposedly immutable files.  Therefore, reject setflags
and fssetxattr calls on an immutable file if the file is immutable and
will remain that way.
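
To spell out which SETFLAGS transitions that rule rejects, here is a
small userspace model in plain C.  It is only a sketch: the DEMO_*
constants are stand-ins for the real flag bits, demo_setflags_check is a
hypothetical name, and the capability check that the real helper also
performs is left out.

```c
#include <errno.h>

/* Stand-in flag bits for illustration; not taken from kernel headers. */
#define DEMO_IMMUTABLE_FL	0x10
#define DEMO_NOATIME_FL		0x80

/*
 * Reject any flag change on a file that is immutable now and would
 * still be immutable afterwards.  Clearing the immutable bit, or
 * setting it for the first time, passes this particular rule.
 */
static int demo_setflags_check(int oldflags, int flags)
{
	if ((oldflags & DEMO_IMMUTABLE_FL) && (flags & DEMO_IMMUTABLE_FL) &&
	    oldflags != flags)
		return -EPERM;
	return 0;
}
```

Re-setting immutable with otherwise identical flags stays a no-op
success in this model, which matches the "will remain that way" wording
above.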

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   27 +++
 1 file changed, 27 insertions(+)


diff --git a/fs/inode.c b/fs/inode.c
index 6374ad2ef25b..220caefc31f7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2204,6 +2204,14 @@ int vfs_ioc_setflags_check(struct inode *inode, int 
oldflags, int flags)
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
+   /*
+* We aren't allowed to change any other flags if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((oldflags & FS_IMMUTABLE_FL) && (flags & FS_IMMUTABLE_FL) &&
+   oldflags != flags)
+   return -EPERM;
+
return 0;
 }
 EXPORT_SYMBOL(vfs_ioc_setflags_check);
@@ -2246,6 +2254,25 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
return -EINVAL;
 
+   /*
+* We aren't allowed to change any fields if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
+   (fa->fsx_xflags & FS_XFLAG_IMMUTABLE)) {
+   if (old_fa->fsx_xflags != fa->fsx_xflags)
+   return -EPERM;
+   if (old_fa->fsx_projid != fa->fsx_projid)
+   return -EPERM;
+   if ((fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+  FS_XFLAG_EXTSZINHERIT)) &&
+   old_fa->fsx_extsize != fa->fsx_extsize)
+   return -EPERM;
+   if ((old_fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
+   old_fa->fsx_cowextsize != fa->fsx_cowextsize)
+   return -EPERM;
+   }
+
/* Extent size hints of zero turn off the flags. */
if (fa->fsx_extsize == 0)
fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);

[PATCH 3/7] vfs: flush and wait for io when setting the immutable flag via FSSETXATTR

2019-06-21 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_FSSETXATTR to set the immutable flag on a file,
we need to ensure that userspace can't continue to write the file after
the file becomes immutable.  To make that happen, we have to flush all
the dirty pagecache pages to disk to ensure that we can fail a page
fault on a mmap'd region, wait for pending directio to complete, and
hope the caller locked out any new writes by holding the inode lock.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c   |3 +++
 fs/ext4/ioctl.c|3 +++
 fs/f2fs/file.c |3 +++
 fs/xfs/xfs_ioctl.c |   39 +--
 include/linux/fs.h |   37 +
 5 files changed, 79 insertions(+), 6 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f431813b2454..63a9281e6ce0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -432,6 +432,9 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void 
__user *arg)
 
__btrfs_ioctl_fsgetxattr(binode, &old_fa);
ret = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (ret)
+   goto out_unlock;
+   ret = vfs_ioc_fssetxattr_flush_data(inode, &fa);
if (ret)
goto out_unlock;
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a05341b94d98..6037585c1520 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1115,6 +1115,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
inode_lock(inode);
ext4_fsgetxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (err)
+   goto out;
+   err = vfs_ioc_fssetxattr_flush_data(inode, &fa);
if (err)
goto out;
flags = (ei->i_flags & ~EXT4_FL_XFLAG_VISIBLE) |
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index d3cf4bdb8738..97f4bb36540f 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2832,6 +2832,9 @@ static int f2fs_ioc_fssetxattr(struct file *filp, 
unsigned long arg)
 
__f2fs_ioc_fsgetxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (err)
+   goto out;
+   err = vfs_ioc_fssetxattr_flush_data(inode, &fa);
if (err)
goto out;
flags = (fi->i_flags & ~F2FS_FL_XFLAG_VISIBLE) |
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index b494e7e881e3..88583b3e1e76 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1014,6 +1014,28 @@ xfs_diflags_to_linux(
 #endif
 }
 
+/*
+ * Lock the inode against file io and page faults, then flush all dirty pages
+ * and wait for writeback and direct IO operations to finish.  Returns with
+ * the relevant inode lock flags set in @join_flags.  Caller is responsible for
+ * unlocking even on error return.
+ */
+static int
+xfs_ioctl_setattr_flush(
+   struct xfs_inode *ip,
+   int *join_flags)
+{
+   /* Already locked the inode from IO?  Assume we're done. */
+   if (((*join_flags) & (XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL)) ==
+(XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL))
+   return 0;
+
+   /* Lock and flush all mappings and IO in preparation for flag change */
+   *join_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+   xfs_ilock(ip, *join_flags);
+   return inode_flush_data(VFS_I(ip));
+}
+
 static int
 xfs_ioctl_setattr_xflags(
struct xfs_trans *tp,
@@ -1099,23 +1121,22 @@ xfs_ioctl_setattr_dax_invalidate(
if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
return 0;
 
-   if (S_ISDIR(inode->i_mode))
+   if (!S_ISREG(inode->i_mode))
return 0;
 
-   /* lock, flush and invalidate mapping in preparation for flag change */
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
-   error = filemap_write_and_wait(inode->i_mapping);
+   error = xfs_ioctl_setattr_flush(ip, join_flags);
if (error)
goto out_unlock;
error = invalidate_inode_pages2(inode->i_mapping);
if (error)
goto out_unlock;
 
-   *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
return 0;
 
 out_unlock:
-   xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+   if (*join_flags)
+   xfs_iunlock(ip, *join_flags);
+   *join_flags = 0;
return error;
 
 }
@@ -1337,6 +1358,12 @@ xfs_ioctl_setattr(
if (code)
goto error_free_dquots;
 
+   if (!join_flags && vfs_ioc_fssetxattr_need_flush(VFS_I(ip), fa)) {
+   code = xfs_ioctl_setattr_flush(ip, &join_flags);
+   if (code)
+   goto error_free_dquots;
+   }
+
tp = xfs_ioctl_setattr_get_t

Re: [PATCH 1/6] mm/fs: don't allow writes to immutable files

2019-06-20 Thread Darrick J. Wong

On Thu, Jun 20, 2019 at 05:52:12PM -0400, Theodore Ts'o wrote:
> On Mon, Jun 10, 2019 at 09:46:17PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong 
> > 
> > The chattr manpage has this to say about immutable files:
> > 
> > "A file with the 'i' attribute cannot be modified: it cannot be deleted
> > or renamed, no link can be created to this file, most of the file's
> > metadata can not be modified, and the file can not be opened in write
> > mode."
> > 
> > Once the flag is set, it is enforced for quite a few file operations,
> > such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
> > don't check for immutability when doing a write(), a PROT_WRITE mmap(),
> > a truncate(), or a write to a previously established mmap.
> > 
> > If a program has an open write fd to a file that the administrator
> > subsequently marks immutable, the program still can change the file
> > contents.  Weird!
> > 
> > The ability to write to an immutable file does not follow the manpage
> > promise that immutable files cannot be modified.  Worse yet it's
> > inconsistent with the behavior of other syscalls which don't allow
> > modifications of immutable files.
> > 
> > Therefore, add the necessary checks to make the write, mmap, and
> > truncate behavior consistent with what the manpage says and consistent
> > with other syscalls on filesystems which support IMMUTABLE.
> > 
> > Signed-off-by: Darrick J. Wong 
> 
> I note that this patch doesn't allow writes to swap files.  So Amir's
> generic/554 test will still fail for those file systems that don't use
> copy_file_range.

I didn't add any IS_SWAPFILE checks here, so I'm not sure to what you're
referring?

> I'm indifferent as to whether you add a new patch, or include that
> change in this patch, but perhaps we should fix this while we're
> making changes in these code paths?

The swapfile patches should be in a separate patch, which I was planning
to work on but hadn't really gotten around to it.

--D


>   - Ted

Re: [PATCH 2/6] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-20 Thread Darrick J. Wong

On Thu, Jun 20, 2019 at 04:00:28PM +0200, Jan Kara wrote:
> On Mon 10-06-19 21:46:25, Darrick J. Wong wrote:
> > From: Darrick J. Wong 
> > 
> > When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
> > need to ensure that userspace can't continue to write the file after the
> > file becomes immutable.  To make that happen, we have to flush all the
> > dirty pagecache pages to disk to ensure that we can fail a page fault on
> > a mmap'd region, wait for pending directio to complete, and hope the
> > caller locked out any new writes by holding the inode lock.
> > 
> > Signed-off-by: Darrick J. Wong 
> 
> ...
> 
> > diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> > index 6aa1df1918f7..a05341b94d98 100644
> > --- a/fs/ext4/ioctl.c
> > +++ b/fs/ext4/ioctl.c
> > @@ -290,6 +290,9 @@ static int ext4_ioctl_setflags(struct inode *inode,
> > jflag = flags & EXT4_JOURNAL_DATA_FL;
> >  
> > err = vfs_ioc_setflags_check(inode, oldflags, flags);
> > +   if (err)
> > +   goto flags_out;
> > +   err = vfs_ioc_setflags_flush_data(inode, flags);
> > if (err)
> > goto flags_out;
> >  
> 
> ...
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8dad3c80b611..9c899c63957e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3548,7 +3548,41 @@ static inline struct sock 
> > *io_uring_get_socket(struct file *file)
> >  
> >  int vfs_ioc_setflags_check(struct inode *inode, int oldflags, int flags);
> >  
> > +/*
> > + * Do we need to flush the file data before changing attributes?  When 
> > we're
> > + * setting the immutable flag we must stop all directio writes and flush 
> > the
> > + * dirty pages so that we can fail the page fault on the next write 
> > attempt.
> > + */
> > +static inline bool vfs_ioc_setflags_need_flush(struct inode *inode, int 
> > flags)
> > +{
> > +   if (S_ISREG(inode->i_mode) && !IS_IMMUTABLE(inode) &&
> > +   (flags & FS_IMMUTABLE_FL))
> > +   return true;
> > +
> > +   return false;
> > +}
> > +
> > +/*
> > + * Flush file data before changing attributes.  Caller must hold any locks
> > + * required to prevent further writes to this file until we're done setting
> > + * flags.
> > + */
> > +static inline int inode_flush_data(struct inode *inode)
> > +{
> > +   inode_dio_wait(inode);
> > +   return filemap_write_and_wait(inode->i_mapping);
> > +}
> > +
> > +/* Flush file data before changing attributes, if necessary. */
> > +static inline int vfs_ioc_setflags_flush_data(struct inode *inode, int flags)
> > +{
> > +   if (vfs_ioc_setflags_need_flush(inode, flags))
> > +   return inode_flush_data(inode);
> > +   return 0;
> > +}
> > +
> 
> But this is racy at least for page faults, isn't it? What protects you
> against write faults just after filemap_write_and_wait() has finished?
> So either you need to set FS_IMMUTABLE_FL before flushing data or you need
> to get more protection from the fs than just i_rwsem. In the case of ext4
> that would be i_mmap_rwsem but other filesystems don't have equivalent
> protection...

Yes, I see that now.  I think it'll work to set S_IMMUTABLE before
trying the flush, so long as I am careful to put the call sites right
before we update the inode flags.

--D

> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
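
The ordering problem raised in this exchange can be sketched with a small userspace model. This is only an illustration under stated assumptions: struct model_inode and every model_* function are hypothetical stand-ins, not kernel code, and the "racing fault" is a single write fault injected between the two steps rather than real concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of an inode; not a kernel structure. */
struct model_inode {
	bool immutable;		/* stands in for S_IMMUTABLE */
	int dirty_pages;	/* pages dirtied since the last flush */
};

/* A write fault dirties a page unless the inode is already immutable. */
static bool model_write_fault(struct model_inode *ino)
{
	if (ino->immutable)
		return false;	/* the posted patch fails the fault here */
	ino->dirty_pages++;
	return true;
}

/* Stands in for filemap_write_and_wait(): write back every dirty page. */
static void model_flush(struct model_inode *ino)
{
	ino->dirty_pages = 0;
}

/* Flush first, then set the flag. Returns dirty pages left afterwards. */
int flush_then_set(struct model_inode *ino, bool racing_fault)
{
	assert(ino != NULL);
	model_flush(ino);
	if (racing_fault)
		(void)model_write_fault(ino);	/* slips in after the flush */
	ino->immutable = true;
	return ino->dirty_pages;
}

/* Set the flag first, then flush. Returns dirty pages left afterwards. */
int set_then_flush(struct model_inode *ino, bool racing_fault)
{
	assert(ino != NULL);
	ino->immutable = true;
	if (racing_fault)
		(void)model_write_fault(ino);	/* refused: already immutable */
	model_flush(ino);
	return ino->dirty_pages;
}
```

In this model the flush-first ordering can leave a dirty page behind on an inode that is now immutable, while the set-first ordering cannot. The model says nothing about which locks a real filesystem would need.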

Re: [PATCH 4/6] vfs: don't allow most setxattr to immutable files

2019-06-20 Thread Darrick J. Wong

On Thu, Jun 20, 2019 at 04:03:45PM +0200, Jan Kara wrote:
> On Mon 10-06-19 21:46:45, Darrick J. Wong wrote:
> > From: Darrick J. Wong 
> > 
> > The chattr manpage has this to say about immutable files:
> > 
> > "A file with the 'i' attribute cannot be modified: it cannot be deleted
> > or renamed, no link can be created to this file, most of the file's
> > metadata can not be modified, and the file can not be opened in write
> > mode."
> > 
> > However, we don't actually check the immutable flag in the setattr code,
> > which means that we can update inode flags and project ids and extent
> > size hints on supposedly immutable files.  Therefore, reject setflags
> > and fssetxattr calls on an immutable file if the file is immutable and
> > will remain that way.
> > 
> > Signed-off-by: Darrick J. Wong 
> > ---
> >  fs/inode.c |   31 +++
> >  1 file changed, 31 insertions(+)
> > 
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index a3757051fd55..adfb458bf533 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -2184,6 +2184,17 @@ int vfs_ioc_setflags_check(struct inode *inode, int 
> > oldflags, int flags)
> > !capable(CAP_LINUX_IMMUTABLE))
> > return -EPERM;
> >  
> > +   /*
> > +* We aren't allowed to change any other flags if the immutable flag is
> > +* already set and is not being unset.
> > +*/
> > +   if ((oldflags & FS_IMMUTABLE_FL) &&
> > +   (flags & FS_IMMUTABLE_FL)) {
> > +   if ((oldflags & ~FS_IMMUTABLE_FL) !=
> > +   (flags & ~FS_IMMUTABLE_FL))
> 
> This check looks a bit strange when you've just checked that FS_IMMUTABLE_FL
> isn't changing... Why not just oldflags != flags?
> 
> > +   if ((old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
> > +   (fa->fsx_xflags & FS_XFLAG_IMMUTABLE)) {
> > +   if ((old_fa->fsx_xflags & ~FS_XFLAG_IMMUTABLE) !=
> > +   (fa->fsx_xflags & ~FS_XFLAG_IMMUTABLE))
> 
> Ditto here...

Good point!  I'll fix it.

--D

> 
> > +   return -EPERM;
> > +   if (old_fa->fsx_projid != fa->fsx_projid)
> > +   return -EPERM;
> > +   if ((fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
> > +  FS_XFLAG_EXTSZINHERIT)) &&
> > +   old_fa->fsx_extsize != fa->fsx_extsize)
> > +   return -EPERM;
> > +   if ((old_fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
> > +   old_fa->fsx_cowextsize != fa->fsx_cowextsize)
> > +   return -EPERM;
> > +   }
> > +
> > /* Extent size hints of zero turn off the flags. */
> > if (fa->fsx_extsize == 0)
> > fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR
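
The simplification suggested in this review can be checked mechanically in userspace. The sketch below assumes only the FS_IMMUTABLE_FL constant from the UAPI header <linux/fs.h>; the function names are made up for illustration, and the exhaustive loop covers just the low eight flag bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/fs.h>	/* FS_IMMUTABLE_FL */

/* The comparison as posted: mask the immutable bit out on both sides. */
bool changed_masked(unsigned int oldflags, unsigned int flags)
{
	return (oldflags & ~FS_IMMUTABLE_FL) != (flags & ~FS_IMMUTABLE_FL);
}

/* The comparison suggested in review. */
bool changed_plain(unsigned int oldflags, unsigned int flags)
{
	return oldflags != flags;
}

/*
 * Returns true if both forms agree for every pair of 8-bit flag words
 * in which the immutable bit is set on both sides, which is the only
 * case the surrounding "if" lets through.
 */
bool checks_agree(void)
{
	/* The loop range must cover the immutable bit. */
	assert(FS_IMMUTABLE_FL < 256);

	for (unsigned int o = 0; o < 256; o++) {
		for (unsigned int n = 0; n < 256; n++) {
			if (!(o & FS_IMMUTABLE_FL) || !(n & FS_IMMUTABLE_FL))
				continue;
			if (changed_masked(o, n) != changed_plain(o, n))
				return false;
		}
	}
	return true;
}
```

The two forms only differ when the immutable bit itself changes, and the outer condition has already excluded that case.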

[PATCH 6/6] xfs: clean up xfs_merge_ioc_xflags

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

Clean up the calling convention since we're editing the fsxattr struct
anyway.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_ioctl.c |   32 ++--
 1 file changed, 14 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 7b19ba2956ad..a67bc9afdd0b 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -829,35 +829,31 @@ xfs_ioc_ag_geometry(
  * Linux extended inode flags interface.
  */
 
-STATIC unsigned int
+static inline void
 xfs_merge_ioc_xflags(
-   unsigned int    flags,
-   unsigned int    start)
+   struct fsxattr  *fa,
+   unsigned int    flags)
 {
-   unsigned int    xflags = start;
-
if (flags & FS_IMMUTABLE_FL)
-   xflags |= FS_XFLAG_IMMUTABLE;
+   fa->fsx_xflags |= FS_XFLAG_IMMUTABLE;
else
-   xflags &= ~FS_XFLAG_IMMUTABLE;
+   fa->fsx_xflags &= ~FS_XFLAG_IMMUTABLE;
if (flags & FS_APPEND_FL)
-   xflags |= FS_XFLAG_APPEND;
+   fa->fsx_xflags |= FS_XFLAG_APPEND;
else
-   xflags &= ~FS_XFLAG_APPEND;
+   fa->fsx_xflags &= ~FS_XFLAG_APPEND;
if (flags & FS_SYNC_FL)
-   xflags |= FS_XFLAG_SYNC;
+   fa->fsx_xflags |= FS_XFLAG_SYNC;
else
-   xflags &= ~FS_XFLAG_SYNC;
+   fa->fsx_xflags &= ~FS_XFLAG_SYNC;
if (flags & FS_NOATIME_FL)
-   xflags |= FS_XFLAG_NOATIME;
+   fa->fsx_xflags |= FS_XFLAG_NOATIME;
else
-   xflags &= ~FS_XFLAG_NOATIME;
+   fa->fsx_xflags &= ~FS_XFLAG_NOATIME;
if (flags & FS_NODUMP_FL)
-   xflags |= FS_XFLAG_NODUMP;
+   fa->fsx_xflags |= FS_XFLAG_NODUMP;
else
-   xflags &= ~FS_XFLAG_NODUMP;
-
-   return xflags;
+   fa->fsx_xflags &= ~FS_XFLAG_NODUMP;
 }
 
 STATIC unsigned int
@@ -1504,7 +1500,7 @@ xfs_ioc_setxflags(
return -EOPNOTSUPP;
 
__xfs_ioc_fsgetxattr(ip, false, &fa);
-   fa.fsx_xflags = xfs_merge_ioc_xflags(flags, fa.fsx_xflags);
+   xfs_merge_ioc_xflags(&fa, flags);
 
error = mnt_want_write_file(filp);
if (error)
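
For readers who want to try the new calling convention outside the kernel, here is a userspace sketch that edits struct fsxattr in place. It is not the patch's code: the lookup table is a swapped-in alternative to the if/else chain above, merge_ioc_xflags is a hypothetical name, and only the constants and struct from <linux/fs.h> are real.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/fs.h>	/* struct fsxattr, FS_*_FL, FS_XFLAG_* */

/* One row per flag that SETFLAGS and FSSETXATTR can both express. */
static const struct {
	unsigned int fl;	/* FS_IOC_SETFLAGS bit */
	unsigned int xflag;	/* matching fsx_xflags bit */
} flag_map[] = {
	{ FS_IMMUTABLE_FL,	FS_XFLAG_IMMUTABLE },
	{ FS_APPEND_FL,		FS_XFLAG_APPEND },
	{ FS_SYNC_FL,		FS_XFLAG_SYNC },
	{ FS_NOATIME_FL,	FS_XFLAG_NOATIME },
	{ FS_NODUMP_FL,		FS_XFLAG_NODUMP },
};

/* Set or clear each mapped xflag in place; other xflags are untouched. */
void merge_ioc_xflags(struct fsxattr *fa, unsigned int flags)
{
	assert(fa != NULL);

	for (size_t i = 0; i < sizeof(flag_map) / sizeof(flag_map[0]); i++) {
		if (flags & flag_map[i].fl)
			fa->fsx_xflags |= flag_map[i].xflag;
		else
			fa->fsx_xflags &= ~flag_map[i].xflag;
	}
}
```

A caller fills a struct fsxattr with the current attributes and then calls merge_ioc_xflags(&fa, flags), mirroring the updated xfs_ioc_setxflags() call site in the hunk above.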

[PATCH 4/4] vfs: teach vfs_ioc_fssetxattr_check to check extent size hints

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

Move the extent size hint checks that aren't xfs-specific to the vfs.

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   18 +
 fs/xfs/xfs_ioctl.c |   70 ++--
 2 files changed, 47 insertions(+), 41 deletions(-)


diff --git a/fs/inode.c b/fs/inode.c
index 40ecd3a6a188..a3757051fd55 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2214,6 +2214,24 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
return -EINVAL;
}
 
+   /* Check extent size hints. */
+   if ((fa->fsx_xflags & FS_XFLAG_EXTSIZE) && !S_ISREG(inode->i_mode))
+   return -EINVAL;
+
+   if ((fa->fsx_xflags & FS_XFLAG_EXTSZINHERIT) &&
+   !S_ISDIR(inode->i_mode))
+   return -EINVAL;
+
+   if ((fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
+   !S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
+   return -EINVAL;
+
+   /* Extent size hints of zero turn off the flags. */
+   if (fa->fsx_extsize == 0)
+   fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
+   if (fa->fsx_cowextsize == 0)
+   fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE;
+
return 0;
 }
 EXPORT_SYMBOL(vfs_ioc_fssetxattr_check);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 82961de98900..b494e7e881e3 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1201,39 +1201,31 @@ xfs_ioctl_setattr_check_extsize(
struct fsxattr  *fa)
 {
struct xfs_mount*mp = ip->i_mount;
-
-   if ((fa->fsx_xflags & FS_XFLAG_EXTSIZE) && !S_ISREG(VFS_I(ip)->i_mode))
-   return -EINVAL;
-
-   if ((fa->fsx_xflags & FS_XFLAG_EXTSZINHERIT) &&
-   !S_ISDIR(VFS_I(ip)->i_mode))
-   return -EINVAL;
+   xfs_extlen_t    size;
+   xfs_fsblock_t   extsize_fsb;
 
if (S_ISREG(VFS_I(ip)->i_mode) && ip->i_d.di_nextents &&
((ip->i_d.di_extsize << mp->m_sb.sb_blocklog) != fa->fsx_extsize))
return -EINVAL;
 
-   if (fa->fsx_extsize != 0) {
-   xfs_extlen_t    size;
-   xfs_fsblock_t   extsize_fsb;
-
-   extsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_extsize);
-   if (extsize_fsb > MAXEXTLEN)
-   return -EINVAL;
+   if (fa->fsx_extsize == 0)
+   return 0;
 
-   if (XFS_IS_REALTIME_INODE(ip) ||
-   (fa->fsx_xflags & FS_XFLAG_REALTIME)) {
-   size = mp->m_sb.sb_rextsize << mp->m_sb.sb_blocklog;
-   } else {
-   size = mp->m_sb.sb_blocksize;
-   if (extsize_fsb > mp->m_sb.sb_agblocks / 2)
-   return -EINVAL;
-   }
+   extsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_extsize);
+   if (extsize_fsb > MAXEXTLEN)
+   return -EINVAL;
 
-   if (fa->fsx_extsize % size)
+   if (XFS_IS_REALTIME_INODE(ip) ||
+   (fa->fsx_xflags & FS_XFLAG_REALTIME)) {
+   size = mp->m_sb.sb_rextsize << mp->m_sb.sb_blocklog;
+   } else {
+   size = mp->m_sb.sb_blocksize;
+   if (extsize_fsb > mp->m_sb.sb_agblocks / 2)
return -EINVAL;
-   } else
-   fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
+   }
+
+   if (fa->fsx_extsize % size)
+   return -EINVAL;
 
return 0;
 }
@@ -1259,6 +1251,8 @@ xfs_ioctl_setattr_check_cowextsize(
struct fsxattr  *fa)
 {
struct xfs_mount*mp = ip->i_mount;
+   xfs_extlen_t    size;
+   xfs_fsblock_t   cowextsize_fsb;
 
if (!(fa->fsx_xflags & FS_XFLAG_COWEXTSIZE))
return 0;
@@ -1267,25 +1261,19 @@ xfs_ioctl_setattr_check_cowextsize(
ip->i_d.di_version != 3)
return -EINVAL;
 
-   if (!S_ISREG(VFS_I(ip)->i_mode) && !S_ISDIR(VFS_I(ip)->i_mode))
-   return -EINVAL;
-
-   if (fa->fsx_cowextsize != 0) {
-   xfs_extlen_t    size;
-   xfs_fsblock_t   cowextsize_fsb;
+   if (fa->fsx_cowextsize == 0)
+   return 0;
 
-   cowextsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
-   if (cowextsize_fsb > MAXEXTLEN)
-   return -EINVAL;
+   cowextsize_fsb = XFS_B_TO_FSB(mp, fa->fsx_cowextsize);
+   if (cowextsize_fsb > MAXEXTLEN)
+   return -EINVAL;
 
-   size = mp->m_sb.sb_blocksize;
-   if (cowextsize

[PATCH 1/6] mm/fs: don't allow writes to immutable files

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Once the flag is set, it is enforced for quite a few file operations,
such as fallocate, fpunch, fzero, rm, touch, open, etc.  However, we
don't check for immutability when doing a write(), a PROT_WRITE mmap(),
a truncate(), or a write to a previously established mmap.

If a program has an open write fd to a file that the administrator
subsequently marks immutable, the program still can change the file
contents.  Weird!

The ability to write to an immutable file does not follow the manpage
promise that immutable files cannot be modified.  Worse yet it's
inconsistent with the behavior of other syscalls which don't allow
modifications of immutable files.

Therefore, add the necessary checks to make the write, mmap, and
truncate behavior consistent with what the manpage says and consistent
with other syscalls on filesystems which support IMMUTABLE.

Signed-off-by: Darrick J. Wong 
---
 fs/attr.c|   13 ++---
 mm/filemap.c |3 +++
 mm/memory.c  |3 +++
 mm/mmap.c|8 ++--
 4 files changed, 18 insertions(+), 9 deletions(-)


diff --git a/fs/attr.c b/fs/attr.c
index d22e8187477f..1fcfdcc5b367 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -233,19 +233,18 @@ int notify_change(struct dentry * dentry, struct iattr * 
attr, struct inode **de
 
WARN_ON_ONCE(!inode_is_locked(inode));
 
-   if (ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) {
-   if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-   return -EPERM;
-   }
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
+   if ((ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID | ATTR_TIMES_SET)) &&
+   IS_APPEND(inode))
+   return -EPERM;
 
/*
 * If utimes(2) and friends are called with times == NULL (or both
 * times are UTIME_NOW), then we need to check for write permission
 */
if (ia_valid & ATTR_TOUCH) {
-   if (IS_IMMUTABLE(inode))
-   return -EPERM;
-
if (!inode_owner_or_capable(inode)) {
error = inode_permission(inode, MAY_WRITE);
if (error)
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa..65433c7ab1d8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2940,6 +2940,9 @@ inline ssize_t generic_write_checks(struct kiocb *iocb, 
struct iov_iter *from)
loff_t count;
int ret;
 
+   if (IS_IMMUTABLE(inode))
+   return -EPERM;
+
if (!iov_iter_count(from))
return 0;
 
diff --git a/mm/memory.c b/mm/memory.c
index ddf20bd0c317..4311cfdade90 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2235,6 +2235,9 @@ static vm_fault_t do_page_mkwrite(struct vm_fault *vmf)
 
vmf->flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
 
+   if (vmf->vma->vm_file && IS_IMMUTABLE(file_inode(vmf->vma->vm_file)))
+   return VM_FAULT_SIGBUS;
+
ret = vmf->vma->vm_ops->page_mkwrite(vmf);
/* Restore original flags so that caller is not surprised */
vmf->flags = old_flags;
diff --git a/mm/mmap.c b/mm/mmap.c
index 7e8c3e8ae75f..ac1e32205237 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1483,8 +1483,12 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
case MAP_SHARED_VALIDATE:
if (flags & ~flags_mask)
return -EOPNOTSUPP;
-   if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
-   return -EACCES;
+   if (prot & PROT_WRITE) {
+   if (!(file->f_mode & FMODE_WRITE))
+   return -EACCES;
+   if (IS_IMMUTABLE(file_inode(file)))
+   return -EPERM;
+   }
 
/*
 * Make sure we don't allow writing to an append-only
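
The scenario in the commit message (a write fd that is already open when the file is marked immutable) can be probed from userspace roughly as follows. This is a hedged sketch: probe_write_after_immutable is a hypothetical helper, it needs CAP_LINUX_IMMUTABLE and a filesystem that supports FS_IOC_SETFLAGS, and it reports "could not test" when either is missing.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* FS_IOC_GETFLAGS, FS_IOC_SETFLAGS, FS_IMMUTABLE_FL */

/*
 * Create @path, mark it immutable through an already-open write fd,
 * then try to write through that fd.
 *
 * Returns  1 if the write went through,
 *          0 if the write was refused,
 *         -1 if the probe could not be set up (no privilege, no
 *            filesystem support, or the file could not be created).
 */
int probe_write_after_immutable(const char *path)
{
	int flags = 0;
	int result = -1;
	int fd;

	assert(path != NULL);

	fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
		goto out;

	flags |= FS_IMMUTABLE_FL;
	if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0)
		goto out;	/* typically EPERM or ENOTTY */

	result = (write(fd, "x", 1) == 1) ? 1 : 0;

	/* Clear the flag again so the file can be removed. */
	flags &= ~FS_IMMUTABLE_FL;
	(void)ioctl(fd, FS_IOC_SETFLAGS, &flags);
out:
	close(fd);
	unlink(path);
	return result;
}
```

Which of the three values comes back depends on the kernel, the filesystem, and the caller's privileges, so no particular result is claimed here.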

[PATCH 4/6] vfs: don't allow most setxattr to immutable files

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

The chattr manpage has this to say about immutable files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

However, we don't actually check the immutable flag in the setattr code,
which means that we can update inode flags and project ids and extent
size hints on supposedly immutable files.  Therefore, reject setflags
and fssetxattr calls that change anything else when the file is already
immutable and will remain that way.

Signed-off-by: Darrick J. Wong 
---
 fs/inode.c |   31 +++
 1 file changed, 31 insertions(+)


diff --git a/fs/inode.c b/fs/inode.c
index a3757051fd55..adfb458bf533 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2184,6 +2184,17 @@ int vfs_ioc_setflags_check(struct inode *inode, int 
oldflags, int flags)
!capable(CAP_LINUX_IMMUTABLE))
return -EPERM;
 
+   /*
+* We aren't allowed to change any other flags if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((oldflags & FS_IMMUTABLE_FL) &&
+   (flags & FS_IMMUTABLE_FL)) {
+   if ((oldflags & ~FS_IMMUTABLE_FL) !=
+   (flags & ~FS_IMMUTABLE_FL))
+   return -EPERM;
+   }
+
return 0;
 }
 EXPORT_SYMBOL(vfs_ioc_setflags_check);
@@ -2226,6 +2237,26 @@ int vfs_ioc_fssetxattr_check(struct inode *inode, const 
struct fsxattr *old_fa,
!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
return -EINVAL;
 
+   /*
+* We aren't allowed to change any fields if the immutable flag is
+* already set and is not being unset.
+*/
+   if ((old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) &&
+   (fa->fsx_xflags & FS_XFLAG_IMMUTABLE)) {
+   if ((old_fa->fsx_xflags & ~FS_XFLAG_IMMUTABLE) !=
+   (fa->fsx_xflags & ~FS_XFLAG_IMMUTABLE))
+   return -EPERM;
+   if (old_fa->fsx_projid != fa->fsx_projid)
+   return -EPERM;
+   if ((fa->fsx_xflags & (FS_XFLAG_EXTSIZE |
+  FS_XFLAG_EXTSZINHERIT)) &&
+   old_fa->fsx_extsize != fa->fsx_extsize)
+   return -EPERM;
+   if ((old_fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
+   old_fa->fsx_cowextsize != fa->fsx_cowextsize)
+   return -EPERM;
+   }
+
/* Extent size hints of zero turn off the flags. */
if (fa->fsx_extsize == 0)
fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT);
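
The FSSETXATTR half of this patch can be tried in userspace with the real struct fsxattr from <linux/fs.h>. The sketch below transcribes the posted hunk as-is (including the masked comparison); check_immutable_fsx is a hypothetical name for the extracted logic.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/fs.h>	/* struct fsxattr, FS_XFLAG_* */

/*
 * Returns -EPERM if @fa would change anything on a file that is
 * immutable in @old_fa and stays immutable in @fa, 0 otherwise.
 */
int check_immutable_fsx(const struct fsxattr *old_fa, const struct fsxattr *fa)
{
	assert(old_fa != NULL && fa != NULL);

	/* The rule only applies if the flag is set before and after. */
	if (!(old_fa->fsx_xflags & FS_XFLAG_IMMUTABLE) ||
	    !(fa->fsx_xflags & FS_XFLAG_IMMUTABLE))
		return 0;

	if ((old_fa->fsx_xflags & ~FS_XFLAG_IMMUTABLE) !=
	    (fa->fsx_xflags & ~FS_XFLAG_IMMUTABLE))
		return -EPERM;
	if (old_fa->fsx_projid != fa->fsx_projid)
		return -EPERM;
	if ((fa->fsx_xflags & (FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT)) &&
	    old_fa->fsx_extsize != fa->fsx_extsize)
		return -EPERM;
	if ((old_fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) &&
	    old_fa->fsx_cowextsize != fa->fsx_cowextsize)
		return -EPERM;

	return 0;
}
```

Re-sending identical attributes to an immutable file passes, as does a request that clears the immutable flag. Any other change is refused while the flag stays set.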

[PATCH 5/6] xfs: refactor setflags to use setattr code directly

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

Refactor the SETFLAGS implementation to use the SETXATTR code directly
instead of partially constructing a struct fsxattr and calling bits and
pieces of the setxattr code.  This reduces code size and becomes
necessary in the next patch to maintain the behavior of allowing
userspace to set immutable on an immutable file so long as nothing
/else/ about the attributes change.

Signed-off-by: Darrick J. Wong 
---
 fs/xfs/xfs_ioctl.c |   40 +++-
 1 file changed, 3 insertions(+), 37 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 88583b3e1e76..7b19ba2956ad 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1491,11 +1491,8 @@ xfs_ioc_setxflags(
struct file *filp,
void        __user *arg)
 {
-   struct xfs_trans    *tp;
struct fsxattr  fa;
-   struct fsxattr  old_fa;
unsigned int    flags;
-   int join_flags = 0;
int error;
 
if (copy_from_user(&flags, arg, sizeof(flags)))
@@ -1506,44 +1503,13 @@ xfs_ioc_setxflags(
  FS_SYNC_FL))
return -EOPNOTSUPP;
 
-   fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
+   __xfs_ioc_fsgetxattr(ip, false, &fa);
+   fa.fsx_xflags = xfs_merge_ioc_xflags(flags, fa.fsx_xflags);
 
error = mnt_want_write_file(filp);
if (error)
return error;
-
-   /*
-* Changing DAX config may require inode locking for mapping
-* invalidation. These need to be held all the way to transaction commit
-* or cancel time, so need to be passed through to
-* xfs_ioctl_setattr_get_trans() so it can apply them to the join call
-* appropriately.
-*/
-   error = xfs_ioctl_setattr_dax_invalidate(ip, &fa, &join_flags);
-   if (error)
-   goto out_drop_write;
-
-   tp = xfs_ioctl_setattr_get_trans(ip, join_flags);
-   if (IS_ERR(tp)) {
-   error = PTR_ERR(tp);
-   goto out_drop_write;
-   }
-
-   __xfs_ioc_fsgetxattr(ip, false, &old_fa);
-   error = vfs_ioc_fssetxattr_check(VFS_I(ip), &old_fa, &fa);
-   if (error) {
-   xfs_trans_cancel(tp);
-   goto out_drop_write;
-   }
-
-   error = xfs_ioctl_setattr_xflags(tp, ip, &fa);
-   if (error) {
-   xfs_trans_cancel(tp);
-   goto out_drop_write;
-   }
-
-   error = xfs_trans_commit(tp);
-out_drop_write:
+   error = xfs_ioctl_setattr(ip, &fa);
mnt_drop_write_file(filp);
return error;
 }

[PATCH 1/4] vfs: create a generic checking function for FS_IOC_SETFLAGS

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

Create a generic checking function for the incoming FS_IOC_SETFLAGS flag
values so that we can standardize the implementations that follow ext4's
flag values.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c|   13 +
 fs/efivarfs/file.c  |   18 +-
 fs/ext2/ioctl.c |   16 
 fs/ext4/ioctl.c |   13 +++--
 fs/f2fs/file.c  |7 ---
 fs/gfs2/file.c  |   42 +-
 fs/hfsplus/ioctl.c  |   21 -
 fs/inode.c  |   17 +
 fs/jfs/ioctl.c  |6 ++
 fs/nilfs2/ioctl.c   |9 ++---
 fs/ocfs2/ioctl.c|   13 +++--
 fs/reiserfs/ioctl.c |   10 --
 fs/ubifs/ioctl.c|   13 +++--
 include/linux/fs.h  |2 ++
 14 files changed, 107 insertions(+), 93 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6dafa857bbb9..f408aa93b0cf 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -187,7 +187,7 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
struct btrfs_inode *binode = BTRFS_I(inode);
struct btrfs_root *root = binode->root;
struct btrfs_trans_handle *trans;
-   unsigned int fsflags;
+   unsigned int fsflags, old_fsflags;
int ret;
const char *comp = NULL;
u32 binode_flags = binode->flags;
@@ -212,13 +212,10 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
inode_lock(inode);
 
fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
-   if ((fsflags ^ btrfs_inode_flags_to_fsflags(binode->flags)) &
-   (FS_APPEND_FL | FS_IMMUTABLE_FL)) {
-   if (!capable(CAP_LINUX_IMMUTABLE)) {
-   ret = -EPERM;
-   goto out_unlock;
-   }
-   }
+   old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
+   ret = vfs_ioc_setflags_check(inode, old_fsflags, fsflags);
+   if (ret)
+   goto out_unlock;
 
if (fsflags & FS_SYNC_FL)
binode_flags |= BTRFS_INODE_SYNC;
diff --git a/fs/efivarfs/file.c b/fs/efivarfs/file.c
index 8e568428c88b..f4f6c1bec132 100644
--- a/fs/efivarfs/file.c
+++ b/fs/efivarfs/file.c
@@ -110,16 +110,22 @@ static ssize_t efivarfs_file_read(struct file *file, char 
__user *userbuf,
return size;
 }
 
-static int
-efivarfs_ioc_getxflags(struct file *file, void __user *arg)
+static inline unsigned int efivarfs_getflags(struct inode *inode)
 {
-   struct inode *inode = file->f_mapping->host;
unsigned int i_flags;
unsigned int flags = 0;
 
i_flags = inode->i_flags;
if (i_flags & S_IMMUTABLE)
flags |= FS_IMMUTABLE_FL;
+   return flags;
+}
+
+static int
+efivarfs_ioc_getxflags(struct file *file, void __user *arg)
+{
+   struct inode *inode = file->f_mapping->host;
+   unsigned int flags = efivarfs_getflags(inode);
 
if (copy_to_user(arg, &flags, sizeof(flags)))
return -EFAULT;
@@ -132,6 +138,7 @@ efivarfs_ioc_setxflags(struct file *file, void __user *arg)
struct inode *inode = file->f_mapping->host;
unsigned int flags;
unsigned int i_flags = 0;
+   unsigned int oldflags = efivarfs_getflags(inode);
int error;
 
if (!inode_owner_or_capable(inode))
@@ -143,8 +150,9 @@ efivarfs_ioc_setxflags(struct file *file, void __user *arg)
if (flags & ~FS_IMMUTABLE_FL)
return -EOPNOTSUPP;
 
-   if (!capable(CAP_LINUX_IMMUTABLE))
-   return -EPERM;
+   error = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (error)
+   return error;
 
if (flags & FS_IMMUTABLE_FL)
i_flags |= S_IMMUTABLE;
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 0367c0039e68..88b3b9720023 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -60,18 +60,10 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
}
oldflags = ei->i_flags;
 
-   /*
-* The IMMUTABLE and APPEND_ONLY flags can only be changed by
-* the relevant capability.
-*
-* This test looks nicer. Thanks to Pauline Middelink
-*/
-   if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
-   if (!capable(CAP_LINUX_IMMUTABLE)) {
-   inode_unlock(inode);
-   ret = -EPERM;
-   goto setflags_out;
-   }
+   ret = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (ret) {
+   inode_unlock(inode);
+   goto setflags_out;
}
 
flags = flags & EXT2
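
The capability test that each filesystem open-coded before this patch can be modeled in a few lines. This is a sketch based on the removed btrfs and ext2 hunks above, not the body of vfs_ioc_setflags_check(), which this excerpt does not show in full. struct model_cred and model_setflags_check are hypothetical stand-ins for the kernel's capable(CAP_LINUX_IMMUTABLE) test.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/fs.h>	/* FS_APPEND_FL, FS_IMMUTABLE_FL */

/* Hypothetical stand-in for the caller's credentials. */
struct model_cred {
	bool cap_linux_immutable;
};

/*
 * Toggling the append-only or immutable bit requires the capability;
 * changes to any other bit are not judged here.
 */
int model_setflags_check(const struct model_cred *cred,
			 unsigned int oldflags, unsigned int flags)
{
	assert(cred != NULL);

	/* XOR isolates the bits that differ between old and new. */
	if (((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL)) &&
	    !cred->cap_linux_immutable)
		return -EPERM;

	return 0;
}
```

The XOR form means an unprivileged caller can still pass FS_IMMUTABLE_FL through unchanged while editing other flags, which is the behavior the later patches in the series tighten.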

[PATCH v3 0/6] vfs: make immutable files actually immutable

2019-06-10 Thread Darrick J. Wong

Hi all,

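The chattr(1) manpage has this to say about the immutable bit that
system administrators can set on files:

"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."

Note that various distro manpages point out the inconsistent behavior
of the various Linux filesystems w.r.t. immutable. This fixes all that.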

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This has been lightly tested with fstests. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=immutable-files

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=immutable-files

[PATCH 2/6] vfs: flush and wait for io when setting the immutable flag via SETFLAGS

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_SETFLAGS to set the immutable flag on a file, we
need to ensure that userspace can't continue to write the file after the
file becomes immutable.  To make that happen, we have to flush all the
dirty pagecache pages to disk to ensure that we can fail a page fault on
a mmap'd region, wait for pending directio to complete, and hope the
caller locked out any new writes by holding the inode lock.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c   |3 +++
 fs/efivarfs/file.c |5 +
 fs/ext2/ioctl.c|5 +
 fs/ext4/ioctl.c|3 +++
 fs/f2fs/file.c |3 +++
 fs/hfsplus/ioctl.c |3 +++
 fs/nilfs2/ioctl.c  |3 +++
 fs/ocfs2/ioctl.c   |3 +++
 fs/orangefs/file.c |   11 ---
 fs/orangefs/protocol.h |3 +++
 fs/reiserfs/ioctl.c|3 +++
 fs/ubifs/ioctl.c   |3 +++
 include/linux/fs.h |   34 ++
 13 files changed, 79 insertions(+), 3 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7ddda5b4b6a6..f431813b2454 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -214,6 +214,9 @@ static int btrfs_ioctl_setflags(struct file *file, void 
__user *arg)
fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
ret = vfs_ioc_setflags_check(inode, old_fsflags, fsflags);
+   if (ret)
+   goto out_unlock;
+   ret = vfs_ioc_setflags_flush_data(inode, fsflags);
if (ret)
goto out_unlock;
 
diff --git a/fs/efivarfs/file.c b/fs/efivarfs/file.c
index f4f6c1bec132..845016a67724 100644
--- a/fs/efivarfs/file.c
+++ b/fs/efivarfs/file.c
@@ -163,6 +163,11 @@ efivarfs_ioc_setxflags(struct file *file, void __user *arg)
return error;
 
inode_lock(inode);
+   error = vfs_ioc_setflags_flush_data(inode, flags);
+   if (error) {
+   inode_unlock(inode);
+   return error;
+   }
inode_set_flags(inode, i_flags, S_IMMUTABLE);
inode_unlock(inode);
 
diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c
index 88b3b9720023..75f75619237c 100644
--- a/fs/ext2/ioctl.c
+++ b/fs/ext2/ioctl.c
@@ -65,6 +65,11 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
inode_unlock(inode);
goto setflags_out;
}
+   ret = vfs_ioc_setflags_flush_data(inode, flags);
+   if (ret) {
+   inode_unlock(inode);
+   goto setflags_out;
+   }
 
flags = flags & EXT2_FL_USER_MODIFIABLE;
flags |= oldflags & ~EXT2_FL_USER_MODIFIABLE;
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 6aa1df1918f7..a05341b94d98 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -290,6 +290,9 @@ static int ext4_ioctl_setflags(struct inode *inode,
jflag = flags & EXT4_JOURNAL_DATA_FL;
 
err = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (err)
+   goto flags_out;
+   err = vfs_ioc_setflags_flush_data(inode, flags);
if (err)
goto flags_out;
 
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 183ed1ac60e1..d3cf4bdb8738 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1681,6 +1681,9 @@ static int __f2fs_ioc_setflags(struct inode *inode, 
unsigned int flags)
oldflags = fi->i_flags;
 
err = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (err)
+   return err;
+   err = vfs_ioc_setflags_flush_data(inode, flags);
if (err)
return err;
 
diff --git a/fs/hfsplus/ioctl.c b/fs/hfsplus/ioctl.c
index 862a3c9481d7..f8295fa35237 100644
--- a/fs/hfsplus/ioctl.c
+++ b/fs/hfsplus/ioctl.c
@@ -104,6 +104,9 @@ static int hfsplus_ioctl_setflags(struct file *file, int 
__user *user_flags)
inode_lock(inode);
 
err = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (err)
+   goto out_unlock_inode;
+   err = vfs_ioc_setflags_flush_data(inode, flags);
if (err)
goto out_unlock_inode;
 
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 0632336d2515..a3c200ab9f60 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -149,6 +149,9 @@ static int nilfs_ioctl_setflags(struct inode *inode, struct 
file *filp,
oldflags = NILFS_I(inode)->i_flags;
 
ret = vfs_ioc_setflags_check(inode, oldflags, flags);
+   if (ret)
+   goto out;
+   ret = vfs_ioc_setflags_flush_data(inode, flags);
if (ret)
goto out;
 
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index 467a2faf0305..e91ca0dad3d7 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -107,6 +107,9 @@ static int ocfs2_set_inode_attr(struct inode *inode, 
unsigned flags,
flags |

[PATCH 3/6] vfs: flush and wait for io when setting the immutable flag via FSSETXATTR

2019-06-10 Thread Darrick J. Wong

From: Darrick J. Wong 

When we're using FS_IOC_FSSETXATTR to set the immutable flag on a file,
we need to ensure that userspace can't continue to write the file after
the file becomes immutable.  To make that happen, we have to flush all
the dirty pagecache pages to disk to ensure that we can fail a page
fault on a mmap'd region, wait for pending directio to complete, and
hope the caller locked out any new writes by holding the inode lock.

Signed-off-by: Darrick J. Wong 
---
 fs/btrfs/ioctl.c   |3 +++
 fs/ext4/ioctl.c|3 +++
 fs/f2fs/file.c |3 +++
 fs/xfs/xfs_ioctl.c |   39 +--
 include/linux/fs.h |   23 +++
 5 files changed, 65 insertions(+), 6 deletions(-)


diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f431813b2454..63a9281e6ce0 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -432,6 +432,9 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void 
__user *arg)
 
__btrfs_ioctl_fsgetxattr(binode, &old_fa);
ret = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (ret)
+   goto out_unlock;
+   ret = vfs_ioc_fssetxattr_flush_data(inode, &fa);
if (ret)
goto out_unlock;
 
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a05341b94d98..6037585c1520 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -1115,6 +1115,9 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, 
unsigned long arg)
inode_lock(inode);
ext4_fsgetxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (err)
+   goto out;
+   err = vfs_ioc_fssetxattr_flush_data(inode, &fa);
if (err)
goto out;
flags = (ei->i_flags & ~EXT4_FL_XFLAG_VISIBLE) |
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index d3cf4bdb8738..97f4bb36540f 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -2832,6 +2832,9 @@ static int f2fs_ioc_fssetxattr(struct file *filp, 
unsigned long arg)
 
__f2fs_ioc_fsgetxattr(inode, &old_fa);
err = vfs_ioc_fssetxattr_check(inode, &old_fa, &fa);
+   if (err)
+   goto out;
+   err = vfs_ioc_fssetxattr_flush_data(inode, &fa);
if (err)
goto out;
flags = (fi->i_flags & ~F2FS_FL_XFLAG_VISIBLE) |
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index b494e7e881e3..88583b3e1e76 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1014,6 +1014,28 @@ xfs_diflags_to_linux(
 #endif
 }
 
+/*
+ * Lock the inode against file io and page faults, then flush all dirty pages
+ * and wait for writeback and direct IO operations to finish.  Returns with
+ * the relevant inode lock flags set in @join_flags.  Caller is responsible for
+ * unlocking even on error return.
+ */
+static int
+xfs_ioctl_setattr_flush(
+   struct xfs_inode*ip,
+   int *join_flags)
+{
+   /* Already locked the inode from IO?  Assume we're done. */
+   if (((*join_flags) & (XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL)) ==
+(XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL))
+   return 0;
+
+   /* Lock and flush all mappings and IO in preparation for flag change */
+   *join_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+   xfs_ilock(ip, *join_flags);
+   return inode_flush_data(VFS_I(ip));
+}
+
 static int
 xfs_ioctl_setattr_xflags(
struct xfs_trans*tp,
@@ -1099,23 +1121,22 @@ xfs_ioctl_setattr_dax_invalidate(
if (!(fa->fsx_xflags & FS_XFLAG_DAX) && !IS_DAX(inode))
return 0;
 
-   if (S_ISDIR(inode->i_mode))
+   if (!S_ISREG(inode->i_mode))
return 0;
 
-   /* lock, flush and invalidate mapping in preparation for flag change */
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
-   error = filemap_write_and_wait(inode->i_mapping);
+   error = xfs_ioctl_setattr_flush(ip, join_flags);
if (error)
goto out_unlock;
error = invalidate_inode_pages2(inode->i_mapping);
if (error)
goto out_unlock;
 
-   *join_flags = XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL;
return 0;
 
 out_unlock:
-   xfs_iunlock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
+   if (*join_flags)
+   xfs_iunlock(ip, *join_flags);
+   *join_flags = 0;
return error;
 
 }
@@ -1337,6 +1358,12 @@ xfs_ioctl_setattr(
if (code)
goto error_free_dquots;
 
+   if (!join_flags && vfs_ioc_fssetxattr_need_flush(VFS_I(ip), fa)) {
+   code = xfs_ioctl_setattr_flush(ip, &join_flags);
+   if (code)
+   goto error_free_dquots;
+   }
+
tp = xfs_ioctl_setattr_get_trans(ip, join_fla

Re: [PATCH 1/8] mm/fs: don't allow writes to immutable files

2019-06-10 Thread Darrick J. Wong

On Mon, Jun 10, 2019 at 04:41:54PM -0400, Theodore Ts'o wrote:
> On Mon, Jun 10, 2019 at 09:09:34AM -0700, Darrick J. Wong wrote:
> > > I was planning on only taking 8/8 through the ext4 tree.  I also added
> > > a patch which filtered writes, truncates, and page_mkwrites (but not
> > > mmap) for immutable files at the ext4 level.
> > 
> > *Oh*.  I saw your reply attached to the 1/8 patch and thought that was
> > the one you were taking.  I was sort of surprised, tbh. :)
> 
> Sorry, my bad.  I mis-replied to the wrong e-mail message  :-)

Also ... after flailing around with the v2 series I decided that it
would be much less work to refactor all the current implementations to
call a common parameter-checking function, which will hopefully make the
behavior of SETFLAGS and FSSETXATTR more consistent across filesystems.
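
To illustrate the idea of a shared parameter check, here is a minimal userspace sketch, not the actual patch. All names (demo_check_flags_change, demo_fs_a_setflags, demo_fs_b_setflags, the DEMO_FL_* bits) are hypothetical, and the only rule modelled is the assumption that toggling the immutable or append-only bit requires privilege:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; not the kernel's FS_*_FL values. */
#define DEMO_FL_IMMUTABLE	(1u << 0)
#define DEMO_FL_APPEND		(1u << 1)
#define DEMO_FL_NODUMP		(1u << 2)

/*
 * Hypothetical common checker.  Every filesystem's flag-setting handler
 * calls this before applying the new flags, so the policy lives in one
 * place.  Returns 0 if the change is allowed, -EPERM otherwise.
 */
static int demo_check_flags_change(uint32_t old_flags, uint32_t new_flags,
				   bool privileged)
{
	uint32_t changed = old_flags ^ new_flags;

	/* Assumed rule: toggling immutable/append-only needs privilege. */
	if ((changed & (DEMO_FL_IMMUTABLE | DEMO_FL_APPEND)) && !privileged)
		return -EPERM;
	return 0;
}

/* Two hypothetical per-filesystem handlers that share the same check. */
static int demo_fs_a_setflags(uint32_t *inode_flags, uint32_t new_flags,
			      bool privileged)
{
	int error = demo_check_flags_change(*inode_flags, new_flags, privileged);

	if (error)
		return error;
	*inode_flags = new_flags;
	return 0;
}

static int demo_fs_b_setflags(uint32_t *inode_flags, uint32_t new_flags,
			      bool privileged)
{
	int error = demo_check_flags_change(*inode_flags, new_flags, privileged);

	if (error)
		return error;
	/* A filesystem-specific step (e.g. journaling) would go here. */
	*inode_flags = new_flags;
	return 0;
}
```

Because both handlers defer to the same function, a rejected request leaves the stored flags untouched in either filesystem, which is the consistency the refactoring is after.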

That makes the immutable series much less code and fewer patches, but
also means that the 8/8 patch isn't needed anymore.

I'm about to send both out.

--D

> > > I *could* take this patch through the mm/fs tree, but I wasn't sure
> > > what your plans were for the rest of the patch series, and it seemed
> > > like it hadn't gotten much review/attention from other fs or mm folks
> > > (well, I guess Brian Foster weighed in).
> > 
> > > What do you think?
> > 
> > Not sure.  The comments attached to the LWN story were sort of nasty,
> > and now that a couple of people said "Oh, well, Debian documented the
> > inconsistent behavior so just let it be" I haven't felt like
> > resurrecting the series for 5.3.
> 
> Ah, I had missed the LWN article.   
> 
> Yeah, it's the same set of issues that we had discussed when this
> first came up.  We can go round and round on this one; it's true that
> root can now cause random programs which have a file mmap'ed for
> writing to seg fault, but root has a million ways of killing and
> otherwise harming running application programs, and it's unlikely
> files get marked immutable all that often.  We just have to pick
> one way of doing things, and let it be the same across all the file
> systems.
> 
> My understanding was that XFS had chosen to make the inode immutable
> as soon as the flag is set (as opposed to forbidding new fd's to be
> opened which were writeable), and I was OK moving ext4 to that common
> interpretation of the immutable bit, even though it would be a change
> to ext4.
> 
> And then when I saw that Amir had included a patch that would cause
> test failures unless that patch series was applied, it seemed that we
> had all thought that the change was a done deal.  Perhaps we should
> have had a more explicit discussion when the test was sent for review,
> but I had assumed it was exclusively a copy_file_range set of tests,
> so I didn't realize it was going to cause ext4 failures.
> 
>- Ted
