[PATCH] Documentation: document d_prune op in vfs.txt

2019-04-01 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 761c6fd24a53..4f1638e5f95b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -1013,6 +1013,7 @@ struct dentry_operations {
 	int (*d_delete)(const struct dentry *);
 	int (*d_init)(struct dentry *);
 	void (*d_release)(struct dentry *);
+	void (*d_prune)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
 	char *(*d_dname)(struct dentry *, char *, int);
 	struct vfsmount *(*d_automount)(struct path *);
@@ -1087,6 +1088,9 @@ struct dentry_operations {
 
   d_release: called when a dentry is really deallocated
 
+  d_prune: called prior to pruning (i.e. unhashing and killing) a hashed
+   dentry from the dcache.
+
   d_iput: called when a dentry loses its inode (just prior to its
 	being deallocated). The default when this is NULL is that the
 	VFS calls iput(). If you define this method, you must call
-- 
2.20.1
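
The ordering described in the new vfs.txt text can be pictured with a small userspace model. This is only a sketch under stated assumptions: the model_* names and the cut-down structures are hypothetical and are not the kernel's struct dentry or dcache code. It shows a d_prune callback being invoked while the dentry is still hashed, followed by unhashing and then d_release.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dentry;

/* Hypothetical, cut-down stand-in for struct dentry_operations. */
struct model_dentry_ops {
	void (*d_prune)(struct model_dentry *);
	void (*d_release)(struct model_dentry *);
};

struct model_dentry {
	const struct model_dentry_ops *ops;
	bool hashed;          /* still reachable through the model dcache */
	bool hashed_at_prune; /* filled in by the example d_prune callback */
	int prune_calls;
	int release_calls;
};

/* Example d_prune: record whether the dentry was still hashed. */
void model_note_prune(struct model_dentry *d)
{
	d->hashed_at_prune = d->hashed;
	d->prune_calls++;
}

/* Example d_release: just count the call. */
void model_note_release(struct model_dentry *d)
{
	d->release_calls++;
}

/* Model of the kill path: notify the fs, unhash, then release. */
void model_dentry_kill(struct model_dentry *d)
{
	assert(d != NULL && d->ops != NULL);
	if (d->ops->d_prune)
		d->ops->d_prune(d);
	d->hashed = false;
	if (d->ops->d_release)
		d->ops->d_release(d);
}
```

A filesystem that caches per-directory state keyed on hashed dentries would use the d_prune hook to invalidate that state before the dentry disappears from the hash.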



Re: [PATCH] Add errseq_t documentation to the tree

2017-12-22 Thread Jeff Layton
On Fri, 2017-12-22 at 05:04 -0800, Matthew Wilcox wrote:
>  - Add it under 'Core API' because I think that's where it lives.
>  - Promote the header to a more prominent header type, otherwise we get three
>entries in the table of contents.
>  - Reformat the table to look nicer and be a little more proportional in
>terms of horizontal width per bit (the SF bit is still disproportionately
>large, but there's no way to fix that).
> 
> Signed-off-by: Matthew Wilcox 
> 
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index eb16ba30aeb6..b8ec120c24f9 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -22,6 +22,7 @@ Core utilities
> flexible-arrays
> librs
> genalloc
> +   ../errseq
> 

Should we also move the file into core-api/ dir?

>  
>  Interfaces for kernel debugging
>  ===============================
> diff --git a/Documentation/errseq.rst b/Documentation/errseq.rst
> index 4c29bd5afbc5..7c3ac9639ebf 100644
> --- a/Documentation/errseq.rst
> +++ b/Documentation/errseq.rst
> @@ -1,5 +1,7 @@
> +=====================
>  The errseq_t datatype
>  =====================
> +
>  An errseq_t is a way of recording errors in one place, and allowing any
>  number of "subscribers" to tell whether it has changed since a previous
>  point where it was sampled.
> @@ -21,12 +23,13 @@ a flag to tell whether the value has been sampled since a new value was
>  recorded.  That allows us to avoid bumping the counter if no one has
>  sampled it since the last time an error was recorded.
>  
> -Thus we end up with a value that looks something like this::
> +Thus we end up with a value that looks something like this:
>  
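> -    bit:  31..13        12        11..0
> -    +-----------------+----+----------------+
> -    |     counter     | SF |      errno     |
> -    +-----------------+----+----------------+
> ++----------+----+--------+
> +| 31..13   | 12 | 11..0  |
> ++----------+----+--------+
> +| counter  | SF | errno  |
> ++----------+----+--------+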
>  
>  The general idea is for "watchers" to sample an errseq_t value and keep
>  it as a running cursor.  That value can later be used to tell whether
> @@ -42,6 +45,7 @@ has ever been an error set since it was first initialized.
>  
>  API usage
>  =========
> +
>  Let me tell you a story about a worker drone.  Now, he's a good worker
>  overall, but the company is a little...management heavy.  He has to
>  report to 77 supervisors today, and tomorrow the "big boss" is coming in
> @@ -125,6 +129,7 @@ not usable by anyone else.
>  
>  Serializing errseq_t cursor updates
>  ===================================
> +
>  Note that the errseq_t API does not protect the errseq_t cursor during a
>  check_and_advance_operation. Only the canonical error code is handled
>  atomically.  In a situation where more than one task might be using the

Thanks for the cleanup, looks good.

Reviewed-by: Jeff Layton 
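
For readers following the bit layout in the table above, here is a single-threaded userspace sketch of the idea: errno in bits 11..0, the "seen" flag (SF) in bit 12, and the counter in bits 31..13. It is a model under stated assumptions, not the kernel's lib/errseq.c: the model_* names are hypothetical, and the real code updates the value atomically, which this sketch does not attempt.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095u       /* errno lives in bits 11..0 */
#define MODEL_SEEN      (1u << 12)  /* SF: sampled since last error */
#define MODEL_CTR_INC   (1u << 13)  /* counter lives in bits 31..13 */

typedef uint32_t model_errseq_t;

/* Record an error; bump the counter only if someone saw the old value. */
void model_errseq_set(model_errseq_t *eseq, int err)
{
	model_errseq_t old, next;

	assert(err < 0 && (uint32_t)(-err) <= MODEL_MAX_ERRNO);
	old = *eseq;
	next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (uint32_t)(-err);
	if (old & MODEL_SEEN)
		next += MODEL_CTR_INC;
	*eseq = next;
}

/* Take a cursor; mark the value as seen if an error was ever recorded. */
model_errseq_t model_errseq_sample(model_errseq_t *eseq)
{
	if (*eseq != 0)
		*eseq |= MODEL_SEEN;
	return *eseq;
}

/* Report the error once per cursor, then advance the cursor. */
int model_errseq_check_and_advance(model_errseq_t *eseq,
				   model_errseq_t *since)
{
	if (*eseq == *since)
		return 0;
	*eseq |= MODEL_SEEN;
	*since = *eseq;
	return -(int)(*eseq & MODEL_MAX_ERRNO);
}
```

Two watchers that each hold their own cursor both see a recorded error exactly once, and recording the same error twice with no sample in between leaves the counter untouched, which is the point of the SF bit.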
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Documentation: filesystems: update filesystem locking documentation

2017-08-01 Thread Jeff Layton
On Tue, 2017-08-01 at 07:09 -0400, Sean Anderson wrote:
> Documentation/filesystems/Locking no longer reflects current locking
> semantics. i_mutex is no longer used for locking, and has been superseded
> by i_rwsem. Additionally, ->iterate_shared() was not documented.
> 
> Signed-off-by: Sean Anderson 
> ---
> v2: changed 'yes's to 'exclusive's when describing i_rwsem usage
> 
>  Documentation/filesystems/Locking | 43 ++++++++++++++++++++++++-------------------
>  1 file changed, 24 insertions(+), 19 deletions(-)
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index fe25787ff6d4..c0cab97d2b1a 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -69,31 +69,31 @@ prototypes:
>  
>  locking rules:
>   all may block
> -              i_mutex(inode)
> -lookup:       yes
> -create:       yes
> -link:         yes (both)
> -mknod:        yes
> -symlink:      yes
> -mkdir:        yes
> -unlink:       yes (both)
> -rmdir:        yes (both)        (see below)
> -rename:       yes (all)         (see below)
> +              i_rwsem(inode)
> +lookup:       shared
> +create:       exclusive
> +link:         exclusive (both)
> +mknod:        exclusive
> +symlink:      exclusive
> +mkdir:        exclusive
> +unlink:       exclusive (both)
> +rmdir:        exclusive (both)  (see below)
> +rename:       exclusive (all)   (see below)
>  readlink:     no
>  get_link:     no
> -setattr:      yes
> +setattr:      exclusive
>  permission:   no (may not block if called in rcu-walk mode)
>  get_acl:      no
>  getattr:      no
>  listxattr:    no
>  fiemap:       no
>  update_time:  no
> -atomic_open:  yes
> +atomic_open:  exclusive
>  tmpfile:      no
>  
>  
> - Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
> -victim.
> + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
> + exclusive on victim.
>   cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
>  
>  See Documentation/filesystems/directory-locking for more detailed discussion
> @@ -111,10 +111,10 @@ prototypes:
>  
>  locking rules:
>   all may block
> -              i_mutex(inode)
> +              i_rwsem(inode)
>  list:         no
>  get:          no
> -set:          yes
> +set:          exclusive
>  
>  --- super_operations ---
>  prototypes:
> @@ -217,14 +217,14 @@ prototypes:
>  locking rules:
>   All except set_page_dirty and freepage may block
>  
> -                 PageLocked(page)     i_mutex
> +                 PageLocked(page)     i_rwsem
>  writepage:       yes, unlocks (see below)
>  readpage:        yes, unlocks
>  writepages:
>  set_page_dirty   no
>  readpages:
> -write_begin:     locks the page       yes
> -write_end:       yes, unlocks         yes
> +write_begin:     locks the page       exclusive
> +write_end:       yes, unlocks         exclusive
>  bmap:
>  invalidatepage:  yes
>  releasepage: yes
> @@ -439,6 +439,7 @@ prototypes:
>   ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
>   ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
>   int (*iterate) (struct file *, struct dir_context *);
> + int (*iterate_shared) (struct file *, struct dir_context *);
>   unsigned int (*poll) (struct file *, struct poll_table_struct *);
>   long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
>   long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
> @@ -480,6 +481,10 @@ mutex or just to use i_size_read() instead.
>  Note: this does not protect the file->f_pos against concurrent modifications
>  since this is something the userspace has to take care about.
>  
> +->iterate() is called with i_rwsem exclusive.
> +
> +->iterate_shared() is called with i_rwsem at least shared.
> +
>  ->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags.
>  Most instances call fasync_helper(), which does that maintenance, so it's
>  not normally something one needs to worry about.  Return values > 0 will be

Reviewed-by: Jeff Layton 


Re: [xfstests PATCH v3 5/5] btrfs: allow it to use $SCRATCH_LOGDEV

2017-06-08 Thread Jeff Layton
On Tue, 2017-06-06 at 17:19 +0800, Eryu Guan wrote:
> On Wed, May 31, 2017 at 09:08:20AM -0400, Jeff Layton wrote:
> > With btrfs, we can't really put the log on a separate device. What we
> > can do however is mirror the metadata across two devices and make the
> > data striped across all devices. When we turn on dmerror then the
> > metadata can fall back to using the other mirror while the data errors
> > out.
> > 
> > Note that the current incarnation of btrfs has a fixed 64k stripe
> > width. If that ever changes or becomes settable, we may need to adjust
> > the amount of data that the test program writes.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  common/rc | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/common/rc b/common/rc
> > index 83765aacfb06..078270451b53 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -830,6 +830,8 @@ _scratch_mkfs()
> > ;;
> > btrfs)
> > mkfs_cmd="$MKFS_BTRFS_PROG"
> > +   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
> > +   mkfs_cmd="$mkfs_cmd -d raid0 -m raid1 $SCRATCH_LOGDEV"
> 
> I don't think this is the correct way to do it. If btrfs doesn't support
> external log device, then this test doesn't fit btrfs, or we need other
> ways to test btrfs.
> 
> One of the problems of this hack is that raid1 requires all devices are
> in the same size, we have a _require_scratch_dev_pool_equal_size() rule
> to check on it, but this hack doesn't do the proper check and test fails
> if SCRATCH_LOGDEV is smaller or bigger in size.
> 
> If btrfs "-d raid0 -m raid1" is capable to do this writeback error test,
> perhaps you can write a new btrfs test and mkfs with "-d raid0 -m raid1"
> explicitly. e.g.
> 
> ...
> _require_scratch_dev_pool 2
> _require_scratch_dev_pool_equal_size
> ...
> _scratch_mkfs "-d raid0 -m raid1"
> ...
> 
> Thanks,
> Eryu


Yeah, that's probably the right way to do this. It looks like btrfs also
has $SCRATCH_DEV_POOL, and we can probably base it on that. I'll look at
reworking it.

-- 
Jeff Layton 


Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Jeff Layton
On Tue, 2017-06-06 at 10:17 -0700, Darrick J. Wong wrote:
> On Tue, Jun 06, 2017 at 08:23:25PM +0800, Eryu Guan wrote:
> > On Tue, Jun 06, 2017 at 06:15:57AM -0400, Jeff Layton wrote:
> > > On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> > > > On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > > > > I'm working on a set of kernel patches to change how writeback errors
> > > > > are handled and reported in the kernel. Instead of reporting a
> > > > > writeback error to only the first fsync caller on the file, I aim
> > > > > to make the kernel report them once on every file description.
> > > > > 
> > > > > This patch adds a test for the new behavior. Basically, open many fds
> > > > > to the same file, turn on dm_error, write to each of the fds, and then
> > > > > fsync them all to ensure that they all get an error back.
> > > > > 
> > > > > To do that, I'm adding a new tools/dmerror script that the C program
> > > > > can use to load the error table. For now, that's all it can do, but
> > > > > we can fill it out with other commands as necessary.
> > > > > 
> > > > > Signed-off-by: Jeff Layton 
> > > > 
> > > > Thanks for the new tests! And sorry for the late review..
> > > > 
> > > > It's testing a new behavior on error reporting on writeback, I'm not
> > > > sure if we can call it a new feature or it fixed a bug? But it's more
> > > > like a behavior change, I'm not sure how to categorize it.
> > > > 
> > > > Because if it's testing a new feature, we usually let test do proper
> > > > detection of current test environment (based on actual behavior not
> > > > kernel version) and _notrun on filesystems that don't have this feature
> > > > yet, instead of failing the test; if it's testing a bug fix, we could
> > > > leave the test fail on unfixed filesystems, this also serves as a
> > > > reminder that there's bug to fix.
> > > > 
> > > 
> > > Thanks for the review! I'm not sure how to categorize this either. Since
> > > the plan is to convert all the filesystems piecemeal, maybe we should
> > > just consider it a new feature.
> > 
> > Then we need a new _require rule to properly detect for the 'feature'
> > support. I'm not sure if this is doable, but something like
> > _require_statx, _require_seek_data_hole would be good.
> > 
> > > 
> > > > I pulled your test kernel tree, and test passed on EXT4 but failed on
> > > > other local filesystems (XFS, btrfs). I assume that's expected.
> > > > 
> > > > Besides this kinda high-level question, some minor comments inline.
> > > > 
> > > 
> > > Yes, ext4 should pass on my latest kernel tree, but everything else
> > > should fail. 

Oh, and I should mention that ext2/3 also pass when mounted using the
ext4 driver. The legacy ext2 driver sort of works, but it reports a few
too many errors because of the way the ext2_error macro works. That
shouldn't be too hard to fix, I just need some guidance on that one.

I had xfs and btrfs working with an earlier iteration of the patches,
but now that we're converting a fs at a time, it's a little more work to
get there. It shouldn't be too hard to do though. I'll probably re-post
in a few days, and will try to take a stab at XFS and btrfs conversion
too.

> > 
> > With the new _require rule, test should _notrun on XFS and btrfs then.
> 
> Frankly I personally prefer that upstream XFS fails until someone fixes it. :)
> (But that's just my opinion.)
> 
> That said, I'm not 100% sure what's required of XFS to play nicely with
> this new mechanism -- glancing at the ext* patches it looks like we'd
> need to set a fs flag and possibly change some or all of the "write
> cached dirty buffers out to disk" calls to their _since variants?

Yeah, that's pretty much the size of it.

In fact, the latter part (changing to the _since variants) is somewhat
optional. We can have the errseq_t based tracking coexist with the
AS_EIO/AS_ENOSPC flags. It's weird but I don't see a real downside to
preserving them until we've got more of this converted over.

In the latest branch I'm working on, I'm breaking up those changes into
different patches so it should be a little clearer for other fs
maintainers to see what I'm doing and why. Stay tuned...

> Metadata writeback errors are handled by retrying writes and/or shutting
> down the fs, so I think the f_md_wb_error case is already covered.

Thanks. I think we do need f_md_wb_err for ext2/4 though, IIUC?

> 
> That said, I agree that it's useful to detect that the kernel simply
> lacks any of the new wb error reporting at all, so therefore we can skip
> the tests.
> 

Suggestions on ways to implement such a check would be welcome. Maybe a
file in /sys or in debugfs?

-- 
Jeff Layton 


Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Jeff Layton
On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > I'm working on a set of kernel patches to change how writeback errors
> > are handled and reported in the kernel. Instead of reporting a
> > writeback error to only the first fsync caller on the file, I aim
> > to make the kernel report them once on every file description.
> > 
> > This patch adds a test for the new behavior. Basically, open many fds
> > to the same file, turn on dm_error, write to each of the fds, and then
> > fsync them all to ensure that they all get an error back.
> > 
> > To do that, I'm adding a new tools/dmerror script that the C program
> > can use to load the error table. For now, that's all it can do, but
> > we can fill it out with other commands as necessary.
> > 
> > Signed-off-by: Jeff Layton 
> 
> Thanks for the new tests! And sorry for the late review..
> 
> It's testing a new behavior on error reporting on writeback, I'm not
> sure if we can call it a new feature or it fixed a bug? But it's more
> like a behavior change, I'm not sure how to categorize it.
> 
> Because if it's testing a new feature, we usually let test do proper
> detection of current test environment (based on actual behavior not
> kernel version) and _notrun on filesystems that don't have this feature
> yet, instead of failing the test; if it's testing a bug fix, we could
> leave the test fail on unfixed filesystems, this also serves as a
> reminder that there's bug to fix.
> 

Thanks for the review! I'm not sure how to categorize this either. Since
the plan is to convert all the filesystems piecemeal, maybe we should
just consider it a new feature.

> I pulled your test kernel tree, and test passed on EXT4 but failed on
> other local filesystems (XFS, btrfs). I assume that's expected.
> 
> Besides this kinda high-level question, some minor comments inline.
> 

Yes, ext4 should pass on my latest kernel tree, but everything else
should fail. 

> > ---
> >  common/dmerror |  13 ++--
> >  doc/auxiliary-programs.txt |   8 +++
> >  src/Makefile   |   2 +-
> >  src/fsync-err.c            | 161 +
> 
> New binary needs an entry in .gitignore file.
> 

OK, thanks. Will fix.

> >  tests/generic/999  |  76 +
> >  tests/generic/999.out  |   3 +
> >  tests/generic/group|   1 +
> >  tools/dmerror  |  44 +
> 
> This file is used by the test, then it should be in src/ directory and
> be installed along with other executable files on "make install".
> Because files under tools/ are not installed. Most people will run tests
> in the root dir of xfstests and this is not a problem, but there're
> still cases people do "make && make install" and run fstests from
> /var/lib/xfstests (default installation target).
> 

Ok, no problem. I'll move it. I wasn't sure here since dmerror is a
shell script, and most of the stuff in src/ is stuff that needs to be
built.
 
> >  8 files changed, 302 insertions(+), 6 deletions(-)
> >  create mode 100644 src/fsync-err.c
> >  create mode 100755 tests/generic/999
> >  create mode 100644 tests/generic/999.out
> >  create mode 100755 tools/dmerror
> > 
> > diff --git a/common/dmerror b/common/dmerror
> > index d46c5d0b7266..238baa213b1f 100644
> > --- a/common/dmerror
> > +++ b/common/dmerror
> > @@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
> > _notrun "Cannot run tests with DAX on dmerror devices"
> >  fi
> >  
> > -_dmerror_init()
> > +_dmerror_setup()
> >  {
> > local dm_backing_dev=$SCRATCH_DEV
> >  
> > -   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> > -
> > local blk_dev_size=`blockdev --getsz $dm_backing_dev`
> >  
> > DMERROR_DEV='/dev/mapper/error-test'
> >  
> > DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
> >  
> > +   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
> > +}
> > +
> > +_dmerror_init()
> > +{
> > +   _dmerror_setup
> > +   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> > $DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
> > _fatal "failed to create dm linear device"
> > -
> > -   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
> >  }
> >  
> >  _dmerror_mount()
>

Re: [PATCH v5 08/17] dax: set errors in mapping when writeback fails

2017-06-05 Thread Jeff Layton
On Mon, 2017-06-05 at 19:01 -0600, Ross Zwisler wrote:
> On Wed, May 31, 2017 at 08:45:31AM -0400, Jeff Layton wrote:
> > Jan's description for this patch is much better than mine, so I'm
> > quoting it verbatim here:
> > 
> > -8<-
> > DAX currently doesn't set errors in the mapping when cache flushing
> > fails in dax_writeback_mapping_range(). Since this function can get
> > called only from fsync(2) or sync(2), this is actually as good as it can
> > currently get since we correctly propagate the error up from
> > dax_writeback_mapping_range() to filemap_fdatawrite()
> > 
> > However, in the future better writeback error handling will enable us to
> > properly report these errors on fsync(2) even if there are multiple file
> > descriptors open against the file or if sync(2) gets called before
> > fsync(2). So convert DAX to using standard error reporting through the
> > mapping.
> > -8<-
> > 
> > For now, only do this when the FS_WB_ERRSEQ flag is set. The
> > AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when
> > writeback initiation fails, only when we discover an error after waiting
> > on writeback to complete, so we only want to do this with errseq_t based
> > error handling to prevent seeing duplicate errors on fsync.
> > 
> > Signed-off-by: Jeff Layton 
> > Reviewed-by: Jan Kara 
> > Reviewed-by: Christoph Hellwig 
> > Reviewed-and-Tested-by: Ross Zwisler 
> 
> Re-tested this version of the series with some injected DAX errors, and it
> looks good.

Excellent! Thanks very much for helping test it.

-- 
Jeff Layton 


Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-06-02 Thread Jeff Layton
On Thu, 2017-06-01 at 23:25 -0600, Ross Zwisler wrote:
> On Wed, May 31, 2017 at 08:45:23AM -0400, Jeff Layton wrote:
> > v5: don't retrofit old API over the new infrastructure
> > add fstype flag to indicate how wb errors are tracked within that fs
> > add more function variants that take a errseq_t "since" value
> > add second errseq_t to struct file to track metadata wb errors
> > convert ext4 and ext2 to use the new APIs
> > 
> > v4: several more cleanup patches
> > documentation and kerneldoc comment updates
> > fix bugs in gfs2 patches
> > make sync_file_range use same error reporting semantics
> > bugfixes in buffer.c
> > convert nfs to new scheme (maybe bogus, can be dropped)
> > 
> > v3: wb_err_t -> errseq_t conversion
> > clean up places that re-set errors after calling filemap_* functions
> > 
> > v2: introduce wb_err_t, use atomics
> > 
> > This is v5 of the patchset to improve how we're tracking and reporting
> > errors that occur during pagecache writeback. The main difference in
> > this set from the last one is that I've stopped trying to retrofit the
> > old error tracking API on top of the new one. This is more work since
> > we'll have to touch each fs individually, but should be safer as the
> > "since" values used for checking errors will be more deliberate.
> > 
> > There are several situations where the kernel can "lose" errors that
> > occur during writeback, such that fsync will return success even
> > though it failed to write back some data previously. The basic idea
> > here is to have the kernel be more deliberate about the point from
> > which errors are checked to ensure that that doesn't happen.
> > 
> > An additional aim of this set is to change the behavior of fsync in
> > Linux to report writeback errors on all fds instead of just the first
> > one. This allows writers to reliably tell whether their data made it to
> > the backing device without having to coordinate fsync calls with other
> > writers.
> > 
> > To do this, we add a new typedef: errseq_t. This is a 32-bit value
> > that can store an error code, and a sequence number so we can tell
> > whether it has changed since we last sampled it. This allows us to
> > record errors in the address_space and then report those errors only
> > once per file description.
> > 
> > This set just alters block device files, ext4 and the legacy ext2
> > driver. If this general approach seems acceptable, then I'll start
> > converting other filesystems in follow-on patchsets. I'd also like
> > to get this into linux-next as soon as possible to ensure that we're
> > banging out any bugs that might be lurking here.
> > 
> > I also have a couple of xfstests for this as well that I'll re-post
> > soon.
> 
> Can you tell me a baseline that this applies cleanly to, or give me a link to
> a tree with these patches already applied?  I've tried applying it to v4.11,
> linux/master and mmots/master, and so far nothing has worked.

It's basically on top of v4.12-rc3, but it may not apply cleanly
without the pile of individual patches that I sent recently.

It may be best to just pull down the "wberr" branch from my tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git

I was originally sending the prep patches as part of this series, but
maintainers weren't picking them up, so I moved to sending them
individually and then sending this pile as its own set.

Many thanks for giving this a look and testing it!
-- 
Jeff Layton 


Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton
On Wed, 2017-05-31 at 14:37 -0700, Andrew Morton wrote:
> On Wed, 31 May 2017 17:31:49 -0400 Jeff Layton  wrote:
> 
> > On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> > > On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton  wrote:
> > > 
> > > > This is v5 of the patchset to improve how we're tracking and reporting
> > > > errors that occur during pagecache writeback.
> > > 
> > > I'm curious to know how you've been testing this?  Is that testing
> > > strong enough for us to be confident that all nature of I/O errors
> > > will be reported to userspace?
> > > 
> > 
> > That's a tall order. This is a difficult thing to test as these sorts of
> > errors are pretty rare by nature.
> > 
> > I have an xfstest that I posted just after this set that demonstrates
> > that it works correctly, at least on ext2/3/4 when run by the ext4
> > driver (ext2 legacy driver reports too many errors currently). I had
> > btrfs and xfs working on that test too in an earlier incarnation of this
> > set, so I think we can fix this in them as well without too much
> > difficulty.
> > 
> > I'm happy to run other tests if someone wants to suggest them.
> > 
> > Now, all that said, I don't think this will make things any worse than
> > they are today as far as reporting errors properly to userland goes.
> > It's rather easy for an incidental synchronous writeback request from an
> > internal caller to clear the AS_* flags today. This will at least ensure
> > that we're reporting errors since a well-defined point in time when you
> > call fsync.
> 
> Were you using error injection of some form?  If so, how was that all
> set up?
> 

Yes, it uses dm-error for fault injection.

The test basically does:

1) set up a dm-error device in a working configuration

2) build a scratch filesystem on it, with the log on a different device
in some fashion so metadata writeback will still succeed.

3) open the same file several times

4) flip dm-error device to non-working mode

5) write to each fd

6) fsync each fd

...do you get back an error on each fsync?

It then does a bit more to make sure they're cleared afterward as you'd
expect. That works for most block device based filesystems. I also have
a second xfstest that opens a block device and does the same basic
thing. That also works correctly with this patch series.
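
Steps 3 through 6 above can be sketched in userspace roughly as follows. This is not the src/fsync-err.c from the xfstests patch (which is not reproduced in this message); the function name, the MODEL_MAX_FDS limit and the break_device hook are all hypothetical. The hook stands in for step 4, where the real test loads the dm-error table; passing NULL leaves the device healthy.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define MODEL_MAX_FDS 16

/*
 * Open `path` nfds times, optionally break the device, write to each fd,
 * then fsync each fd. Returns the number of fsync() calls that failed,
 * or -1 if opening or writing failed before any fsync was attempted.
 */
int model_count_fsync_errors(const char *path, int nfds,
			     void (*break_device)(void))
{
	int fds[MODEL_MAX_FDS];
	int i, rc = 0, failed = 0;

	assert(path != NULL && nfds > 0 && nfds <= MODEL_MAX_FDS);

	/* Step 3: several open file descriptions on the same file. */
	for (i = 0; i < nfds; i++) {
		fds[i] = open(path, O_WRONLY | O_CREAT, 0644);
		if (fds[i] < 0) {
			while (i-- > 0)
				close(fds[i]);
			return -1;
		}
	}

	/* Step 4: flip the device to its failing mode (hypothetical hook). */
	if (break_device)
		break_device();

	/* Step 5: dirty the pagecache through every fd. */
	for (i = 0; i < nfds && rc == 0; i++)
		if (write(fds[i], "x", 1) != 1)
			rc = -1;

	/* Step 6: fsync every fd and count the failures. */
	for (i = 0; i < nfds; i++) {
		if (rc == 0 && fsync(fds[i]) != 0)
			failed++;
		close(fds[i]);
	}
	return rc == 0 ? failed : -1;
}
```

On a healthy device the count should be zero. With the device failing writes, the behaviour this series aims for is a count equal to nfds, whereas reporting only to the first fsync caller would give a count of one.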

I still need to come up with a way to simulate errors on other fs'
though. We may need to plumb in some kernel-level fault injection on
some fs' to do that correctly. Suggestions welcome there.

With this series though, the idea is to convert one filesystem at a
time, so I think that should help mitigate some of the risk.

-- 
Jeff Layton 


Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton
On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton  wrote:
> 
> > This is v5 of the patchset to improve how we're tracking and reporting
> > errors that occur during pagecache writeback.
> 
> I'm curious to know how you've been testing this?  Is that testing
> strong enough for us to be confident that all nature of I/O errors
> will be reported to userspace?
> 

That's a tall order. This is a difficult thing to test as these sorts of
errors are pretty rare by nature.

I have an xfstest that I posted just after this set that demonstrates
that it works correctly, at least on ext2/3/4 when run by the ext4
driver (ext2 legacy driver reports too many errors currently). I had
btrfs and xfs working on that test too in an earlier incarnation of this
set, so I think we can fix this in them as well without too much
difficulty.

I'm happy to run other tests if someone wants to suggest them.

Now, all that said, I don't think this will make things any worse than
they are today as far as reporting errors properly to userland goes.
It's rather easy for an incidental synchronous writeback request from an
internal caller to clear the AS_* flags today. This will at least ensure
that we're reporting errors since a well-defined point in time when you
call fsync.
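
To make the "incidental caller clears the flag" problem concrete, here is a tiny userspace model of a test-and-clear error flag. It is a sketch only: struct model_mapping and the model_* functions are hypothetical stand-ins for the AS_EIO bit in the address_space, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mapping AS_EIO flag. */
struct model_mapping {
	bool as_eio;
};

/* Writeback failed: latch the error on the mapping. */
void model_record_eio(struct model_mapping *m)
{
	assert(m != NULL);
	m->as_eio = true;
}

/* Test-and-clear: whoever checks first consumes the error. */
int model_check_errors(struct model_mapping *m)
{
	assert(m != NULL);
	if (m->as_eio) {
		m->as_eio = false;
		return -EIO;
	}
	return 0;
}
```

If an internal caller runs the check first, it gets -EIO and the flag is gone, so a later fsync on the same mapping gets 0 even though data was lost. Checking against a per-file "since" value avoids that, because one caller's check no longer consumes the error for everyone else.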
-- 
Jeff Layton 


Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-05-31 Thread Jeff Layton
On Wed, 2017-05-31 at 11:59 -0700, Eduardo Valentin wrote:
> Hello,
> 
> On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > I'm working on a set of kernel patches to change how writeback errors
> > are handled and reported in the kernel. Instead of reporting a
> > writeback error to only the first fsync caller on the file, I aim
> > to make the kernel report them once on every file description.
> > 
> > This patch adds a test for the new behavior. Basically, open many fds
> > to the same file, turn on dm_error, write to each of the fds, and then
> > fsync them all to ensure that they all get an error back.
> > 
> > To do that, I'm adding a new tools/dmerror script that the C program
> > can use to load the error table. For now, that's all it can do, but
> > we can fill it out with other commands as necessary.
> > 
> > Signed-off-by: Jeff Layton 
> > ---
> >  common/dmerror |  13 ++--
> >  doc/auxiliary-programs.txt |   8 +++
> >  src/Makefile   |   2 +-
> >  src/fsync-err.c            | 161 +
> >  tests/generic/999  |  76 +
> >  tests/generic/999.out  |   3 +
> >  tests/generic/group|   1 +
> >  tools/dmerror  |  44 +
> >  8 files changed, 302 insertions(+), 6 deletions(-)
> >  create mode 100644 src/fsync-err.c
> >  create mode 100755 tests/generic/999
> >  create mode 100644 tests/generic/999.out
> >  create mode 100755 tools/dmerror
> > 
> > diff --git a/common/dmerror b/common/dmerror
> > index d46c5d0b7266..238baa213b1f 100644
> > --- a/common/dmerror
> > +++ b/common/dmerror
> > @@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
> > _notrun "Cannot run tests with DAX on dmerror devices"
> >  fi
> >  
> > -_dmerror_init()
> > +_dmerror_setup()
> >  {
> > local dm_backing_dev=$SCRATCH_DEV
> >  
> > -   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> > -
> > local blk_dev_size=`blockdev --getsz $dm_backing_dev`
> >  
> > DMERROR_DEV='/dev/mapper/error-test'
> >  
> > DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
> >  
> > +   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
> > +}
> > +
> > +_dmerror_init()
> > +{
> > +   _dmerror_setup
> > +   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> > $DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
> > _fatal "failed to create dm linear device"
> > -
> > -   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
> >  }
> >  
> >  _dmerror_mount()
> > diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
> > index 21ef118596b6..191ac0596511 100644
> > --- a/doc/auxiliary-programs.txt
> > +++ b/doc/auxiliary-programs.txt
> > @@ -16,6 +16,7 @@ note the dependency with:
> >  Contents:
> >  
> >   - af_unix -- Create an AF_UNIX socket
> > + - fsync-err   -- tests fsync error reporting after failed writeback
> >   - open_by_handle  -- open_by_handle_at syscall exercise
> >   - stat_test   -- statx syscall exercise
> >   - t_dir_type  -- print directory entries and their file type
> > @@ -30,6 +31,13 @@ af_unix
> >  
> > The af_unix program creates an AF_UNIX socket at the given location.
> >  
> > +fsync-err
> > +   Specialized program for testing how the kernel reports errors that
> > +   occur during writeback. Works in conjunction with the dmerror script
> > +   in tools/ to write data to a device, and then force it to fail
> > +   writeback and test that errors are reported during fsync and cleared
> > +   afterward.
> > +
> >  open_by_handle
> >  
> > The open_by_handle program exercises the open_by_handle_at() system
> > diff --git a/src/Makefile b/src/Makefile
> > index 4ec01975f8f7..b79c4d84d31b 100644
> > --- a/src/Makefile
> > +++ b/src/Makefile
> > @@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
> > multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
> > t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
> > holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
> > -   t_mmap_cow_race
> > +   t_mmap_cow_race fsync-err
> >  
> >  LINUX_TARGETS = xfsctl bstat t_mta

[xfstests PATCH v3 0/5] add a test for reporting writeback errors across all fds on fsync

2017-05-31 Thread Jeff Layton
This patchset is a companion to the Linux kernel patch series I recently
posted with the cover letter:

[PATCH v5 00/17] fs: introduce new writeback error reporting and convert 
ext2 and ext4 to use it

That patchset adds a new userland-visible change to report errors on
all open file descriptions when there is an error on fsync, not just
the first one to race in.

Note that this set contains a patch to emulate $SCRATCH_LOGDEV on btrfs,
but the kernel patches for that are not quite ready yet. The test did
pass on btrfs in an earlier incarnation of the set, however.

Jeff Layton (5):
  generic: add a writeback error handling test
  ext4: allow ext4 to use $SCRATCH_LOGDEV
  generic: test writeback error handling on dmerror devices
  ext3: allow it to put journal on a separate device when doing
scratch_mkfs
  btrfs: allow it to use $SCRATCH_LOGDEV

 common/dmerror |  13 ++--
 common/rc  |  16 -
 doc/auxiliary-programs.txt |   8 +++
 src/Makefile   |   2 +-
 src/fsync-err.c| 161 +
 tests/generic/998  |  64 ++
 tests/generic/998.out  |   2 +
 tests/generic/999  |  76 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   2 +
 tools/dmerror  |  44 +
 11 files changed, 384 insertions(+), 7 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[xfstests PATCH v3 2/5] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-05-31 Thread Jeff Layton
The writeback error handling test requires that you put the journal on a
separate device. This allows us to use dmerror to simulate data
writeback failure, without affecting the journal.

xfs already has infrastructure for this (à la $SCRATCH_LOGDEV), so wire
up the ext4 code so that it can do the same thing when _scratch_mkfs is
called.

Signed-off-by: Jeff Layton 
Reviewed-by: Darrick J. Wong 
---
 common/rc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/rc b/common/rc
index 743df427c047..391d36f373cd 100644
--- a/common/rc
+++ b/common/rc
@@ -676,6 +676,9 @@ _scratch_mkfs_ext4()
local tmp=`mktemp`
local mkfs_status
 
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
 
_scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 1>$tmp.mkfsstd
mkfs_status=$?
-- 
2.9.4



[xfstests PATCH v3 4/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs

2017-05-31 Thread Jeff Layton
Signed-off-by: Jeff Layton 
---
 common/rc | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index 391d36f373cd..83765aacfb06 100644
--- a/common/rc
+++ b/common/rc
@@ -832,7 +832,16 @@ _scratch_mkfs()
mkfs_cmd="$MKFS_BTRFS_PROG"
mkfs_filter="cat"
;;
-   ext2|ext3)
+   ext3)
+   mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
+   mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
+
+   # put journal on separate device?
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
+   ;;
+   ext2)
mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
;;
-- 
2.9.4



[xfstests PATCH v3 3/5] generic: test writeback error handling on dmerror devices

2017-05-31 Thread Jeff Layton
Ensure that we get an error back on all fds when a block device is
open by multiple writers and writeback fails.

Signed-off-by: Jeff Layton 
---
 tests/generic/998 | 64 +++
 tests/generic/998.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 67 insertions(+)
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out

diff --git a/tests/generic/998 b/tests/generic/998
new file mode 100755
index ..fbadb47507c2
--- /dev/null
+++ b/tests/generic/998
@@ -0,0 +1,64 @@
+#! /bin/bash
+# FS QA Test No. 998
+#
+# Test writeback error handling when writing to block devices via pagecache.
+# See src/fsync-err.c for details of what the test actually does.
+#
+#---
+# Copyright (c) 2017, Jeff Layton 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1    # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_require_scratch
+_require_logdev
+_require_dm_target error
+_require_test_program fsync-err
+
+rm -f $seqres.full
+
+$XFS_IO_PROG -d -c "pwrite -S 0x7c -b 1048576 0 $((64 * 1048576))" $SCRATCH_DEV >> $seqres.full
+_dmerror_init
+
+$here/src/fsync-err $DMERROR_DEV
+
+# success, all done
+_dmerror_load_working_table
+_dmerror_cleanup
+_scratch_mkfs > $seqres.full 2>&1
+status=0
+exit
diff --git a/tests/generic/998.out b/tests/generic/998.out
new file mode 100644
index ..658c438820e2
--- /dev/null
+++ b/tests/generic/998.out
@@ -0,0 +1,2 @@
+QA output created by 998
+Test passed!
diff --git a/tests/generic/group b/tests/generic/group
index 39f7b14657f1..9fc384363ca7 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -440,4 +440,5 @@
 435 auto encrypt
 436 auto quick rw
 437 auto quick
+998 auto quick
 999 auto quick
-- 
2.9.4



[xfstests PATCH v3 5/5] btrfs: allow it to use $SCRATCH_LOGDEV

2017-05-31 Thread Jeff Layton
With btrfs, we can't really put the log on a separate device. What we
can do however is mirror the metadata across two devices and make the
data striped across all devices. When we turn on dmerror then the
metadata can fall back to using the other mirror while the data errors
out.

Note that the current incarnation of btrfs has a fixed 64k stripe
width. If that ever changes or becomes settable, we may need to adjust
the amount of data that the test program writes.

Signed-off-by: Jeff Layton 
---
 common/rc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/rc b/common/rc
index 83765aacfb06..078270451b53 100644
--- a/common/rc
+++ b/common/rc
@@ -830,6 +830,8 @@ _scratch_mkfs()
;;
btrfs)
mkfs_cmd="$MKFS_BTRFS_PROG"
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   mkfs_cmd="$mkfs_cmd -d raid0 -m raid1 $SCRATCH_LOGDEV"
mkfs_filter="cat"
;;
ext3)
-- 
2.9.4



[xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-05-31 Thread Jeff Layton
I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, I aim
to make the kernel report them once on every file description.

This patch adds a test for the new behavior. Basically, open many fds
to the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back.
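
For illustration only, here is a rough userspace sketch of the
open/write/fsync pattern described above. It is not the fsync-err.c from
this patch: the sketch_* names and the temp file path are made up for the
example, and it does not load a dm error table, so on a healthy file every
fsync is expected to succeed.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical constants, mirroring the idea of NUM_FDS and BUFSIZE */
#define SKETCH_NUM_FDS 10
#define SKETCH_BUFSIZE (65 * 1024)

/*
 * Open several descriptors on an existing file, write one buffer through
 * each of them, then fsync each one. Returns the number of fsync calls
 * that failed, or -1 if opening or writing failed.
 */
static int sketch_count_fsync_failures(const char *path)
{
	static char buf[SKETCH_BUFSIZE];
	int fd[SKETCH_NUM_FDS];
	int i, opened = 0, failures = 0;

	assert(path != NULL);
	memset(buf, 0x7c, sizeof(buf));

	for (i = 0; i < SKETCH_NUM_FDS; i++) {
		fd[i] = open(path, O_WRONLY);
		if (fd[i] < 0) {
			failures = -1;
			goto out;
		}
		opened++;
	}

	for (i = 0; i < SKETCH_NUM_FDS; i++) {
		if (pwrite(fd[i], buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
			failures = -1;
			goto out;
		}
	}

	/* A real test would make the backing device start failing here. */

	for (i = 0; i < SKETCH_NUM_FDS; i++) {
		if (fsync(fd[i]) < 0) {
			fprintf(stderr, "fsync on fd[%d] failed: %s\n",
				i, strerror(errno));
			failures++;
		}
	}
out:
	for (i = 0; i < opened; i++)
		close(fd[i]);
	return failures;
}

/* Run the sketch against a scratch file under /tmp (assumed writable). */
static int sketch_run_on_temp_file(void)
{
	char path[] = "/tmp/fsync-sketch-XXXXXX";
	int fd = mkstemp(path);
	int failures;

	if (fd < 0)
		return -1;
	close(fd);
	failures = sketch_count_fsync_failures(path);
	unlink(path);
	return failures;
}
```

With the new kernel behavior and a failing device, the count returned here
would be expected to equal the number of open fds rather than 1.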

To do that, I'm adding a new tools/dmerror script that the C program
can use to load the error table. For now, that's all it can do, but
we can fill it out with other commands as necessary.

Signed-off-by: Jeff Layton 
---
 common/dmerror |  13 ++--
 doc/auxiliary-programs.txt |   8 +++
 src/Makefile   |   2 +-
 src/fsync-err.c| 161 +
 tests/generic/999  |  76 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   1 +
 tools/dmerror  |  44 +
 8 files changed, 302 insertions(+), 6 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
index 21ef118596b6..191ac0596511 100644
--- a/doc/auxiliary-programs.txt
+++ b/doc/auxiliary-programs.txt
@@ -16,6 +16,7 @@ note the dependency with:
 Contents:
 
  - af_unix -- Create an AF_UNIX socket
+ - fsync-err   -- tests fsync error reporting after failed writeback
  - open_by_handle  -- open_by_handle_at syscall exercise
  - stat_test   -- statx syscall exercise
  - t_dir_type  -- print directory entries and their file type
@@ -30,6 +31,13 @@ af_unix
 
The af_unix program creates an AF_UNIX socket at the given location.
 
+fsync-err
+   Specialized program for testing how the kernel reports errors that
+   occur during writeback. Works in conjunction with the dmerror script
+   in tools/ to write data to a device, and then force it to fail
+   writeback and test that errors are reported during fsync and cleared
+   afterward.
+
 open_by_handle
 
The open_by_handle program exercises the open_by_handle_at() system
diff --git a/src/Makefile b/src/Makefile
index 4ec01975f8f7..b79c4d84d31b 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
-   t_mmap_cow_race
+   t_mmap_cow_race fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/fsync-err.c b/src/fsync-err.c
new file mode 100644
index ..cbeb37fb1790
--- /dev/null
+++ b/src/fsync-err.c
@@ -0,0 +1,161 @@
+/*
+ * fsync-err.c: test whether writeback errors are reported to all open fds
+ *         and properly cleared as expected after being seen once on each
+ *
+ * Copyright (c) 2017: Jeff Layton 
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * btrfs has a fixed stripewidth of 64k, so we need to write enough data to
+ * ensure that we hit both stripes.
+ *
+ * FIXME: have the test script pass in the length?
+ */
+#define BUFSIZE (65 * 1024)
+
+/* FIXME: should this be tunable? */
+#define NUM_FDS 10
+
+static void usage() {
+   fprintf(stderr, "Usage: fsync-err \n");
+}
+
+int main(int argc, char **argv)
+{
+   int fd[NUM_FDS], ret, i;
+   char *fname, *buf;
+
+   if (argc < 1) {
+   usage();
+   ret

[PATCH v5 01/17] lib: add errseq_t type and infrastructure for handling it

2017-05-31 Thread Jeff Layton
An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.
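
As a quick check of where the "1 in 512k" figure comes from, here is a
small standalone sketch of the bit layout. It assumes MAX_ERRNO is 4095
(as in include/linux/err.h) and a 32-bit value; the sketch_* helpers are
made up for the example and only stand in for ilog2() and the ERRSEQ_*
macros.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to match MAX_ERRNO in include/linux/err.h */
#define SKETCH_MAX_ERRNO 4095u

/* Integer log2 for a non-zero input; stands in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint32_t v)
{
	unsigned int shift = 0;

	assert(v != 0);
	while (v >>= 1)
		shift++;
	return shift;
}

/* Number of low bits reserved for the error code */
static unsigned int sketch_errseq_shift(void)
{
	return sketch_ilog2(SKETCH_MAX_ERRNO + 1);
}

/* The "seen" flag sits just above the error code bits */
static uint32_t sketch_errseq_seen(void)
{
	return UINT32_C(1) << sketch_errseq_shift();
}

/* The lowest bit of the counter sits just above the "seen" flag */
static uint32_t sketch_errseq_ctr_inc(void)
{
	return UINT32_C(1) << (sketch_errseq_shift() + 1);
}

/* Number of distinct counter values left in a 32-bit word */
static uint32_t sketch_counter_values(void)
{
	unsigned int counter_bits = 32 - (sketch_errseq_shift() + 1);

	return UINT32_C(1) << counter_bits;
}

static void sketch_print_layout(void)
{
	printf("error bits: %u, seen flag: 0x%x, counter step: 0x%x, "
	       "counter values: %u\n",
	       sketch_errseq_shift(),
	       (unsigned int)sketch_errseq_seen(),
	       (unsigned int)sketch_errseq_ctr_inc(),
	       (unsigned int)sketch_counter_values());
}
```

Twelve bits go to the error code and one to the flag, which leaves 19
counter bits, i.e. 524288 (512k) values before the counter wraps.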

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.

Signed-off-by: Jeff Layton 
Reviewed-by: NeilBrown 
---
 include/linux/errseq.h |  19 +
 lib/Makefile   |   2 +-
 lib/errseq.c   | 200 +
 3 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index ..0d2555f310cd
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+
+/* See lib/errseq.c for more info */
+
+typedef u32 errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+   /* Optimize for the common case of no error */
+   if (unlikely(err))
+   __errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 0166fbc0fa81..519782d9ca3f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o refcount.o usercopy.o
+once.o refcount.o usercopy.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index ..d129c0611c1f
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,200 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since a previous
+ * point where it was sampled.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value. That value
+ * can later be used to tell whether any new errors have occurred since that
+ * sampling was done.
+ *
+ * Note that there is a risk of collisions if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  An errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT   ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN    (1 << ERRSEQ_SHIFT)
+
+/* The lowest bit of the counter */
+#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set an errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Most callers will want to use the errseq_set

[PATCH v5 02/17] fs: new infrastructure for writeback error handling and reporting

2017-05-31 Thread Jeff Layton
Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).
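
To make the intended semantics concrete, here is a simplified,
single-threaded userspace model of the scheme. It is only a sketch: the
model_* names are invented for the example, it uses plain loads and
stores where the kernel code uses atomics, and it hardcodes a 12-bit
error field.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t model_errseq_t;

/* Assumed layout: 12 error bits, one "seen" flag, counter above that */
#define MODEL_MAX_ERRNO 4095u
#define MODEL_SEEN      (UINT32_C(1) << 12)
#define MODEL_CTR_INC   (UINT32_C(1) << 13)

struct model_mapping {
	model_errseq_t wb_err;		/* latest writeback error + counter */
};

struct model_file {
	struct model_mapping *mapping;
	model_errseq_t f_wb_err;	/* per-file "since" cursor */
};

/* Record a writeback error; bump the counter only if someone saw the old one */
static void model_set_wb_err(struct model_mapping *m, int err)
{
	model_errseq_t old = m->wb_err;
	model_errseq_t next;

	if (err >= 0 || err < -(int)MODEL_MAX_ERRNO)
		return;
	next = (old & ~(MODEL_MAX_ERRNO | MODEL_SEEN)) | (model_errseq_t)(-err);
	if (old & MODEL_SEEN)
		next += MODEL_CTR_INC;
	m->wb_err = next;
}

/* At open time, sample the mapping's value and mark it as seen */
static void model_open(struct model_file *f, struct model_mapping *m)
{
	f->mapping = m;
	if (m->wb_err != 0)
		m->wb_err |= MODEL_SEEN;
	f->f_wb_err = m->wb_err;
}

/* At fsync time, report an error once if the value changed since the cursor */
static int model_fsync(struct model_file *f)
{
	struct model_mapping *m = f->mapping;

	if (m->wb_err == f->f_wb_err)
		return 0;
	m->wb_err |= MODEL_SEEN;
	f->f_wb_err = m->wb_err;
	return -(int)(m->wb_err & MODEL_MAX_ERRNO);
}

/* Walk through the scenario described above; returns 0 if it behaves */
static int model_demo(void)
{
	struct model_mapping m = { 0 };
	struct model_file a, b, c;

	model_open(&a, &m);
	model_open(&b, &m);
	model_set_wb_err(&m, -EIO);

	/* Both pre-existing files see the error exactly once */
	assert(model_fsync(&a) == -EIO);
	assert(model_fsync(&a) == 0);
	assert(model_fsync(&b) == -EIO);
	assert(model_fsync(&b) == 0);

	/* A file opened after the error does not see it */
	model_open(&c, &m);
	assert(model_fsync(&c) == 0);

	/* A later error is visible to it */
	model_set_wb_err(&m, -ENOSPC);
	assert(model_fsync(&c) == -ENOSPC);
	return 0;
}
```

The point of the model is that no fsync caller "steals" the error from
another one: each open file carries its own cursor, so the order in which
they call fsync does not matter.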

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_errors infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
---
 drivers/dax/device.c |  1 +
 fs/block_dev.c   |  1 +
 fs/file_table.c  |  1 +
 fs/open.c|  3 +++
 include/linux/fs.h   | 53 
 mm/filemap.c | 38 +
 6 files changed, 97 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 006e657dfcb9..12943d19bfc4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
inode->i_mapping = __dax_inode->i_mapping;
inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..4d62fe771587 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct file * filp)
return -ENOMEM;
 
filp->f_mapping = bdev->bd_inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 
return blkdev_get(bdev, filp->f_mode, filp);
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 954d510b765a..72e861a35a7f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t mode,
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
+   file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
if ((mode & FMODE_READ) &&
 likely(fop->read || fop->read_iter))
mode |= FMODE_CAN_READ;
diff --git 

[PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton
v5: don't retrofit old API over the new infrastructure
add fstype flag to indicate how wb errors are tracked within that fs
add more function variants that take an errseq_t "since" value
add second errseq_t to struct file to track metadata wb errors
convert ext4 and ext2 to use the new APIs

v4: several more cleanup patches
documentation and kerneldoc comment updates
fix bugs in gfs2 patches
make sync_file_range use same error reporting semantics
bugfixes in buffer.c
convert nfs to new scheme (maybe bogus, can be dropped)

v3: wb_err_t -> errseq_t conversion
clean up places that re-set errors after calling filemap_* functions

v2: introduce wb_err_t, use atomics

This is v5 of the patchset to improve how we're tracking and reporting
errors that occur during pagecache writeback. The main difference in
this set from the last one is that I've stopped trying to retrofit the
old error tracking API on top of the new one. This is more work since
we'll have to touch each fs individually, but should be safer as the
"since" values used for checking errors will be more deliberate.

There are several situations where the kernel can "lose" errors that
occur during writeback, such that fsync will return success even
though it failed to write back some data previously. The basic idea
here is to have the kernel be more deliberate about the point from
which errors are checked to ensure that that doesn't happen.

An additional aim of this set is to change the behavior of fsync in
Linux to report writeback errors on all fds instead of just the first
one. This allows writers to reliably tell whether their data made it to
the backing device without having to coordinate fsync calls with other
writers.

To do this, we add a new typedef: errseq_t. This is a 32-bit value
that can store an error code, and a sequence number so we can tell
whether it has changed since we last sampled it. This allows us to
record errors in the address_space and then report those errors only
once per file description.

This set just alters block device files, ext4 and the legacy ext2
driver. If this general approach seems acceptable, then I'll start
converting other filesystems in follow-on patchsets. I'd also like
to get this into linux-next as soon as possible to ensure that we're
banging out any bugs that might be lurking here.

I also have a couple of xfstests for this as well that I'll re-post
soon.

Jeff Layton (17):
  lib: add errseq_t type and infrastructure for handling it
  fs: new infrastructure for writeback error handling and reporting
  mm: tracepoints for writeback error events
  fs: add a new fstype flag to indicate how writeback errors are tracked
  Documentation: flesh out the section in vfs.txt on storing and
reporting writeback errors
  fs: adapt sync_file_range to new reporting infrastructure
  mm: add filemap_fdatawait_range_since and
filemap_write_and_wait_range_since
  dax: set errors in mapping when writeback fails
  block: convert to errseq_t based writeback error tracking
  block: add sync_blockdev_since and sync_filesystem_since
  fs: add f_md_wb_err field to struct file for tracking metadata errors
  fs: allow __generic_file_fsync to support both flavors of error
reporting
  jbd2: conditionally handle errors using errseq_t based on FS_WB_ERRSEQ
flag
  ext4: convert to errseq_t based error tracking
  fs: add a write_one_page_since
  ext2: convert to errseq_t based writeback error tracking
  fs: convert ext2 to use write_one_page_since

 Documentation/filesystems/vfs.txt |  50 -
 drivers/dax/device.c  |   1 +
 fs/block_dev.c|  29 +-
 fs/dax.c  |  18 +++-
 fs/ext2/dir.c |  25 +++--
 fs/ext2/file.c|  29 --
 fs/ext2/super.c   |   2 +-
 fs/ext4/dir.c |   8 +-
 fs/ext4/ext4.h|   8 +-
 fs/ext4/extents.c |  24 +++--
 fs/ext4/file.c|   5 +-
 fs/ext4/fsync.c   |  23 -
 fs/ext4/inode.c   |  19 ++--
 fs/ext4/ioctl.c   |   9 +-
 fs/ext4/super.c   |   9 +-
 fs/file_table.c   |   1 +
 fs/internal.h |   8 ++
 fs/jbd2/commit.c  |  29 --
 fs/jbd2/recovery.c|   5 +-
 fs/jbd2/transaction.c |   1 +
 fs/libfs.c|  26 +++--
 fs/open.c |   3 +
 fs/sync.c |  62 +++-
 include/linux/errseq.h|  19 
 include/linux/fs.h|  82 ++-
 include/linux/jbd2.h  |   3 +
 include/linux/mm.h|   2 +
 include/linux/pagemap.h   |  32 --
 include/trace/events/filemap.h|  52 ++
 lib/M

[PATCH v5 04/17] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-05-31 Thread Jeff Layton
Now that we have new infrastructure for handling writeback errors using
errseq_t, we need to convert the existing code to use it. We could
attempt to retrofit the old interfaces on top of the new, but there is
a conceptual disconnect here in the case of internal callers that
invoke filemap_fdatawait and the like.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can report the
latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

This means that we need to touch each filesystem that calls
filemap_check_errors in some fashion and ensure that we establish sane
"since" values for those callers. But...some code is shared between
filesystems and needs to be able to handle both error tracking schemes.

Add a new FS_WB_ERRSEQ flag to the fstype. When mapping_set_error is
called, set mapping->wb_err if it's set, along with setting the
"legacy" AS_EIO/AS_ENOSPC flags. When calling filemap_report_wb_err,
always clear the legacy flags out as well.

This should allow subsystems to use the new errseq_t based error
reporting while simultaneously allowing the traditional semantics of
AS_EIO/AS_ENOSPC flags.
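
A rough userspace sketch of that dual recording follows. The dual_* names
are invented for the example, the errseq_t is reduced to "last error
recorded", and the per-file "since" cursor is left out entirely, so this
only illustrates which pieces of state get touched.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-ins for the legacy AS_EIO/AS_ENOSPC mapping flag bits */
#define DUAL_AS_EIO    (1ul << 0)
#define DUAL_AS_ENOSPC (1ul << 1)

struct dual_mapping {
	unsigned long flags;	/* legacy error flags */
	int wb_err;		/* simplified stand-in for the errseq_t */
	bool fs_wb_errseq;	/* stand-in for fs_flags & FS_WB_ERRSEQ */
};

/* Record an error: errseq-style if the fs opted in, legacy flags always */
static void dual_mapping_set_error(struct dual_mapping *m, int error)
{
	if (!error)
		return;

	if (m->fs_wb_errseq)
		m->wb_err = error;

	if (error == -ENOSPC)
		m->flags |= DUAL_AS_ENOSPC;
	else
		m->flags |= DUAL_AS_EIO;
}

/* Report the recorded error and clear out the legacy flags as well */
static int dual_report_wb_err(struct dual_mapping *m)
{
	int err = m->wb_err;

	m->flags &= ~(DUAL_AS_EIO | DUAL_AS_ENOSPC);
	return err;
}

/* Exercise a converted and an unconverted filesystem; returns 0 if sane */
static int dual_demo(void)
{
	struct dual_mapping converted = { 0, 0, true };
	struct dual_mapping legacy = { 0, 0, false };

	/* No error: nothing is recorded anywhere */
	dual_mapping_set_error(&converted, 0);
	assert(converted.flags == 0 && converted.wb_err == 0);

	/* Converted fs: both the new field and the legacy flag are set */
	dual_mapping_set_error(&converted, -ENOSPC);
	assert(converted.wb_err == -ENOSPC);
	assert(converted.flags == DUAL_AS_ENOSPC);
	assert(dual_report_wb_err(&converted) == -ENOSPC);
	assert(converted.flags == 0);

	/* Unconverted fs: only the legacy flag is set */
	dual_mapping_set_error(&legacy, -EIO);
	assert(legacy.wb_err == 0);
	assert(legacy.flags == DUAL_AS_EIO);
	return 0;
}
```

Recording in both places costs a little on the error path, but it keeps
shared code working for filesystems that have not been converted yet.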

Eventually, this flag should be removed once everything is converted
to errseq_t based error tracking.

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h  |  1 +
 include/linux/pagemap.h | 32 ++--
 mm/filemap.c|  7 +++
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 293cbc7f3520..2f3bcf4eb73b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2021,6 +2021,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA    2
 #define FS_HAS_SUBTYPE         4
 #define FS_USERNS_MOUNT        8       /* Can be mounted by userns root */
+#define FS_WB_ERRSEQ           16      /* errseq_t writeback err tracking */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 316a19f6b635..1dbc2dd6fdd2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -28,14 +28,34 @@ enum mapping_flags {
AS_NO_WRITEBACK_TAGS = 5,
 };
 
+/**
+ * mapping_set_error - record a writeback error in the address_space
+ * @mapping: the mapping in which an error should be set
+ * @error: the error to set in the mapping
+ *
+ * When writeback fails in some way, we must record that error so that
+ * userspace can be informed when fsync and the like are called.  We endeavor
+ * to report errors on any file that was open at the time of the error.  Some
+ * internal callers also need to know when writeback errors have occurred.
+ *
+ * When a writeback error occurs, most filesystems will want to call
+ * mapping_set_error to record the error in the mapping so that it can be
+ * reported when the application calls fsync(2).
+ */
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-   if (unlikely(error)) {
-   if (error == -ENOSPC)
-   set_bit(AS_ENOSPC, &mapping->flags);
-   else
-   set_bit(AS_EIO, &mapping->flags);
-   }
+   if (likely(!error))
+   return;
+
+   /* Record it in wb_err if fs is using errseq_t based error tracking */
+   if (mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ)
+   filemap_set_wb_err(mapping, error);
+
+   /* Unconditionally record it in flags for now, for legacy callers */
+   if (error == -ENOSPC)
+   set_bit(AS_ENOSPC, &mapping->flags);
+   else
+   set_bit(AS_EIO, &mapping->flags);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index c5e19ea0bf12..97dc28f853fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -580,6 +580,13 @@ int filemap_report_wb_err(struct file *file)
trace_filemap_report_wb_err(file, old);
spin_unlock(&file->f_lock);
}
+
+   /* Now clear the AS_* flags if any are set */
+   if (test_bit(AS_ENOSPC, &mapping->flags))
+   clear_bit(AS_ENOSPC, &mapping->flags);
+   if (test_bit(AS_EIO, &mapping->flags))
+   clear_bit(AS_EIO, &mapping->flags);
+
return err;
 }
 EXPORT_SYMBOL(filemap_report_wb_err);
-- 
2.9.4
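The double bookkeeping described in this patch can be modelled in userspace. The sketch below is not kernel code: `toy_mapping`, `toy_mapping_set_error` and `toy_clear_legacy_flags` are hypothetical names, the errseq_t is reduced to a plain int holding the last error, and the FS_WB_ERRSEQ test is reduced to a boolean. It only shows that an error lands in both places, and that clearing the legacy bits at report time leaves the errseq-style record alone.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
enum { TOY_AS_EIO = 1u << 0, TOY_AS_ENOSPC = 1u << 1 };

struct toy_mapping {
	unsigned int flags;	/* legacy AS_EIO/AS_ENOSPC-style bits */
	int wb_err;		/* stand-in for the errseq_t: last error only */
	bool fs_uses_errseq;	/* stand-in for the FS_WB_ERRSEQ check */
};

/* Mirrors the shape of the patched mapping_set_error(). */
static void toy_mapping_set_error(struct toy_mapping *m, int error)
{
	if (!error)
		return;

	/* Record it errseq-style only if the filesystem opted in. */
	if (m->fs_uses_errseq)
		m->wb_err = error;

	/* Unconditionally record it in the legacy flags as well. */
	if (error == -ENOSPC)
		m->flags |= TOY_AS_ENOSPC;
	else
		m->flags |= TOY_AS_EIO;
}

/* Mirrors the tail added to filemap_report_wb_err(): drop the legacy bits. */
static void toy_clear_legacy_flags(struct toy_mapping *m)
{
	m->flags &= ~(TOY_AS_EIO | TOY_AS_ENOSPC);
}
```

A caller that records -ENOSPC would see both `TOY_AS_ENOSPC` and `wb_err` set; after `toy_clear_legacy_flags()` only `wb_err` remains.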

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5 03/17] mm: tracepoints for writeback error events

2017-05-31 Thread Jeff Layton
To enable that, make __errseq_set return the value that it was set to
when we exit the loop. Take heed that that value is not suitable as a later
"since" value, as it will not have been marked seen.

Signed-off-by: Jeff Layton 
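To illustrate why the value returned by the set operation is a poor "since" value, here is a single-threaded userspace model. Everything in it is an assumption made for illustration: the `toy_` names are hypothetical, the bit layout (errno in the low 12 bits, a "seen" flag above it, a counter above that) is modelled on lib/errseq.c rather than copied from it, and the real code uses cmpxchg loops where this uses plain stores.

```c
#include <stdint.h>

typedef uint32_t toy_errseq_t;

/* Assumed layout: low 12 bits errno, bit 12 "seen", bits 13+ counter. */
#define TOY_MAX_ERRNO	4095u
#define TOY_SEEN	(1u << 12)
#define TOY_CTR_INC	(1u << 13)

/* Record an error; returns the value stored, which has SEEN cleared. */
static toy_errseq_t toy_errseq_set(toy_errseq_t *eseq, int err)
{
	toy_errseq_t old = *eseq;
	toy_errseq_t next = (old & ~(TOY_MAX_ERRNO | TOY_SEEN)) |
			    (toy_errseq_t)(-err);

	/* Only bump the counter if someone has observed the old value. */
	if (old & TOY_SEEN)
		next += TOY_CTR_INC;
	*eseq = next;
	return next;
}

/* Sample the current value, marking it as seen if an error is recorded. */
static toy_errseq_t toy_errseq_sample(toy_errseq_t *eseq)
{
	if (*eseq != 0)
		*eseq |= TOY_SEEN;
	return *eseq;
}

/* Has anything changed since the sampled value? */
static int toy_errseq_check(const toy_errseq_t *eseq, toy_errseq_t since)
{
	if (*eseq == since)
		return 0;
	return -(int)(*eseq & TOY_MAX_ERRNO);
}
```

In this model, once any sampler marks the value as seen, a check against the value returned by `toy_errseq_set()` reports an error even though nothing new was recorded, whereas a check against a properly sampled value returns 0.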
---
 include/linux/errseq.h         |  2 +-
 include/linux/fs.h             |  5 +++-
 include/trace/events/filemap.h | 55 ++
 lib/errseq.c                   | 20 ++-
 mm/filemap.c                   | 13 +-
 5 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index 0d2555f310cd..9e0d444ac88d 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -5,7 +5,7 @@
 
 typedef u32    errseq_t;
 
-void __errseq_set(errseq_t *eseq, int err);
+errseq_t __errseq_set(errseq_t *eseq, int err);
 static inline void errseq_set(errseq_t *eseq, int err)
 {
/* Optimize for the common case of no error */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 24178107379d..293cbc7f3520 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2528,6 +2528,7 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
@@ -2547,7 +2548,9 @@ extern int __must_check filemap_report_wb_err(struct file *file);
  */
 static inline void filemap_set_wb_err(struct address_space *mapping, int err)
 {
-   errseq_set(&mapping->wb_err, err);
+   /* Fastpath for common case of no error */
+   if (unlikely(err))
+   __filemap_set_wb_err(mapping, err);
 }
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 42febb6bc1d5..2af66920f267 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 
@@ -52,6 +53,60 @@ DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_add_to_page_cache,
TP_ARGS(page)
);
 
+TRACE_EVENT(filemap_set_wb_err,
+   TP_PROTO(struct address_space *mapping, errseq_t eseq),
+
+   TP_ARGS(mapping, eseq),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, errseq)
+   ),
+
+   TP_fast_assign(
+   __entry->i_ino = mapping->host->i_ino;
+   __entry->errseq = eseq;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = mapping->host->i_rdev;
+   ),
+
+   TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->errseq)
+);
+
+TRACE_EVENT(filemap_report_wb_err,
+   TP_PROTO(struct file *file, errseq_t old),
+
+   TP_ARGS(file, old),
+
+   TP_STRUCT__entry(
+   __field(struct file *, file);
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, old)
+   __field(errseq_t, new)
+   ),
+
+   TP_fast_assign(
+   __entry->file = file;
+   __entry->i_ino = file->f_mapping->host->i_ino;
+   if (file->f_mapping->host->i_sb)
+   __entry->s_dev = file->f_mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->old = old;
+   __entry->new = file->f_wb_err;
+   ),
+
+   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   __entry->file, MAJOR(__entry->s_dev),
+   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
+   __entry->new)
+);
 #endif /* _TRACE_FILEMAP_H */
 
 /* This part must be outside protection */
diff --git a/lib/errseq.c b/lib/errseq.c
index d129c0611c1f..009972d3000c 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -52,10 +52,14 @@
  *
  * Most callers will want to use the errseq_set inline wrapper to efficiently
  * handle the common case where err is 0.
+ *
+ * We do return an errseq_t here, primarily for debugging purposes. The r

[PATCH v5 05/17] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-05-31 Thread Jeff Layton
I waxed a little loquacious here, but I figured that more detail was
better, and writeback error handling is so hard to get right.

Although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara 
Signed-off-by: Jeff Layton 
---
 Documentation/filesystems/vfs.txt | 50 ---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f42b90687d40..c3efdd833a3d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,49 @@ should clear PG_Dirty and set PG_Writeback.  It can be actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure should call
+filemap_set_wb_err to record the error in the address_space when it
+occurs.  Then, at the end of their fsync operation, they should call
+filemap_report_wb_err to ensure that the struct file's error cursor
+has advanced to the correct point in the stream of errors emitted by
+the backing device(s).
+
+Older kernels used a different method for tracking errors, based on flags
+in the address_space. We're currently switching everything over to use
+the infrastructure based on errseq_t values. During the transition,
+filesystem authors will want to also ensure their file_system_type has
+FS_WB_ERRSEQ set in fs_flags to ensure that shared infrastructure is
+aware of the model in use.
 
 struct address_space_operations
 ---
@@ -804,7 +846,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +930,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.9.4
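The reporting semantics documented in this patch (every file open at the time of the error sees it once on fsync, later fsyncs return 0 until something new fails) can be sketched with a per-mapping sequence and a per-file cursor. This is a userspace toy under stated assumptions: the `toy_` names are hypothetical, a plain counter replaces the errseq_t, and there is no locking.

```c
#include <errno.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_mapping {
	unsigned int seq;	/* bumped on every recorded writeback error */
	int err;		/* most recent error, as a negative errno */
};

struct toy_file {
	struct toy_mapping *mapping;
	unsigned int cursor;	/* value of mapping->seq last seen by this file */
};

/* open(): start the cursor at the current point in the error stream. */
static void toy_open(struct toy_file *f, struct toy_mapping *m)
{
	f->mapping = m;
	f->cursor = m->seq;
}

/* Writeback failure: record it once, in the mapping. */
static void toy_set_wb_err(struct toy_mapping *m, int err)
{
	m->err = err;
	m->seq++;
}

/* End of fsync(): report anything new to this file, then advance its cursor. */
static int toy_report_wb_err(struct toy_file *f)
{
	if (f->cursor == f->mapping->seq)
		return 0;
	f->cursor = f->mapping->seq;
	return f->mapping->err;
}
```

With two files opened before a failure and one opened after it, the first two each get the error exactly once from `toy_report_wb_err()`, while the third gets 0 until a further error is recorded.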



[PATCH v5 10/17] block: add sync_blockdev_since and sync_filesystem_since

2017-05-31 Thread Jeff Layton
New variants of sync_filesystem and sync_blockdev.

Signed-off-by: Jeff Layton 
---
 fs/block_dev.c     | 15 +++
 fs/internal.h      |  8 
 fs/sync.c          | 45 +
 include/linux/fs.h | 13 -
 4 files changed, 80 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0d5f849e2a18..9da613ec1665 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -452,6 +452,15 @@ int __sync_blockdev(struct block_device *bdev, int wait)
return filemap_write_and_wait(bdev->bd_inode->i_mapping);
 }
 
+int __sync_blockdev_since(struct block_device *bdev, int wait, errseq_t since)
+{
+   if (!bdev)
+   return 0;
+   if (!wait)
+   return filemap_flush(bdev->bd_inode->i_mapping);
+   return filemap_write_and_wait_since(bdev->bd_inode->i_mapping, since);
+}
+
 /*
  * Write out and wait upon all the dirty data associated with a block
  * device via its mapping.  Does not take the superblock lock.
@@ -462,6 +471,12 @@ int sync_blockdev(struct block_device *bdev)
 }
 EXPORT_SYMBOL(sync_blockdev);
 
+int sync_blockdev_since(struct block_device *bdev, errseq_t since)
+{
+   return __sync_blockdev_since(bdev, 1, since);
+}
+EXPORT_SYMBOL(sync_blockdev_since);
+
 /*
  * Write out and wait upon all dirty data associated with this
  * device.   Filesystem data as well as the underlying block
diff --git a/fs/internal.h b/fs/internal.h
index 9676fe11c093..234343ba8af7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -25,6 +25,8 @@ struct shrink_control;
 extern void __init bdev_cache_init(void);
 
 extern int __sync_blockdev(struct block_device *bdev, int wait);
+extern int __sync_blockdev_since(struct block_device *bdev, int wait,
+   errseq_t since);
 
 #else
 static inline void bdev_cache_init(void)
@@ -35,6 +37,12 @@ static inline int __sync_blockdev(struct block_device *bdev, int wait)
 {
return 0;
 }
+
+static inline int __sync_blockdev_since(struct block_device *bdev, int wait,
+   errseq_t since)
+{
+   return 0;
+}
 #endif
 
 /*
diff --git a/fs/sync.c b/fs/sync.c
index 819a81526714..2a8202f9eb21 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -71,6 +71,51 @@ int sync_filesystem(struct super_block *sb)
 }
 EXPORT_SYMBOL(sync_filesystem);
 
+static int __sync_filesystem_since(struct super_block *sb, int wait,
+   errseq_t since)
+{
+   int fs_ret = 0, bd_ret;
+
+   if (wait)
+   sync_inodes_sb(sb);
+   else
+   writeback_inodes_sb(sb, WB_REASON_SYNC);
+
+   if (sb->s_op->sync_fs)
+   fs_ret = sb->s_op->sync_fs(sb, wait);
+   bd_ret = __sync_blockdev_since(sb->s_bdev, wait, since);
+
+   return fs_ret ? fs_ret : bd_ret;
+}
+
+/*
+ * Write out and wait upon all dirty data associated with this
+ * superblock.  Filesystem data as well as the underlying block
+ * device.  Takes the superblock lock.
+ */
+int sync_filesystem_since(struct super_block *sb, errseq_t since)
+{
+   int ret;
+
+   /*
+* We need to be protected against the filesystem going from
+* r/o to r/w or vice versa.
+*/
+   WARN_ON(!rwsem_is_locked(&sb->s_umount));
+
+   /*
+* No point in syncing out anything if the filesystem is read-only.
+*/
+   if (sb->s_flags & MS_RDONLY)
+   return 0;
+
+   ret = __sync_filesystem_since(sb, 0, since);
+   if (ret < 0)
+   return ret;
+   return __sync_filesystem_since(sb, 1, since);
+}
+EXPORT_SYMBOL(sync_filesystem_since);
+
 static void sync_inodes_one_sb(struct super_block *sb, void *arg)
 {
if (!(sb->s_flags & MS_RDONLY))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7d1bd3163d99..f483c23866c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2376,6 +2376,7 @@ extern void bdput(struct block_device *);
 extern void invalidate_bdev(struct block_device *);
 extern void iterate_bdevs(void (*)(struct block_device *, void *), void *);
 extern int sync_blockdev(struct block_device *bdev);
+extern int sync_blockdev_since(struct block_device *bdev, errseq_t since);
 extern void kill_bdev(struct block_device *);
 extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
@@ -2390,7 +2391,16 @@ static inline bool sb_is_blkdev_sb(struct super_block *sb)
 }
 #else
 static inline void bd_forget(struct inode *inode) {}
-static inline int sync_blockdev(struct block_device *bdev) { return 0; }
+static inline int sync_blockdev(struct block_device *bdev)
+{
+   return 0;
+}
+
+static inline int sync_blockdev_since(struct block_device *bdev,
+   errseq_t since)
+{
+   return 0;
+}
 static inline void kill_bdev(struct block_device *bdev) {}
 sta

[PATCH v5 08/17] dax: set errors in mapping when writeback fails

2017-05-31 Thread Jeff Layton
Jan's description for this patch is much better than mine, so I'm
quoting it verbatim here:

-8<-
DAX currently doesn't set errors in the mapping when cache flushing
fails in dax_writeback_mapping_range(). Since this function can get
called only from fsync(2) or sync(2), this is actually as good as it can
currently get since we correctly propagate the error up from
dax_writeback_mapping_range() to filemap_fdatawrite()

However, in the future better writeback error handling will enable us to
properly report these errors on fsync(2) even if there are multiple file
descriptors open against the file or if sync(2) gets called before
fsync(2). So convert DAX to using standard error reporting through the
mapping.
-8<-

For now, only do this when the FS_WB_ERRSEQ flag is set. The
AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when
writeback initiation fails, only when we discover an error after waiting
on writeback to complete, so we only want to do this with errseq_t based
error handling to prevent seeing duplicate errors on fsync.

Signed-off-by: Jeff Layton 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Reviewed-and-Tested-by: Ross Zwisler 
---
 fs/dax.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index c22eaf162f95..42788d8505c7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -856,8 +856,24 @@ int dax_writeback_mapping_range(struct address_space *mapping,
 
ret = dax_writeback_one(bdev, dax_dev, mapping,
indices[i], pvec.pages[i]);
-   if (ret < 0)
+   if (ret < 0) {
+   /*
+* For fs' that use errseq_t based error
+* tracking, we must call mapping_set_error
+* here to ensure that fsync on all open fds
+* get back an error. Doing this with the old
+* wb error tracking infrastructure is
+* problematic though, as DAX writeback is
+* synchronous, and the error flags are not
+* cleared when initiation fails, only when
+* it fails after the write has been submitted
+* to the backing store.
+*/
+   if (mapping->host->i_sb->s_type->fs_flags &
+   FS_WB_ERRSEQ)
+   mapping_set_error(mapping, ret);
goto out;
+   }
}
}
 out:
-- 
2.9.4
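The duplicate-error concern behind the FS_WB_ERRSEQ check can be made concrete with a toy model of the legacy flag scheme. This is one simplified reading of the commit message, not the kernel's code path: `toy_mapping`, `toy_check_errors` and `toy_legacy_fsync` are hypothetical, and the two boolean parameters exist only to drive the model.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a mapping that only has the legacy AS_EIO bit. */
struct toy_mapping {
	bool as_eio;
};

/* Legacy-style check: test and clear the flag. */
static int toy_check_errors(struct toy_mapping *m)
{
	if (m->as_eio) {
		m->as_eio = false;
		return -EIO;
	}
	return 0;
}

/*
 * Legacy-style fsync: if starting writeback fails, the error is returned
 * directly and the flag check (which would clear the flag) never runs.
 */
static int toy_legacy_fsync(struct toy_mapping *m, bool initiation_fails,
			    bool set_flag_on_failure)
{
	if (initiation_fails) {
		if (set_flag_on_failure)
			m->as_eio = true;	/* what mapping_set_error would do */
		return -EIO;
	}
	return toy_check_errors(m);
}
```

In this model, setting the flag on an initiation failure makes the next, otherwise successful, fsync return -EIO a second time. Leaving the flag alone reports the failure once, which is why the patch restricts the mapping_set_error call to errseq_t based filesystems.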



[PATCH v5 06/17] fs: adapt sync_file_range to new reporting infrastructure

2017-05-31 Thread Jeff Layton
Since it returns errors in a way similar to fsync, have it use the same
method for returning previously-reported writeback errors.

Signed-off-by: Jeff Layton 
---
 fs/sync.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index ec93aac4feb9..819a81526714 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -275,8 +275,11 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
  *
  *
  * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
- * I/O errors or ENOSPC conditions and will return those to the caller, after
- * clearing the EIO and ENOSPC flags in the address_space.
+ * error condition that occurred prior to or after writeback, and will return
+ * that to the caller, while advancing the file's errseq_t cursor. Note that
+ * any errors returned here may have occurred in an area of the file that is
+ * not covered by the given range as most filesystems track writeback errors
+ * on a per-address_space basis.
  *
  * It should be noted that none of these operations write out the file's
  * metadata.  So unless the application is strictly performing overwrites of
@@ -343,19 +346,25 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
ret = filemap_fdatawait_range(mapping, offset, endbyte);
if (ret < 0)
-   goto out_put;
+   goto out_report;
}
 
if (flags & SYNC_FILE_RANGE_WRITE) {
ret = __filemap_fdatawrite_range(mapping, offset, endbyte,
 WB_SYNC_NONE);
if (ret < 0)
-   goto out_put;
+   goto out_report;
}
 
if (flags & SYNC_FILE_RANGE_WAIT_AFTER)
ret = filemap_fdatawait_range(mapping, offset, endbyte);
 
+out_report:
+   if (mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ) {
+   int ret2 = filemap_report_wb_err(f.file);
+   if (!ret)
+   ret = ret2;
+   }
 out_put:
fdput(f);
 out:
-- 
2.9.4



[PATCH v5 09/17] block: convert to errseq_t based writeback error tracking

2017-05-31 Thread Jeff Layton
Fairly straightforward conversion. In fsync, just use the file->f_wb_err
value as a "since" value. At the end, call filemap_report_wb_err to
advance the cursor in it.

Signed-off-by: Jeff Layton 
---
 fs/block_dev.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d62fe771587..0d5f849e2a18 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -622,11 +622,13 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
 {
struct inode *bd_inode = bdev_file_inode(filp);
struct block_device *bdev = I_BDEV(bd_inode);
-   int error;
+   int error, wberr;
+   errseq_t since = READ_ONCE(filp->f_wb_err);

-   error = filemap_write_and_wait_range(filp->f_mapping, start, end);
+   error = filemap_write_and_wait_range_since(filp->f_mapping, start,
+   end, since);
if (error)
-   return error;
+   goto out;
 
/*
 * There is no need to serialise calls to blkdev_issue_flush with
@@ -637,6 +639,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
if (error == -EOPNOTSUPP)
error = 0;
 
+out:
+   wberr = filemap_report_wb_err(filp);
+   if (!error)
+   error = wberr;
return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);
@@ -801,6 +807,7 @@ static struct file_system_type bd_type = {
.name   = "bdev",
.mount  = bd_mount,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_WB_ERRSEQ,
 };
 
 struct super_block *blockdev_superblock __read_mostly;
-- 
2.9.4
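The control flow of the converted fsync (take a "since" value from the file, do the work, always advance the cursor at the end, return the first error found) can be sketched as follows. This is a userspace approximation with hypothetical `toy_` names: a counter replaces the errseq_t, the write-and-wait step is reduced to a check against the mapping, and `flush_result` stands in for the outcome of the device cache flush.

```c
#include <errno.h>

struct toy_mapping {
	unsigned int seq;	/* bumped on every recorded writeback error */
	int err;		/* most recent error, as a negative errno */
};

struct toy_file {
	struct toy_mapping *mapping;
	unsigned int cursor;	/* stand-in for file->f_wb_err */
};

/* Any error recorded after "since"? Does not move any cursor. */
static int toy_check_since(const struct toy_mapping *m, unsigned int since)
{
	return m->seq == since ? 0 : m->err;
}

/* Report to this file and advance its cursor. */
static int toy_report(struct toy_file *f)
{
	int err = toy_check_since(f->mapping, f->cursor);

	f->cursor = f->mapping->seq;
	return err;
}

static int toy_blkdev_fsync(struct toy_file *f, int flush_result)
{
	unsigned int since = f->cursor;
	int error, wberr;

	/* Stand-in for write-and-wait followed by a check against "since". */
	error = toy_check_since(f->mapping, since);
	if (!error) {
		error = flush_result;
		if (error == -EOPNOTSUPP)
			error = 0;
	}

	/* Always advance the cursor, but keep the first error found. */
	wberr = toy_report(f);
	if (!error)
		error = wberr;
	return error;
}
```

Calling the report step unconditionally matters in this model: without it, a writeback error returned early would be reported again on the next fsync because the cursor never moved.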



[PATCH v5 14/17] ext4: convert to errseq_t based error tracking

2017-05-31 Thread Jeff Layton
Sample the block device inode's errseq_t when opening a file, so we can
catch metadata writeback errors at fsync time. Change ext4_sync_file to
check for data errors first, and then check the blockdev for metadata
errors afterward.

There are also several internal callers of filemap_write_and_wait_* that
check the error code afterward. Convert them to the "_since" variants,
using the file->f_wb_err value as the "since" value. This means passing
file pointers to several functions instead of inode pointers.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors when there is a metadata writeback failure.

Signed-off-by: Jeff Layton 
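As a rough illustration of the two-cursor idea in this commit message, the sketch below gives each open file one cursor into its own data mapping and one into the shared block device mapping. It is a userspace toy: all `toy_` names are hypothetical, counters replace errseq_t values, and the assumption that fsync advances both cursors is mine, since the ext4_sync_file hunk is not visible here.

```c
#include <errno.h>

struct toy_mapping {
	unsigned int seq;	/* bumped on every recorded writeback error */
	int err;		/* most recent error, as a negative errno */
};

struct toy_file {
	struct toy_mapping *data;	/* the file's own pagecache */
	struct toy_mapping *bdev;	/* shared block device mapping */
	unsigned int f_wb_err;		/* cursor into data errors */
	unsigned int f_md_wb_err;	/* cursor into metadata errors */
};

static void toy_record(struct toy_mapping *m, int err)
{
	m->err = err;
	m->seq++;
}

/* open(): sample both error streams. */
static void toy_open(struct toy_file *f, struct toy_mapping *data,
		     struct toy_mapping *bdev)
{
	f->data = data;
	f->bdev = bdev;
	f->f_wb_err = data->seq;
	f->f_md_wb_err = bdev->seq;
}

static int toy_report(const struct toy_mapping *m, unsigned int *cursor)
{
	if (*cursor == m->seq)
		return 0;
	*cursor = m->seq;
	return m->err;
}

/* fsync(): data errors take precedence, metadata errors come second. */
static int toy_fsync(struct toy_file *f)
{
	int ret = toy_report(f->data, &f->f_wb_err);
	int md = toy_report(f->bdev, &f->f_md_wb_err);

	return ret ? ret : md;
}
```

Because the metadata cursor points at a mapping shared by every file on the device, a single metadata failure in this model is reported once on each open file, which matches the caveat in the last paragraph of the commit message.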
---
 fs/ext4/dir.c |  8 ++--
 fs/ext4/ext4.h|  8 
 fs/ext4/extents.c | 24 ++--
 fs/ext4/file.c|  5 -
 fs/ext4/fsync.c   | 23 ++-
 fs/ext4/inode.c   | 19 ---
 fs/ext4/ioctl.c   |  9 +
 fs/ext4/super.c   |  9 +
 8 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8e8046104f4d..e3ab27db43d0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2468,12 +2468,12 @@ extern void ext4_clear_inode(struct inode *);
 extern int  ext4_file_getattr(const struct path *, struct kstat *, u32, unsigned int);
 extern int  ext4_sync_inode(handle_t *, struct inode *);
 extern void ext4_dirty_inode(struct inode *, int);
-extern int ext4_change_inode_journal_flag(struct inode *, int);
+extern int ext4_change_inode_journal_flag(struct file *, int);
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
-extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
+extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
 extern int ext4_alloc_da_blocks(struct inode *inode);
@@ -3143,8 +3143,8 @@ extern ext4_lblk_t ext4_ext_next_allocated_block(struct ext4_ext_path *path);
 extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
__u64 start, __u64 len);
 extern int ext4_ext_precache(struct inode *inode);
-extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len);
-extern int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len);
+extern int ext4_collapse_range(struct file *file, loff_t offset, loff_t len);
+extern int ext4_insert_range(struct file *file, loff_t offset, loff_t len);
 extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
struct inode *inode2, ext4_lblk_t lblk1,
 ext4_lblk_t lblk2,  ext4_lblk_t count,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2a97dff87b96..7e108fda9ae9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4934,17 +4934,17 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return -EOPNOTSUPP;
 
if (mode & FALLOC_FL_PUNCH_HOLE)
-   return ext4_punch_hole(inode, offset, len);
+   return ext4_punch_hole(file, offset, len);
 
ret = ext4_convert_inline_data(inode);
if (ret)
return ret;
 
if (mode & FALLOC_FL_COLLAPSE_RANGE)
-   return ext4_collapse_range(inode, offset, len);
+   return ext4_collapse_range(file, offset, len);
 
if (mode & FALLOC_FL_INSERT_RANGE)
-   return ext4_insert_range(inode, offset, len);
+   return ext4_insert_range(file, offset, len);
 
if (mode & FALLOC_FL_ZERO_RANGE)
return ext4_zero_range(file, offset, len, mode);
@@ -5444,14 +5444,16 @@ ext4_ext_shift_extents(struct inode *inode, handle_t *handle,
  * This implements the fallocate's collapse range functionality for ext4
  * Returns: 0 and non-zero on er

[PATCH v5 12/17] fs: allow __generic_file_fsync to support both flavors of error reporting

2017-05-31 Thread Jeff Layton
For now, we add a FS_WB_ERRSEQ check to know how to handle it.

Signed-off-by: Jeff Layton 
---
 fs/libfs.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 1dec90819366..2ae58a252718 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -971,10 +971,18 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
 int datasync)
 {
struct inode *inode = file->f_mapping->host;
-   int err;
-   int ret;
-
-   err = filemap_write_and_wait_range(inode->i_mapping, start, end);
+   int err, ret;
+   bool use_errseq = inode->i_sb->s_type->fs_flags & FS_WB_ERRSEQ;
+   errseq_t since;
+
+   if (use_errseq) {
+   since = READ_ONCE(file->f_wb_err);
+   err = filemap_write_and_wait_range_since(inode->i_mapping,
+   start, end, since);
+   } else {
+   err = filemap_write_and_wait_range(inode->i_mapping,
+   start, end);
+   }
if (err)
return err;
 
@@ -988,11 +996,15 @@ int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
err = sync_inode_metadata(inode, 1);
if (ret == 0)
ret = err;
-
 out:
inode_unlock(inode);
-   err = filemap_check_errors(inode->i_mapping);
-   return ret ? ret : err;
+   if (ret == 0) {
+   if (use_errseq)
+   ret = filemap_check_wb_err(inode->i_mapping, since);
+   else
+   ret = filemap_check_errors(inode->i_mapping);
+   }
+   return ret;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.9.4



[PATCH v5 15/17] fs: add a write_one_page_since

2017-05-31 Thread Jeff Layton
Allow filesystems to pass in an errseq_t for a since value.

Signed-off-by: Jeff Layton 
---
 include/linux/mm.h  |  2 ++
 mm/page-writeback.c | 53 +
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca9c8b27cecb..c901d7313374 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct mempolicy;
 struct anon_vma;
@@ -2200,6 +2201,7 @@ extern int filemap_page_mkwrite(struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
 int __must_check write_one_page(struct page *page);
+int __must_check write_one_page_since(struct page *page, errseq_t since);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e369e8ea2a29..63058e35c60d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2365,19 +2365,10 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
return ret;
 }
 
-/**
- * write_one_page - write out a single page and wait on I/O
- * @page: the page to write
- *
- * The page must be locked by the caller and will be unlocked upon return.
- *
- * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
- * function returns.
- */
-int write_one_page(struct page *page)
+static int __write_one_page(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   int ret = 0, ret2;
+   int ret;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 1,
@@ -2394,16 +2385,54 @@ int write_one_page(struct page *page)
wait_on_page_writeback(page);
put_page(page);
} else {
+   ret = 0;
unlock_page(page);
}
+   return ret;
+}
 
+/**
+ * write_one_page - write out a single page and wait on I/O
+ * @page: the page to write
+ *
+ * The page must be locked by the caller and will be unlocked upon return.
+ *
+ * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
+ * function returns.
+ */
+int write_one_page(struct page *page)
+{
+   int ret;
+
+   ret = __write_one_page(page);
if (!ret)
-   ret = filemap_check_errors(mapping);
+   ret = filemap_check_errors(page->mapping);
return ret;
 }
 EXPORT_SYMBOL(write_one_page);
 
 /*
+ * write_one_page_since - write out a single page and wait on I/O
+ * @page: the page to write
+ * @since: previously sampled errseq_t
+ *
+ * The page must be locked by the caller and will be unlocked upon return.
+ *
+ * The caller should pass in a previously-sampled errseq_t. The mapping will
+ * be checked for errors since that point.
+ */
+int write_one_page_since(struct page *page, errseq_t since)
+{
+   int ret;
+
+   ret = __write_one_page(page);
+   if (!ret)
+   ret = filemap_check_wb_err(page->mapping, since);
+   return ret;
+}
+EXPORT_SYMBOL(write_one_page_since);
+
+/*
  * For address_spaces which do not use buffers nor write back.
  */
 int __set_page_dirty_no_writeback(struct page *page)
-- 
2.9.4
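The difference between the "_since" variant and a cursor-advancing report is that the caller owns the sample point. The sketch below models that in userspace under stated assumptions: the `toy_` names are hypothetical, a counter replaces the errseq_t, `write_result` stands for what the writepage step returns, and `io_error` stands for an error recorded in the mapping while waiting on the I/O (0 if none).

```c
#include <errno.h>

struct toy_mapping {
	unsigned int seq;	/* bumped on every recorded writeback error */
	int err;		/* most recent error, as a negative errno */
};

/* Take a sample point in the mapping's error stream. */
static unsigned int toy_sample(const struct toy_mapping *m)
{
	return m->seq;
}

/* Any error recorded after "since"? The sample point itself is not moved. */
static int toy_check_since(const struct toy_mapping *m, unsigned int since)
{
	return m->seq == since ? 0 : m->err;
}

static int toy_write_one_page_since(struct toy_mapping *m, int write_result,
				    int io_error, unsigned int since)
{
	/* Model an asynchronous failure being recorded in the mapping. */
	if (io_error) {
		m->err = io_error;
		m->seq++;
	}

	/* A direct failure wins; otherwise look for errors since the sample. */
	if (write_result)
		return write_result;
	return toy_check_since(m, since);
}
```

In this model an error recorded before the sample is never reported, while an error recorded afterwards keeps being reported for as long as the caller reuses the same "since" value, so callers need to resample when they want a fresh starting point.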



[PATCH v5 13/17] jbd2: conditionally handle errors using errseq_t based on FS_WB_ERRSEQ flag

2017-05-31 Thread Jeff Layton
Grab the current mapping->wb_err when linking a transaction to the list
and stash it in the journal inode. Then we can use that as a "since"
value when committing it to ensure that there were no writeback errors
since the transaction was started.

We do still need to perform old-style error handling too for now in
journal_finish_inode_data_buffers. jbd2 is shared infrastructure between
several filesystems. Eventually we should be able to remove the flag check
and simplify this function again.

For journal recovery, sample the wb_err early on and then pass that as
the since value to sync_blockdev_since.

Signed-off-by: Jeff Layton 
---
 fs/jbd2/commit.c  | 29 +++--
 fs/jbd2/recovery.c|  5 +++--
 fs/jbd2/transaction.c |  1 +
 include/linux/jbd2.h  |  3 +++
 4 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b6b194ec1b4f..aea71e4bc9be 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -259,21 +259,30 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
/* For locking, see the comment in journal_submit_data_buffers() */
spin_lock(&journal->j_list_lock);
list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+   struct inode *inode = jinode->i_vfs_inode;
+
if (!(jinode->i_flags & JI_WAIT_DATA))
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
-   err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
-   if (err) {
-   /*
-* Because AS_EIO is cleared by
-* filemap_fdatawait_range(), set it again so
-* that user process can get -EIO from fsync().
-*/
-   mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
-
-   if (!ret)
+   if (inode->i_sb->s_type->fs_flags & FS_WB_ERRSEQ) {
+   err = filemap_fdatawait_since(inode->i_mapping,
+   jinode->i_since);
+   if (err && !ret)
ret = err;
+   } else {
+   err = filemap_fdatawait(inode->i_mapping);
+   if (err) {
+   /*
+* Because AS_EIO is cleared by
+* filemap_fdatawait_range(), we must set it again so
+* that user process can get -EIO from fsync() if
+* non-errseq_t based error tracking is in play.
+*/
+   mapping_set_error(inode->i_mapping, -EIO);
+   if (!ret)
+   ret = err;
+   }
}
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 02dd3360cb20..06a8ee71848c 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -248,11 +248,12 @@ int jbd2_journal_recover(journal_t *journal)
 {
int err, err2;
journal_superblock_t *  sb;
-
struct recovery_info info;
+   errseq_t since;
 
memset(&info, 0, sizeof(info));
sb = journal->j_superblock;
+   since = filemap_sample_wb_err(journal->j_fs_dev->bd_inode->i_mapping);
 
/*
 * The journal superblock's s_start field (the current log head)
@@ -284,7 +285,7 @@ int jbd2_journal_recover(journal_t *journal)
journal->j_transaction_sequence = ++info.end_transaction;
 
jbd2_journal_clear_revoke(journal);
-   err2 = sync_blockdev(journal->j_fs_dev);
+   err2 = sync_blockdev_since(journal->j_fs_dev, since);
if (!err)
err = err2;
/* Make sure all replayed data is on permanent storage */
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 9ee4832b6f8b..e9e6af20a087 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2535,6 +2535,7 @@ static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
/* Not on any transaction list... */
J_ASSERT(!jinode->i_next_transaction);
jinode->i_transaction = transaction;
+   jinode->i_since = filemap_sample_wb_err(jinode->i_vfs_inode->i_mapping);
list_add(&jinode->i_list, &transaction->t_inode_list);
 done:
spin_unlock(&journal->j_list_lock);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..b6901eac2d8e 100

[PATCH v5 16/17] ext2: convert to errseq_t based writeback error tracking

2017-05-31 Thread Jeff Layton
Set the flag to indicate that we want new-style data writeback error
handling.

This means that we need to override the open routines for files and
directories so that we can sample the bdev wb_err at open.

XXX: doesn't quite pass the xfstest for this currently, as ext2_error
 resets the error on the device inode on every call.

Signed-off-by: Jeff Layton 
---
 fs/ext2/dir.c   |  8 
 fs/ext2/file.c  | 29 +++--
 fs/ext2/super.c |  2 +-
 3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index e2709695b177..6e476c9929f8 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -713,6 +713,13 @@ int ext2_empty_dir (struct inode * inode)
return 0;
 }
 
+static int ext2_dir_open(struct inode *inode, struct file *file)
+{
+   /* Sample blockdev mapping errseq_t for metadata writeback */
+   file->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return 0;
+}
+
 const struct file_operations ext2_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
@@ -721,5 +728,6 @@ const struct file_operations ext2_dir_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
+   .open   = ext2_dir_open,
.fsync  = ext2_fsync,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ed00e7ae0ef3..6f3cd7bc3fb3 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -172,16 +172,23 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
 
 int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
-   int ret;
+   int ret, ret2;
struct super_block *sb = file->f_mapping->host->i_sb;
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO) {
-   /* We don't really know where the IO error happened... */
-   ext2_error(sb, __func__,
+
+   ret2 = filemap_report_wb_err(file);
+   if (ret == 0)
+   ret = ret2;
+
+   ret2 = filemap_report_md_wb_err(file, mapping);
+   if (ret2) {
+   if (ret == 0)
+   ret = ret2;
+   if (ret == -EIO)
+   ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
-   ret = -EIO;
}
return ret;
 }
@@ -204,6 +211,16 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
return generic_file_write_iter(iocb, from);
 }
 
+static int ext2_file_open(struct inode *inode, struct file *file)
+{
+   int ret;
+
+   ret = dquot_file_open(inode, file);
+   if (likely(ret == 0))
+   file->f_md_wb_err = filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
+}
+
 const struct file_operations ext2_file_operations = {
.llseek = generic_file_llseek,
.read_iter  = ext2_file_read_iter,
@@ -213,7 +230,7 @@ const struct file_operations ext2_file_operations = {
.compat_ioctl   = ext2_compat_ioctl,
 #endif
.mmap   = ext2_file_mmap,
-   .open   = dquot_file_open,
+   .open   = ext2_file_open,
.release = ext2_release_file,
.fsync  = ext2_fsync,
.get_unmapped_area = thp_get_unmapped_area,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 9c2028b50e5c..dd37d7f955bf 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1629,7 +1629,7 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext2_mount,
.kill_sb = kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV|FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("ext2");
 
-- 
2.9.4



[PATCH v5 17/17] fs: convert ext2 to use write_one_page_since

2017-05-31 Thread Jeff Layton
Sample the wb_err before changing the directory, so that we can catch
errors that occur since that point.

Signed-off-by: Jeff Layton 
---
 fs/ext2/dir.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 6e476c9929f8..073f096ac5e6 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -85,7 +85,8 @@ ext2_last_byte(struct inode *inode, unsigned long page_nr)
return last_byte;
 }
 
-static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
+static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len,
+   errseq_t since)
 {
struct address_space *mapping = page->mapping;
struct inode *dir = mapping->host;
@@ -100,7 +101,7 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
}
 
if (IS_DIRSYNC(dir)) {
-   err = write_one_page(page);
+   err = write_one_page_since(page, since);
if (!err)
err = sync_inode_metadata(dir, 1);
} else {
@@ -462,13 +463,14 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
(char *) de - (char *) page_address(page);
unsigned len = ext2_rec_len_from_disk(de->rec_len);
int err;
+   errseq_t since = filemap_sample_wb_err(dir->i_mapping);
 
lock_page(page);
err = ext2_prepare_chunk(page, pos, len);
BUG_ON(err);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type(de, inode);
-   err = ext2_commit_chunk(page, pos, len);
+   err = ext2_commit_chunk(page, pos, len, since);
ext2_put_page(page);
if (update_times)
dir->i_mtime = dir->i_ctime = current_time(dir);
@@ -494,6 +496,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
char *kaddr;
loff_t pos;
int err;
+   errseq_t since = filemap_sample_wb_err(dir->i_mapping);
 
/*
 * We take care of directory expansion in the same loop.
@@ -560,7 +563,7 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
memcpy(de->name, name, namelen);
de->inode = cpu_to_le32(inode->i_ino);
ext2_set_de_type (de, inode);
-   err = ext2_commit_chunk(page, pos, rec_len);
+   err = ext2_commit_chunk(page, pos, rec_len, since);
dir->i_mtime = dir->i_ctime = current_time(dir);
EXT2_I(dir)->i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
@@ -589,6 +592,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
ext2_dirent * pde = NULL;
ext2_dirent * de = (ext2_dirent *) (kaddr + from);
int err;
+   errseq_t since = filemap_sample_wb_err(inode->i_mapping);
 
while ((char*)de < (char*)dir) {
if (de->rec_len == 0) {
@@ -609,7 +613,7 @@ int ext2_delete_entry (struct ext2_dir_entry_2 * dir, struct page * page )
if (pde)
pde->rec_len = ext2_rec_len_to_disk(to - from);
dir->inode = 0;
-   err = ext2_commit_chunk(page, pos, to - from);
+   err = ext2_commit_chunk(page, pos, to - from, since);
inode->i_ctime = inode->i_mtime = current_time(inode);
EXT2_I(inode)->i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(inode);
@@ -628,6 +632,7 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
struct ext2_dir_entry_2 * de;
int err;
void *kaddr;
+   errseq_t since = filemap_sample_wb_err(inode->i_mapping);
 
if (!page)
return -ENOMEM;
@@ -653,7 +658,7 @@ int ext2_make_empty(struct inode *inode, struct inode *parent)
memcpy (de->name, "..\0", 4);
ext2_set_de_type (de, inode);
kunmap_atomic(kaddr);
-   err = ext2_commit_chunk(page, 0, chunk_size);
+   err = ext2_commit_chunk(page, 0, chunk_size, since);
 fail:
put_page(page);
return err;
-- 
2.9.4



[PATCH v5 07/17] mm: add filemap_fdatawait_range_since and filemap_write_and_wait_range_since

2017-05-31 Thread Jeff Layton
Add new filemap_*wait* variants that take a "since" value and return an
error if one occurred since that sample point.
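
As a rough illustration of why a "since" value helps, the sketch below
contrasts a test-and-clear flag (in the spirit of AS_EIO) with a check
against a sampled counter. It is a simplified userspace model, not the
mm/filemap.c code; struct mapping_model and the functions record_error(),
wait_old() and wait_since() are hypothetical names used only here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the error state kept per mapping. */
struct mapping_model {
	bool eio_flag;		/* old style: one sticky bit */
	unsigned int seq;	/* new style: bumped on every recorded error */
	int last_err;		/* most recent negative errno */
};

/* A failed writeback updates both the flag and the counter. */
void record_error(struct mapping_model *m, int err)
{
	m->eio_flag = true;
	m->last_err = err;
	m->seq++;
}

/* Old style: reporting the error also clears it for everyone else. */
int wait_old(struct mapping_model *m)
{
	if (!m->eio_flag)
		return 0;
	m->eio_flag = false;
	return -EIO;
}

/* New style: compare against the caller's own sample, leave state alone. */
int wait_since(const struct mapping_model *m, unsigned int since)
{
	return m->seq == since ? 0 : m->last_err;
}
```

With the flag, only the first waiter sees the error. With a sample point,
every caller that sampled before the error sees it, and a caller that
sampled afterwards does not.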

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h |  9 
 mm/filemap.c   | 67 ++
 2 files changed, 76 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2f3bcf4eb73b..7d1bd3163d99 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2516,12 +2516,21 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
+extern int filemap_fdatawait_since(struct address_space *, errseq_t);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
+extern int filemap_fdatawait_range_since(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte,
+  errseq_t since);
 extern int filemap_write_and_wait(struct address_space *mapping);
+extern int filemap_write_and_wait_since(struct address_space *mapping,
+   errseq_t since);
 extern int filemap_write_and_wait_range(struct address_space *mapping,
loff_t lstart, loff_t lend);
+extern int filemap_write_and_wait_range_since(struct address_space *mapping,
+  loff_t start_byte, loff_t end_byte,
+  errseq_t since);
 extern int __filemap_fdatawrite_range(struct address_space *mapping,
loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 97dc28f853fc..38a14dc825ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -431,6 +431,14 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
+int filemap_fdatawait_range_since(struct address_space *mapping, loff_t start_byte,
+ loff_t end_byte, errseq_t since)
+{
+   __filemap_fdatawait_range(mapping, start_byte, end_byte);
+   return filemap_check_wb_err(mapping, since);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range_since);
+
 /**
  * filemap_fdatawait_keep_errors - wait for writeback without clearing errors
  * @mapping: address space structure to wait for
@@ -476,6 +484,17 @@ int filemap_fdatawait(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_fdatawait);
 
+int filemap_fdatawait_since(struct address_space *mapping, errseq_t since)
+{
+   loff_t i_size = i_size_read(mapping->host);
+
+   if (i_size == 0)
+   return 0;
+
+   return filemap_fdatawait_range_since(mapping, 0, i_size - 1, since);
+}
+EXPORT_SYMBOL(filemap_fdatawait_since);
+
 int filemap_write_and_wait(struct address_space *mapping)
 {
int err = 0;
@@ -501,6 +520,31 @@ int filemap_write_and_wait(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_write_and_wait);
 
+int filemap_write_and_wait_since(struct address_space *mapping, errseq_t since)
+{
+   int err = 0;
+
+   if ((!dax_mapping(mapping) && mapping->nrpages) ||
+   (dax_mapping(mapping) && mapping->nrexceptional)) {
+   err = filemap_fdatawrite(mapping);
+   /*
+* Even if the above returned error, the pages may be
+* written partially (e.g. -ENOSPC), so we wait for it.
+* But the -EIO is special case, it may indicate the worst
+* thing (e.g. bug) happened, so we avoid waiting for it.
+*/
+   if (err != -EIO) {
+   int err2 = filemap_fdatawait_since(mapping, since);
+   if (!err)
+   err = err2;
+   }
+   } else {
+   err = filemap_check_wb_err(mapping, since);
+   }
+   return err;
+}
+EXPORT_SYMBOL(filemap_write_and_wait_since);
+
 /**
  * filemap_write_and_wait_range - write out & wait on a file range
  * @mapping:   the address_space for the pages
@@ -535,6 +579,29 @@ int filemap_write_and_wait_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(filemap_write_and_wait_range);
 
+int filemap_write_and_wait_range_since(struct address_space *mapping,
+loff_t lstart, loff_t lend, errseq_t since)
+{
+   int err = 0;
+
+   if ((!dax_mapping(mapping) && mapping->nrpages) ||
+   (dax_mapping(mapping) && mapping->nrexceptional)) {
+   err = __filemap_fdatawrite_range(mapping, lstart, lend,
+WB_SYNC_ALL);
+   /* See c

[PATCH v5 11/17] fs: add f_md_wb_err field to struct file for tracking metadata errors

2017-05-31 Thread Jeff Layton
Some filesystems (particularly local ones) keep a different mapping for
metadata writeback. Add a second errseq_t to struct file for tracking
metadata writeback errors. Also add a new function for checking a
mapping of the caller's choosing vs. the f_md_wb_err value.
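
The two-cursor arrangement can be sketched in userspace as follows. This is
only a model under stated assumptions: struct errseq_model, struct
file_model, check_and_advance() and model_fsync() are hypothetical names,
and the locking and lockless fast path of the real code are left out. The
way the two results are combined follows the pattern used by ext2_fsync()
in this series (first error wins, both cursors are always advanced).

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of one mapping's error sequence. */
struct errseq_model {
	unsigned int seq;	/* bumped on every recorded error */
	int err;		/* most recent negative errno */
};

/* Hypothetical model of the two per-file cursors. */
struct file_model {
	unsigned int f_wb_err;		/* cursor into the data mapping */
	unsigned int f_md_wb_err;	/* cursor into the metadata mapping */
};

/* Report an unseen error once, then move the cursor past it. */
static int check_and_advance(const struct errseq_model *e,
			     unsigned int *cursor)
{
	if (*cursor == e->seq)
		return 0;
	*cursor = e->seq;
	return e->err;
}

/* Check both mappings; a data error takes precedence over a metadata one. */
int model_fsync(struct file_model *f, const struct errseq_model *data,
		const struct errseq_model *md)
{
	int ret = check_and_advance(data, &f->f_wb_err);
	int ret2 = check_and_advance(md, &f->f_md_wb_err);

	return ret ? ret : ret2;
}
```

In this model each open file reports a metadata writeback error exactly
once, independently of what other open files have already reported.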

Signed-off-by: Jeff Layton 
---
 include/linux/fs.h |  3 +++
 include/trace/events/filemap.h | 23 ++-
 mm/filemap.c   | 40 +++-
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f483c23866c4..df1d68e3605a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -871,6 +871,7 @@ struct file {
struct list_head f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
+   errseq_t f_md_wb_err; /* optional metadata wb error tracking */
 } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2549,6 +2550,8 @@ extern int filemap_fdatawrite_range(struct address_space *mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern int __must_check filemap_report_md_wb_err(struct file *file,
+   struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 2af66920f267..6e0d78c01a2e 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -79,12 +79,11 @@ TRACE_EVENT(filemap_set_wb_err,
 );
 
 TRACE_EVENT(filemap_report_wb_err,
-   TP_PROTO(struct file *file, errseq_t old),
+   TP_PROTO(struct address_space *mapping, errseq_t old, errseq_t new),
 
-   TP_ARGS(file, old),
+   TP_ARGS(mapping, old, new),
 
TP_STRUCT__entry(
-   __field(struct file *, file);
__field(unsigned long, i_ino)
__field(dev_t, s_dev)
__field(errseq_t, old)
@@ -92,20 +91,18 @@ TRACE_EVENT(filemap_report_wb_err,
),
 
TP_fast_assign(
-   __entry->file = file;
-   __entry->i_ino = file->f_mapping->host->i_ino;
-   if (file->f_mapping->host->i_sb)
-   __entry->s_dev = file->f_mapping->host->i_sb->s_dev;
+   __entry->i_ino = mapping->host->i_ino;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
else
-   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->s_dev = mapping->host->i_rdev;
__entry->old = old;
-   __entry->new = file->f_wb_err;
+   __entry->new = new;
),
 
-   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
-   __entry->file, MAJOR(__entry->s_dev),
-   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
-   __entry->new)
+   TP_printk("dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->old, __entry->new)
 );
 #endif /* _TRACE_FILEMAP_H */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 38a14dc825ad..0edf0234973e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -631,21 +631,20 @@ EXPORT_SYMBOL(__filemap_set_wb_err);
  * value is protected by the f_lock since we must ensure that it reflects
  * the latest value swapped in for this file descriptor.
  */
-int filemap_report_wb_err(struct file *file)
+static int __filemap_report_wb_err(errseq_t *cursor, spinlock_t *lock,
+   struct address_space *mapping)
 {
int err = 0;
-   errseq_t old = READ_ONCE(file->f_wb_err);
-   struct address_space *mapping = file->f_mapping;
+   errseq_t old = READ_ONCE(*cursor);
 
/* Locklessly handle the common case where nothing has changed */
if (errseq_check(&mapping->wb_err, old)) {
/* Something changed, must use slow path */
-   spin_lock(&file->f_lock);
-   old = file->f_wb_err;
-   err = errseq_check_and_advance(&mapping->wb_err,
-   &file->f_wb_err);
-   trace_filemap_report_wb_err(file, old);
-   spin_unlock(&file->f_lock);
+ 

Re: [PATCH 21/36] fs: locks: Fix some troubles at kernel-doc comments

2017-05-12 Thread Jeff Layton
On Fri, 2017-05-12 at 11:00 -0300, Mauro Carvalho Chehab wrote:
> There are a few syntax violations that cause outputs of
> a few comments to not be properly parsed in ReST format.
> 
> No functional changes.
> 
> Signed-off-by: Mauro Carvalho Chehab 
> ---
>  fs/locks.c | 18 --
>  1 file changed, 8 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index 26811321d39b..bdce708e4251 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1858,8 +1858,8 @@ EXPORT_SYMBOL(generic_setlease);
>   *
>   * Call this to establish a lease on the file. The "lease" argument is not
>   * used for F_UNLCK requests and may be NULL. For commands that set or alter
> - * an existing lease, the (*lease)->fl_lmops->lm_break operation must be set;
> - * if not, this function will return -ENOLCK (and generate a scary-looking
> + * an existing lease, the ``(*lease)->fl_lmops->lm_break`` operation must be
> + * set; if not, this function will return -ENOLCK (and generate a scary-looking
>   * stack trace).
>   *
>   * The "priv" pointer is passed directly to the lm_setup function as-is. It
> @@ -1972,15 +1972,13 @@ EXPORT_SYMBOL(locks_lock_inode_wait);
>   *   @cmd: the type of lock to apply.
>   *
>   *   Apply a %FL_FLOCK style lock to an open file descriptor.
> - *   The @cmd can be one of
> + *   The @cmd can be one of:
>   *
> - *   %LOCK_SH -- a shared lock.
> - *
> - *   %LOCK_EX -- an exclusive lock.
> - *
> - *   %LOCK_UN -- remove an existing lock.
> - *
> - *   %LOCK_MAND -- a `mandatory' flock.  This exists to emulate Windows Share Modes.
> + *   - %LOCK_SH -- a shared lock.
> + *   - %LOCK_EX -- an exclusive lock.
> + *   - %LOCK_UN -- remove an existing lock.
> + *   - %LOCK_MAND -- a 'mandatory' flock.
> + * This exists to emulate Windows Share Modes.
>   *
>   *   %LOCK_MAND can be combined with %LOCK_READ or %LOCK_WRITE to allow other
>   *   processes read and write access respectively.

LGTM. Do you need me or Bruce to pick this one up?

Reviewed-by: Jeff Layton 


Re: [PATCH v4] ceph: set io_pages bdi hint

2017-01-10 Thread Jeff Layton
On Tue, 2017-01-10 at 14:17 +0100, Andreas Gerstmayr wrote:
> This patch sets the io_pages bdi hint based on the rsize mount option.
> Without this patch large buffered reads (request size > max readahead)
> are processed sequentially in chunks of the readahead size (i.e. read
> requests are sent out up to the readahead size, then the
> do_generic_file_read() function waits until the first page is received).
> 
> With this patch read requests are sent out at once up to the size
> specified in the rsize mount option (default: 64 MB).
> 
> Signed-off-by: Andreas Gerstmayr 
> ---
> 
> Changes in v4:
>   - update documentation
> 
> (Note: This patch depends on kernel version 4.10-rc1)
> 
> 
>  Documentation/filesystems/ceph.txt | 5 ++---
>  fs/ceph/super.c| 8 
>  fs/ceph/super.h| 4 ++--
>  3 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt
> index f5306ee..0b302a1 100644
> --- a/Documentation/filesystems/ceph.txt
> +++ b/Documentation/filesystems/ceph.txt
> @@ -98,11 +98,10 @@ Mount Options
>   size.
>  
>rsize=X
> - Specify the maximum read size in bytes.  By default there is no
> - maximum.
> + Specify the maximum read size in bytes.  Default: 64 MB.
>  
>rasize=X
> - Specify the maximum readahead.
> + Specify the maximum readahead.  Default: 8 MB.
>  
>mount_timeout=X
>   Specify the timeout value for mount (in seconds), in the case
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index 6bd20d7..a0a0b6d 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -952,6 +952,14 @@ static int ceph_register_bdi(struct super_block *sb,
>   fsc->backing_dev_info.ra_pages =
>   VM_MAX_READAHEAD * 1024 / PAGE_SIZE;
>  
> + if (fsc->mount_options->rsize > fsc->mount_options->rasize &&
> + fsc->mount_options->rsize >= PAGE_SIZE)
> + fsc->backing_dev_info.io_pages =
> + (fsc->mount_options->rsize + PAGE_SIZE - 1)
> + >> PAGE_SHIFT;
> + else if (fsc->mount_options->rsize == 0)
> + fsc->backing_dev_info.io_pages = ULONG_MAX;
> +
>   err = bdi_register(&fsc->backing_dev_info, NULL, "ceph-%ld",
>  atomic_long_inc_return(&bdi_seq));
>   if (!err)
> diff --git a/fs/ceph/super.h b/fs/ceph/super.h
> index 3373b61..88b2e6e 100644
> --- a/fs/ceph/super.h
> +++ b/fs/ceph/super.h
> @@ -45,8 +45,8 @@
>  #define ceph_test_mount_opt(fsc, opt) \
>   (!!((fsc)->mount_options->flags & CEPH_MOUNT_OPT_##opt))
>  
> -#define CEPH_RSIZE_DEFAULT 0   /* max read size */
> -#define CEPH_RASIZE_DEFAULT (8192*1024) /* readahead */
> +#define CEPH_RSIZE_DEFAULT  (64*1024*1024) /* max read size */
> +#define CEPH_RASIZE_DEFAULT (8192*1024) /* max readahead */
>  #define CEPH_MAX_READDIR_DEFAULT 1024
>  #define CEPH_MAX_READDIR_BYTES_DEFAULT  (512*1024)
>  #define CEPH_SNAPDIRNAME_DEFAULT ".snap"

Acked-by: Jeff Layton 
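
For reference, the io_pages arithmetic in the quoted hunk can be tried out
in userspace with the sketch below. It assumes 4 KiB pages (MODEL_PAGE_SHIFT
is a stand-in for PAGE_SHIFT), and model_io_pages() is a hypothetical helper
that takes the previous io_pages value as "cur" so that the untouched case
is visible; it is not a function from fs/ceph.

```c
#include <assert.h>
#include <limits.h>

/* Assumption: 4 KiB pages. The kernel uses PAGE_SHIFT/PAGE_SIZE instead. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE (1UL << MODEL_PAGE_SHIFT)

/*
 * Mirror of the condition in the quoted ceph_register_bdi() hunk:
 *  - rsize larger than rasize and at least one page: round rsize up to pages
 *  - rsize == 0 (no maximum): allow as many pages as possible
 *  - otherwise: leave the previous value (cur) alone
 */
unsigned long model_io_pages(unsigned long rsize, unsigned long rasize,
			     unsigned long cur)
{
	if (rsize > rasize && rsize >= MODEL_PAGE_SIZE)
		return (rsize + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
	if (rsize == 0)
		return ULONG_MAX;
	return cur;
}
```

Under the 4 KiB assumption, the new defaults (rsize of 64 MB, rasize of
8 MB) give 16384 pages.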