from:"Jeff Layton"

[PATCH RESEND] loop: clear wb_err in bd_inode when detaching backing file

2018-05-21 Thread Jeff Layton

From: Jeff Layton <jlay...@redhat.com>

When a loop block device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. If we then detach the
backing file from it, attach another and fsync it, we'll get back the
writeback error that we had from the previous backing file.

This is a bit of a grey area as POSIX doesn't cover loop devices, but it
is somewhat counterintuitive.

If we detach a backing file from the loopdev while there are still
unreported errors, take it as a sign that we're no longer interested in
the previous file, and clear out the wb_err in the loop blockdev.

Reported-and-Tested-by: Theodore Y. Ts'o <ty...@mit.edu>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

Resend to fix the attributions.

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..55cf554bc914 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo)
if (bdev) {
bdput(bdev);
invalidate_bdev(bdev);
+   bdev->bd_inode->i_mapping->wb_err = 0;
}
set_capacity(lo->lo_disk, 0);
loop_sysfs_exit(lo);
-- 
2.17.0

Re: [PATCH] loop: clear wb_err in bd_inode when detaching backing file

2018-05-21 Thread Jeff Layton

On Mon, 2018-05-21 at 08:37 -0400, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> When a loop block device encounters a writeback error, that error will
> get propagated to the bd_inode's wb_err field. If we then detach the
> backing file from it, attach another and fsync it, we'll get back the
> writeback error that we had from the previous backing file.
> 
> This is a bit of a grey area as POSIX doesn't cover loop devices, but it
> is somewhat counterintuitive.
> 
> If we detach a backing file from the loopdev while there are still
> unreported errors, take it as a sign that we're no longer interested in
> the previous file, and clear out the wb_err in the loop blockdev.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  drivers/block/loop.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 5d4e31655d96..55cf554bc914 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo)
>   if (bdev) {
>   bdput(bdev);
>   invalidate_bdev(bdev);
> + bdev->bd_inode->i_mapping->wb_err = 0;
>   }
>   set_capacity(lo->lo_disk, 0);
>   loop_sysfs_exit(lo);

Jens,

I sent this to the mailing lists earlier, but it occurs to me that
loop.c changes should probably go through the block tree. Would you
mind picking this up and feeding it to next and on to Linus?

I think we probably want this in v4.17 if possible, as it is a
regression from v4.16 (due to commit b4678df184b3) and was causing some
failures in xfstests runs.

Hmm, now that I think about it, we should probably also give Ted
"Reported-and-Tested-by" credit here too. Let me know if you'd rather I
resend with that added.

Thanks,
-- 
Jeff Layton <jlay...@kernel.org>

[PATCH] loop: clear wb_err in bd_inode when detaching backing file

2018-05-21 Thread Jeff Layton

From: Jeff Layton <jlay...@redhat.com>

When a loop block device encounters a writeback error, that error will
get propagated to the bd_inode's wb_err field. If we then detach the
backing file from it, attach another and fsync it, we'll get back the
writeback error that we had from the previous backing file.

This is a bit of a grey area as POSIX doesn't cover loop devices, but it
is somewhat counterintuitive.

If we detach a backing file from the loopdev while there are still
unreported errors, take it as a sign that we're no longer interested in
the previous file, and clear out the wb_err in the loop blockdev.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..55cf554bc914 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1068,6 +1068,7 @@ static int loop_clr_fd(struct loop_device *lo)
if (bdev) {
bdput(bdev);
invalidate_bdev(bdev);
+   bdev->bd_inode->i_mapping->wb_err = 0;
}
set_capacity(lo->lo_disk, 0);
loop_sysfs_exit(lo);
-- 
2.17.0

Re: [PATCH] fs: clear writeback errors in inode_init_always

2018-05-21 Thread Jeff Layton

On Sun, 2018-05-20 at 15:41 -0400, Theodore Y. Ts'o wrote:
> On Sun, May 20, 2018 at 12:20:09PM -0700, Matthew Wilcox wrote:
> > Oh!  I misunderstood.  I thought that bd_inode->i_mapping pointed to
> > lo_backing_file->f_mapping.
> > 
> > So how about this to preserve the error _in the file with the error_?
> > 
> > if (bdev) {
> > bdput(bdev);
> > invalidate_bdev(bdev);
> > +   filp->f_mapping->wb_err = bdev->bd_inode->i_mapping->wb_err;
> > +   bdev->bd_inode->i_mapping->wb_err = 0;
> > }
> > set_capacity(lo->lo_disk, 0);
> > loop_sysfs_exit(lo);
> 
> I don't think it's necessary.  wb_err on the underlying file would
> have already been set when the error was set.  So if someone tried
> calling fsync(2) on the underlying file, it would have collected the
> error already.  Re-setting the error when we detach the loop device
> doesn't seem to make any sense.
> 

(cc'ing linux-block)

Yes, the error would have been set on the original file already. As long
as lo_req_flush or __loop_update_dio weren't called before we detach the
device, it should still be possible to call fsync on the original fd and
get back the error.

As a side note, it looks like __loop_update_dio will discard an error
from vfs_fsync, so certain ioctls against a loop device can cause errors
to be lost. It seems like those ought to get propagated to userland or
to the blockdev's mapping somehow.
-- 
Jeff Layton <jlay...@kernel.org>

Re: write call hangs in kernel space after virtio hot-remove

2018-05-03 Thread Jeff Layton

ot issue any fsync()
> > calls.
> > 
> > 
> > Does my analysis make any sense and would something along these lines be
> > acceptable as a solution?
> 
> Thanks for the debugging of the problem. I agree with your analysis however
> I don't like your fix. The issue is that when bdi is unregistered we don't
> really expect any writeback to happen after that moment. This is what
> prevents various use-after-free issues and I'd like that to stay the way it
> is.
> 
> What I think we should do is that we'll prevent dirtying of new pages when
> we know the underlying device is gone. Because that will fix your problem
> and also make sure user sees the IO errors directly instead of just in the
> kernel log. The question is how to make this happen in the least painful
> way. I think we could intercept writes in grab_cache_page_write_begin()
> (which however requires that function to return a proper error code and not
> just NULL / non-NULL). And we should also intercept write faults to not
> allow page dirtying via mmap - probably somewhere in do_shared_fault() and
> do_wp_page(). I've added Jeff to CC since he's dealing with IO error
> handling a lot these days. Jeff, what do you think?
> 
>   Honza

(cc'ing Willy too since he's given this more thought than me)

For the record, I've mostly been looking at error _reporting_. Handling
errors at this level is not something I've really considered in great
detail as of yet.

Still, I think the basic idea sounds reasonable. Not allowing pages to
be dirtied when we can't clean them seems like a reasonable thing to
do.

The big question is how we'll report this to userland:

Would your approach have it return an error on write() and such? What
sort of error if so? ENODEV? Would we have to SIGBUS when someone tries
to dirty the page through mmap?
--
Jeff Layton <jlay...@kernel.org>

Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jeff Layton

On Thu, 2018-04-19 at 13:26 -0400, Jerome Glisse wrote:
> On Thu, Apr 19, 2018 at 12:58:39PM -0400, Jeff Layton wrote:
> > On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> > > On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > > > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > > > Oh can i get one more small slot for fs ? I want to ask if they are
> > > > > any people against having a callback everytime a struct file is added
> > > > > to a task_struct and also having a secondary array so that special
> > > > > file like device file can store something opaque per task_struct per
> > > > > struct file.
> > > > 
> > > > Do you really want something per _thread_, and not per _mm_?
> > > 
> > > Well per mm would be fine but i do not see how to make that happen with
> > > reasonable structure. So issue is that you can have multiple task with
> > > same mm but different file descriptors (or am i wrong here ?) thus there
> > > would be no easy way given a struct file to lookup the per mm struct.
> > > 
> > > So as a not perfect solution i see a new array in filedes which would
> > > allow device driver to store a pointer to their per mm data structure.
> > > To be fair usualy you will only have a single fd in a single task for
> > > a given device.
> > > 
> > > If you see an easy way to get a per mm per inode pointer store somewhere
> > > with easy lookup i am all ears :)
> > > 
> > 
> > I may be misunderstanding, but to be clear: struct files don't get
> > added to a thread, per-se.
> > 
> > When userland calls open() or similar, the struct file gets added to
> > the files_struct. Those are generally shared with other threads within
> > the same process. The files_struct can also be shared with other
> > processes if you clone() with the right flags.
> > 
> > Doing something per-thread on every open may be rather difficult to do.
> 
> Basicly i want a callback in __fd_install(), do_dup2(), dup_fd() and
> add void * *private_data; to struct fdtable (also a default array to
> struct files_struct). The callback would be part of struct file_operations.
> and only call if it exist (os overhead is only for device driver that
> care).
> 
> Did i miss something fundamental ? copy_files() call dup_fd() so i
> should be all set here.
> 
> I will work on patches i was hoping this would not be too much work.
> 

No, I think I misunderstood. I was thinking you wanted to iterate over
all of the threads that might be associated with a struct file, and
that's rather non-trivial.

A callback when you add a file to the files_struct seems like it would
probably be OK (in principle).
-- 
Jeff Layton <jlay...@kernel.org>

Re: [LSF/MM] schedule suggestion

2018-04-19 Thread Jeff Layton

On Thu, 2018-04-19 at 12:30 -0400, Jerome Glisse wrote:
> On Thu, Apr 19, 2018 at 07:43:56AM -0700, Matthew Wilcox wrote:
> > On Thu, Apr 19, 2018 at 10:38:25AM -0400, Jerome Glisse wrote:
> > > Oh can i get one more small slot for fs ? I want to ask if they are
> > > any people against having a callback everytime a struct file is added
> > > to a task_struct and also having a secondary array so that special
> > > file like device file can store something opaque per task_struct per
> > > struct file.
> > 
> > Do you really want something per _thread_, and not per _mm_?
> 
> Well per mm would be fine but i do not see how to make that happen with
> reasonable structure. So issue is that you can have multiple task with
> same mm but different file descriptors (or am i wrong here ?) thus there
> would be no easy way given a struct file to lookup the per mm struct.
> 
> So as a not perfect solution i see a new array in filedes which would
> allow device driver to store a pointer to their per mm data structure.
> To be fair usualy you will only have a single fd in a single task for
> a given device.
> 
> If you see an easy way to get a per mm per inode pointer store somewhere
> with easy lookup i am all ears :)
> 

I may be misunderstanding, but to be clear: struct files don't get
added to a thread, per-se.

When userland calls open() or similar, the struct file gets added to
the files_struct. Those are generally shared with other threads within
the same process. The files_struct can also be shared with other
processes if you clone() with the right flags.

Doing something per-thread on every open may be rather difficult to do.
-- 
Jeff Layton <jlay...@kernel.org>

Re: [PATCH 00/11] fs: use freeze_fs on suspend/hibernate

2017-12-01 Thread Jeff Layton

On Thu, 2017-11-30 at 17:41 +0100, Jiri Kosina wrote:
> On Fri, 1 Dec 2017, Yu Chen wrote:
> 
> > BTW, is nfs able to be included in this set? I also encountered a 
> > freeze() failure due to nfs access during that stage recently.
> 
> The freezer usage in NFS is magnitudes more complicated, so it makes sense 
> to first go after the lower hanging fruit to figure out the viability of 
> the whole aproach in practice.
> 

Agreed that we should do this in stages. It doesn't help that freezer
handling in the client is a bit of a mess at this point...

At a high level for NFS, I think we need to have freeze_fs make the RPC
engine "park" newly issued RPCs for that fs' client onto a
rpc_wait_queue. Any RPC that has already been sent however, we need to
wait for a reply.

Once everything is quiesced we can return and call it frozen.
unfreeze_fs can then just have the engine stop parking RPCs and wake up
the waitq.

That should be enough to make suspend and resume work more reliably. If,
however, you're interested in making the cgroup freezer also work, then
we may need to do a bit more work to ensure that we don't end up with
frozen tasks squatting on VFS locks.
-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v8 17/18] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-30 Thread Jeff Layton

On Thu, 2017-06-29 at 07:12 -0700, Christoph Hellwig wrote:
> Nice and simple, this looks great!
> 
> Reviewed-by: Christoph Hellwig <h...@lst.de>

Thanks! I think this turned out to be a lot cleaner too.

For filesystems that use filemap_write_and_wait_range today this now
becomes a pretty straight conversion to file_write_and_wait_range -- one
liner patches for the most part.

I've started rolling patches to do that, but now I'm wondering...

Should I aim to do that with an individual patch for each fs, or is it
better to do a swath of them all at once in a single patch here?
-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v8 18/18] btrfs: minimal conversion to errseq_t writeback error reporting on fsync

2017-06-29 Thread Jeff Layton

On Thu, 2017-06-29 at 07:17 -0700, Christoph Hellwig wrote:
> On Thu, Jun 29, 2017 at 09:19:54AM -0400, jlay...@kernel.org wrote:
> > From: Jeff Layton <jlay...@redhat.com>
> > 
> > Just check and advance the errseq_t in the file before returning.
> > Internal callers of filemap_* functions are left as-is.
> > 
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > ---
> >  fs/btrfs/file.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > index da1096eb1a40..1f57e1a523d9 100644
> > --- a/fs/btrfs/file.c
> > +++ b/fs/btrfs/file.c
> > @@ -2011,7 +2011,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> > loff_t end, int datasync)
> > struct btrfs_root *root = BTRFS_I(inode)->root;
> > struct btrfs_trans_handle *trans;
> > struct btrfs_log_ctx ctx;
> > -   int ret = 0;
> > +   int ret = 0, err;
> > bool full_sync = 0;
> > u64 len;
> >  
> > @@ -2030,7 +2030,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> > loff_t end, int datasync)
> >  */
> > ret = start_ordered_ops(inode, start, end);
> > if (ret)
> > -   return ret;
> > +   goto out;
> >  
> > inode_lock(inode);
> > atomic_inc(>log_batch);
> > @@ -2227,6 +2227,9 @@ int btrfs_sync_file(struct file *file, loff_t start, 
> > loff_t end, int datasync)
> > ret = btrfs_end_transaction(trans);
> > }
> >  out:
> > +   err = file_check_and_advance_wb_err(file);
> > +   if (!ret)
> > +   ret = err;
> > return ret > 0 ? -EIO : ret;
> 
> This means that we'll lose the exact error returned from
> start_ordered_ops.  Beyond that I can't really provide good feedback
> as the btrfs fsync code looks so much different from all the other
> fs fsync code..

Well, no...we'll keep the error from start_ordered_ops if there was one.
 We just advance the cursor past any stored error in that case without
returning it.

I have another fix for this patch too: there's a call to
filemap_check_errors in this function that I think should probably use
filemap_check_wb_err instead. Fixed in my tree.

I do agree though that while this works in my testing I'd like the btrfs
guys to ACK this as I don't fully grok the btrfs fsync code at all.
-- 
Jeff Layton <jlay...@poochiereds.net>

Re: [PATCH v8 16/18] ext4: use errseq_t based error handling for reporting data writeback errors

2017-06-29 Thread Jeff Layton

On Thu, 2017-06-29 at 07:12 -0700, Christoph Hellwig wrote:
> > -   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb
> > -   return -EIO;
> > +   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb {
> > +   ret = -EIO;
> > +   goto out;
> > +   }
> 
> This just seems to add a call to trace_ext4_sync_file_exit for this
> case, which seems unrelated to the patch.
> 
> > if (ret)
> > -   return ret;
> > +   goto out;
> > +
> 
> Same here.
> 
> > /*
> >  * data=writeback,ordered:
> >  *  The caller's filemap_fdatawrite()/wait will sync the data.
> > @@ -152,7 +155,7 @@ int ext4_sync_file(struct file *file, loff_t start, 
> > loff_t end, int datasync)
> > needs_barrier = true;
> > ret = jbd2_complete_transaction(journal, commit_tid);
> > if (needs_barrier) {
> > -   issue_flush:
> > +issue_flush:
> > err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
> 
> And while I much prefer your new label placement it also doesn't
> seem to belong into this patch.

I revised the patch description earlier to say:

While we're at it, ensure we always "goto out" instead of just
returning directly, so that we always hit the exit tracepoint.

...but I'm fine with taking that out if you prefer.
-- 
Jeff Layton <jlay...@poochiereds.net>

Re: [PATCH v8 03/18] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-06-29 Thread Jeff Layton

On Thu, 2017-06-29 at 07:19 -0700, Christoph Hellwig wrote:
> On Thu, Jun 29, 2017 at 09:19:39AM -0400, jlay...@kernel.org wrote:
> > From: Jeff Layton <jlay...@redhat.com>
> > 
> > ext2 currently does a test+clear of the AS_EIO flag, which is
> > is problematic for some coming changes.
> > 
> > What we really need to do instead is call filemap_check_errors
> > in __generic_file_fsync after syncing out the buffers. That
> > will be sufficient for this case, and help other callers detect
> > these errors properly as well.
> > 
> > With that, we don't need to twiddle it in ext2.
> 
> Seems like much of this code is getting replaced later in the
> series..


It does. I suppose I could squash this in with the __generic_file_fsync
patch.

-- 
Jeff Layton <jlay...@poochiereds.net>

Re: [PATCH v8 12/18] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-29 Thread Jeff Layton

On Thu, 2017-06-29 at 10:11 -0700, Darrick J. Wong wrote:
> On Thu, Jun 29, 2017 at 09:19:48AM -0400, jlay...@kernel.org wrote:
> > From: Jeff Layton <jlay...@redhat.com>
> > 
> > Let's try to make this extra clear for fs authors.
> > 
> > Cc: Jan Kara <j...@suse.cz>
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > ---
> >  Documentation/filesystems/vfs.txt | 43 
> > ---
> >  1 file changed, 40 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/vfs.txt 
> > b/Documentation/filesystems/vfs.txt
> > index f42b90687d40..1366043b3942 100644
> > --- a/Documentation/filesystems/vfs.txt
> > +++ b/Documentation/filesystems/vfs.txt
> > @@ -576,7 +576,42 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
> > actually
> >  written at any point after PG_Dirty is clear.  Once it is known to be
> >  safe, PG_Writeback is cleared.
> >  
> > -Writeback makes use of a writeback_control structure...
> > +Writeback makes use of a writeback_control structure to direct the
> > +operations.  This gives the the writepage and writepages operations some
> > +information about the nature of and reason for the writeback request,
> > +and the constraints under which it is being done.  It is also used to
> > +return information back to the caller about the result of a writepage or
> > +writepages request.
> > +
> > +Handling errors during writeback
> > +
> > +Most applications that utilize the pagecache will periodically call
> > +fsync to ensure that data written has made it to the backing store.
> 
> /me wonders if this sentence ought to be worded more strongly, e.g.
> 
> "Applications that utilize the pagecache must call a data
> synchronization syscall such as fsync, fdatasync, or msync to ensure
> that data written has made it to the backing store."
> 

Well...only if they care about the data. There are some that don't. :)

> I'm also wondering -- fdatasync and msync will also report any writeback
> errors that have happened anywhere (like fsync), since they all map to
> vfs_fsync_range, correct?  If so, I think it worth it to state
> explicitly that the other *sync methods behave the same as fsync w.r.t.
> writeback error reporting.
> 

Good point. I'll fix this to make it clear that fsync, msync and
fdatasync all advance the error cursor in the same way.

While we're on the subject...

What should we do about sync_file_range here? It doesn't currently call
any filesystem operations directly, so we don't have a good way to make
it selectively use errseq_t handling there.

I could resurrect the FS_* flag for that, though I don't really like
that. Should I just go ahead and convert it over to use errseq_t under
the theory that most callers will eventually want that anyway?

Thanks for the review so far!

> > +When there is an error during writeback, they expect that error to be
> > +reported when fsync is called.  After an error has been reported on one
> > +fsync, subsequent fsync calls on the same file descriptor should return
> > +0, unless further writeback errors have occurred since the previous
> > +fsync.
> > +
> > +Ideally, the kernel would report an error only on file descriptions on
> > +which writes were done that subsequently failed to be written back.  The
> > +generic pagecache infrastructure does not track the file descriptions
> > +that have dirtied each individual page however, so determining which
> > +file descriptors should get back an error is not possible.
> > +
> > +Instead, the generic writeback error tracking infrastructure in the
> > +kernel settles for reporting errors to fsync on all file descriptions
> > +that were open at the time that the error occurred.  In a situation with
> > +multiple writers, all of them will get back an error on a subsequent fsync,
> > +even if all of the writes done through that particular file descriptor
> > +succeeded (or even if there were no writes on that file descriptor at all).
> > +
> > +Filesystems that wish to use this infrastructure should call
> > +mapping_set_error to record the error in the address_space when it
> > +occurs.  Then, at the end of their fsync operation, they should call
> > +file_check_and_advance_wb_err to ensure that the struct file's error
> > +cursor has advanced to the correct point in the stream of errors emitted
> > +by the backing device(s).
> >  
> >  struct address_space_operations
> >  ---
> > @@ -804,7 +839,8 @@ struct address_space_operations {
> >  The File Ob

Re: [PATCH v8 10/18] fs: new infrastructure for writeback error handling and reporting

2017-06-29 Thread Jeff Layton

On Thu, 2017-06-29 at 09:19 -0400, jlay...@kernel.org wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> The non-fsync callers are problematic. We should be reporting writeback
> errors during fsync, but many places spread over the tree clear out
> errors before they can be properly reported, or report errors at
> nonsensical times.
> 
> If I get -EIO on a stat() call, there is no reason for me to assume that
> it is because some previous writeback failed. The fact that it also
> clears out the error such that a subsequent fsync returns 0 is a bug,
> and a nasty one since that's potentially silent data corruption.
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during address_space writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fds may not be associated with one another in any way. They could even
> be in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds an errseq_t to struct address_space, and a corresponding
> one is added to struct file. Writeback errors are recorded in the
> mapping's errseq_t, and the one in struct file is used as the "since"
> value.
> 
> This changes the semantics of the Linux fsync implementation such that
> applications can now use it to determine whether there were any
> writeback errors since fsync(fd) was last called (or since the file was
> opened in the case of fsync having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The new behavior is still consistent with the POSIX spec, and is more
> reliable for application developers. This patch just adds some basic
> infrastructure for doing this, and ensures that the f_wb_err "cursor"
> is properly set when a file is opened. Later patches will change the
> existing code to use this new infrastructure for reporting errors at
> fsync time.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Reviewed-by: Jan Kara <j...@suse.cz>
> ---
>  drivers/dax/device.c   |  1 +
>  fs/block_dev.c |  1 +
>  fs/file_table.c|  1 +
>  fs/open.c  |  3 ++
>  include/linux/fs.h | 60 -
>  include/trace/events/filemap.h | 57 
>  mm/filemap.c   | 86 
> ++
>  7 files changed, 208 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 006e657dfcb9..12943d19bfc4 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file 
> *filp)
>   inode->i_mapping = __dax_inode->i_mapping;
>   inode->i_mapping->host = __dax_inode;
>   filp->f_mapping = inode->i_mapping;
> + filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
>   filp->private_data = dev_dax;
>   inode->i_flags = S_DAX;
>  
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 51959936..4d62fe771587 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct 
> file * filp)
>   return -ENOMEM;
>  
>   filp->f_mapping = bdev->bd_

Re: [PATCH v8 10/18] fs: new infrastructure for writeback error handling and reporting

2017-06-29 Thread Jeff Layton

On Thu, 2017-06-29 at 09:19 -0400, jlay...@kernel.org wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> The non-fsync callers are problematic. We should be reporting writeback
> errors during fsync, but many places spread over the tree clear out
> errors before they can be properly reported, or report errors at
> nonsensical times.
> 
> If I get -EIO on a stat() call, there is no reason for me to assume that
> it is because some previous writeback failed. The fact that it also
> clears out the error such that a subsequent fsync returns 0 is a bug,
> and a nasty one since that's potentially silent data corruption.
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during address_space writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fds may not be associated with one another in any way. They could even
> be in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds an errseq_t to struct address_space, and a corresponding
> one is added to struct file. Writeback errors are recorded in the
> mapping's errseq_t, and the one in struct file is used as the "since"
> value.
> 
> This changes the semantics of the Linux fsync implementation such that
> applications can now use it to determine whether there were any
> writeback errors since fsync(fd) was last called (or since the file was
> opened in the case of fsync having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The new behavior is still consistent with the POSIX spec, and is more
> reliable for application developers. This patch just adds some basic
> infrastructure for doing this, and ensures that the f_wb_err "cursor"
> is properly set when a file is opened. Later patches will change the
> existing code to use this new infrastructure for reporting errors at
> fsync time.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> Reviewed-by: Jan Kara <j...@suse.cz>
> ---
>  drivers/dax/device.c   |  1 +
>  fs/block_dev.c |  1 +
>  fs/file_table.c|  1 +
>  fs/open.c  |  3 ++
>  include/linux/fs.h | 60 -
>  include/trace/events/filemap.h | 57 
>  mm/filemap.c   | 86 
> ++
>  7 files changed, 208 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 006e657dfcb9..12943d19bfc4 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file 
> *filp)
>   inode->i_mapping = __dax_inode->i_mapping;
>   inode->i_mapping->host = __dax_inode;
>   filp->f_mapping = inode->i_mapping;
> + filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
>   filp->private_data = dev_dax;
>   inode->i_flags = S_DAX;
>  
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 51959936..4d62fe771587 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct 
> file * filp)
>   return -ENOMEM;
>  
>   filp->f_mapping = bdev->bd_

Re: [PATCH v7 16/22] block: convert to errseq_t based writeback error tracking

2017-06-26 Thread Jeff Layton

On Sat, 2017-06-24 at 09:16 -0400, Jeff Layton wrote:
> On Sat, 2017-06-24 at 04:59 -0700, Christoph Hellwig wrote:
> > On Tue, Jun 20, 2017 at 01:44:44PM -0400, Jeff Layton wrote:
> > > In order to query for errors with errseq_t, you need a previously-
> > > sampled point from which to check. When you call
> > > filemap_write_and_wait_range though you don't have a struct file and so
> > > no previously-sampled value.
> > 
> > So can we simply introduce variants of them that take a struct file?
> > That would be:
> > 
> >  a) less churn
> >  b) less code
> >  c) less chance to get data integrity wrong
> 
> Yeah, I had that thought after I sent the reply to you earlier.
> 
> The main reason I didn't do that before was that I had myself convinced
> that we needed to do the check_and_advance as late as possible in the
> fsync process, after the metadata had been written.
> 
> Now that I think about it more, I think you're probably correct. As long
> as we do the check and advance at some point after doing the
> write_and_wait, we're fine here and shouldn't violate exactly once
> semantics on the fsync return.

So I have a file_write_and_wait_range now that should DTRT for this
patch.

The bigger question is -- what about more complex filesystems like
ext4?  There are a couple of cases where we can return -EIO or -EROFS on
fsync before filemap_write_and_wait_range is ever called. Like this one
for instance:

if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb
return -EIO;

...and the EXT4_MF_FS_ABORTED case.

Are those conditions ever recoverable, such that a later fsync could
succeed? IOW, could I do a remount or something such that the existing
fds are left open and become usable again? 

If so, then we really ought to advance the errseq_t in the file when we
catch those cases as well. If we have to do that, then it probably makes
sense to leave the ext4 patch as-is.
-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v7 16/22] block: convert to errseq_t based writeback error tracking

2017-06-24 Thread Jeff Layton

On Sat, 2017-06-24 at 04:59 -0700, Christoph Hellwig wrote:
> On Tue, Jun 20, 2017 at 01:44:44PM -0400, Jeff Layton wrote:
> > In order to query for errors with errseq_t, you need a previously-
> > sampled point from which to check. When you call
> > filemap_write_and_wait_range though you don't have a struct file and so
> > no previously-sampled value.
> 
> So can we simply introduce variants of them that take a struct file?
> That would be:
> 
>  a) less churn
>  b) less code
>  c) less chance to get data integrity wrong

Yeah, I had that thought after I sent the reply to you earlier.

The main reason I didn't do that before was that I had myself convinced
that we needed to do the check_and_advance as late as possible in the
fsync process, after the metadata had been written.

Now that I think about it more, I think you're probably correct. As long
as we do the check and advance at some point after doing the
write_and_wait, we're fine here and shouldn't violate exactly once
semantics on the fsync return.
-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v7 16/22] block: convert to errseq_t based writeback error tracking

2017-06-20 Thread Jeff Layton

On Tue, 2017-06-20 at 05:35 -0700, Christoph Hellwig wrote:
> > error = filemap_write_and_wait_range(filp->f_mapping, start, end);
> > if (error)
> > -   return error;
> > +   goto out;
> >  
> > /*
> >  * There is no need to serialise calls to blkdev_issue_flush with
> > @@ -640,6 +640,10 @@ int blkdev_fsync(struct file *filp, loff_t start, 
> > loff_t end, int datasync)
> > if (error == -EOPNOTSUPP)
> > error = 0;
> >  
> > +out:
> > +   wberr = filemap_report_wb_err(filp);
> > +   if (!error)
> > +   error = wberr;
> 
> Just curious: what's the reason filemap_write_and_wait_range couldn't
> query for the error using filemap_report_wb_err internally?

In order to query for errors with errseq_t, you need a previously-
sampled point from which to check. When you call
filemap_write_and_wait_range though you don't have a struct file and so
no previously-sampled value.

-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v7 00/22] fs: enhanced writeback error reporting with errseq_t (pile #1)

2017-06-20 Thread Jeff Layton

On Tue, 2017-06-20 at 09:25 +1000, Stephen Rothwell wrote:
> Hi Jeff,
> 
> On Mon, 19 Jun 2017 12:23:46 -0400 Jeff Layton <jlay...@redhat.com> wrote:
> > 
> > If there are no major objections to this set, I'd like to have
> > linux-next start picking it up to get some wider testing. What's the
> > right vehicle for this, given that it touches stuff all over the tree?
> > 
> > I can see 3 potential options:
> > 
> > 1) I could just pull these into the branch that Stephen is already
> > picking up for file-locks in my tree
> > 
> > 2) I could put them into a new branch, and have Stephen pull that one in
> > addition to the file-locks branch
> > 
> > 3) It could go in via someone else's tree entirely (Andrew or Al's
> > maybe?)
> > 
> > I'm fine with any of these. Anyone have thoughts?
> 
> Given that this is a one off development, either 1 or 3 (in Al's tree)
> would be fine.  2 is a possibility (but people forget to ask me to
> remove one shot trees :-()
> 

Ok -- yeah, I'd probably be one of those people who forget too...

In that case, I'll plan to go ahead and just merge these into my
linux-next branch. That's easier than bugging others for it. Hopefully
we won't have a lot in the way of merge conflicts.

I'll see about getting this into branch later today, and hopefully we
can get it into linux-next for tomorrow.

Thanks!
-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v7 00/22] fs: enhanced writeback error reporting with errseq_t (pile #1)

2017-06-19 Thread Jeff Layton

On Fri, 2017-06-16 at 15:34 -0400, Jeff Layton wrote:
> v7:
> ===
> This is the seventh posting of the patchset to revamp the way writeback
> errors are tracked and reported.
> 
> The main difference from the v6 posting is the removal of the
> FS_WB_ERRSEQ flag. That requires a few other incremental patches in the
> writeback code to ensure that both error tracking models are handled
> in a suitable way.
> 
> Also, a bit more cleanup of the metadata writeback codepaths, and some
> documentation updates.
> 
> Some of these patches have been posted separately, but I'm re-posting
> them here to make it clear that they're prerequisites of the later
> patches in the series.
> 
> This series is based on top of linux-next from a day or so ago. I'd like
> to have this picked up by linux-next in the near future so we can get
> some more extensive testing with it. Should I just plan to maintain a
> topic branch on top of -next and ask Stephen to pick it up?
> 
> Background:
> ===
> The basic problem is that we have (for a very long time) tracked and
> reported writeback errors based on two flags in the address_space:
> AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked,
> so only the first caller to check them is able to consume them.
> 
> That model is quite unreliable, for several related reasons:
> 
> * only the first fsync caller on the inode will see the error. In a
>   world of containerized setups, that's no longer viable. Applications
>   need to know that their writes are safely stored, and they can
>   currently miss seeing errors that they should be aware of when
>   they're not.
> 
> * there are a lot of internal callers to filemap_fdatawait* and
>   filemap_write_and_wait* that clear these errors but then never report
>   them to userland in any fashion.
> 
> * Some internal callers report writeback errors, but can do so at
>   non-sensical times. For instance, we might want to truncate a file,
>   which triggers a pagecache flush. If that writeback fails, we might
>   report that error to the truncate caller, but a subsequent fsync
>   will likely not see it.
> 
> * Some internal callers try to reset the error flags after clearing
>   them, but that's racy. Another task could check the flags between
>   those two events.
> 
> Solution:
> =
> This patchset adds a new datatype called an errseq_t that represents a
> sequence of errors. It's a u32, with a field for a POSIX-flavor error
> and a counter, managed with atomics. We can sample that value at a
> particular point in time, and can later tell whether there have been any
> errors since that point.
> 
> That allows us to provide traditional check-and-clear fsync semantics
> on every open file description in a lightweight fashion. fsync callers
> no longer need to coordinate between one another in order to ensure
> that errors at fsync time are handled correctly.
> 
> Strategy:
> =
> The aim with this pile is to do the minimum possible to support for
> reliable reporting of errors on fsync, without substantially changing
> the internals of the filesystems themselves.
> 
> Most of the internal calls to filemap_fdatawait are left alone, so all
> of the internal error checkers are using the same error handling they
> always have. The only real difference here is that we're better
> reporting errors at fsync.
> 
> I think that we probably will want to eventually convert all of those
> internal callers to use errseq_t based reporting, but that can be done
> in an incremental fashion in follow-on patchsets.
> 
> Testing:
> 
> I've primarily been testing this with some new xfstests that I will post
> in a separate series. These tests use dm-error fault injection to make
> the underlying block device start throwing I/O errors, and then test the
> how the filesystem layer reports errors after that.
> 
> Jeff Layton (22):
>   fs: remove call_fsync helper function
>   buffer: use mapping_set_error instead of setting the flag
>   fs: check for writeback errors after syncing out buffers in
> generic_file_fsync
>   buffer: set errors in mapping at the time that the error occurs
>   jbd2: don't clear and reset errors after waiting on writeback
>   mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
>   mm: don't TestClearPageError in __filemap_fdatawait_range
>   mm: clean up error handling in write_one_page
>   fs: always sync metadata in __generic_file_fsync
>   lib: add errseq_t type and infrastructure for handling it
>   fs: new infrastructure for writeback error handling and reporting
>   mm: tracepoints for writeback error events
>   mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
>

[PATCH v7 00/22] fs: enhanced writeback error reporting with errseq_t (pile #1)

2017-06-16 Thread Jeff Layton

v7:
===
This is the seventh posting of the patchset to revamp the way writeback
errors are tracked and reported.

The main difference from the v6 posting is the removal of the
FS_WB_ERRSEQ flag. That requires a few other incremental patches in the
writeback code to ensure that both error tracking models are handled
in a suitable way.

Also, a bit more cleanup of the metadata writeback codepaths, and some
documentation updates.

Some of these patches have been posted separately, but I'm re-posting
them here to make it clear that they're prerequisites of the later
patches in the series.

This series is based on top of linux-next from a day or so ago. I'd like
to have this picked up by linux-next in the near future so we can get
some more extensive testing with it. Should I just plan to maintain a
topic branch on top of -next and ask Stephen to pick it up?

Background:
===
The basic problem is that we have (for a very long time) tracked and
reported writeback errors based on two flags in the address_space:
AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked,
so only the first caller to check them is able to consume them.

That model is quite unreliable, for several related reasons:

* only the first fsync caller on the inode will see the error. In a
  world of containerized setups, that's no longer viable. Applications
  need to know that their writes are safely stored, and they can
  currently miss seeing errors that they should be aware of when
  they're not.

* there are a lot of internal callers to filemap_fdatawait* and
  filemap_write_and_wait* that clear these errors but then never report
  them to userland in any fashion.

* Some internal callers report writeback errors, but can do so at
  non-sensical times. For instance, we might want to truncate a file,
  which triggers a pagecache flush. If that writeback fails, we might
  report that error to the truncate caller, but a subsequent fsync
  will likely not see it.

* Some internal callers try to reset the error flags after clearing
  them, but that's racy. Another task could check the flags between
  those two events.

Solution:
=
This patchset adds a new datatype called an errseq_t that represents a
sequence of errors. It's a u32, with a field for a POSIX-flavor error
and a counter, managed with atomics. We can sample that value at a
particular point in time, and can later tell whether there have been any
errors since that point.

That allows us to provide traditional check-and-clear fsync semantics
on every open file description in a lightweight fashion. fsync callers
no longer need to coordinate between one another in order to ensure
that errors at fsync time are handled correctly.

Strategy:
=
The aim with this pile is to do the minimum possible to support for
reliable reporting of errors on fsync, without substantially changing
the internals of the filesystems themselves.

Most of the internal calls to filemap_fdatawait are left alone, so all
of the internal error checkers are using the same error handling they
always have. The only real difference here is that we're better
reporting errors at fsync.

I think that we probably will want to eventually convert all of those
internal callers to use errseq_t based reporting, but that can be done
in an incremental fashion in follow-on patchsets.

Testing:

I've primarily been testing this with some new xfstests that I will post
in a separate series. These tests use dm-error fault injection to make
the underlying block device start throwing I/O errors, and then test the
how the filesystem layer reports errors after that.

Jeff Layton (22):
  fs: remove call_fsync helper function
  buffer: use mapping_set_error instead of setting the flag
  fs: check for writeback errors after syncing out buffers in
generic_file_fsync
  buffer: set errors in mapping at the time that the error occurs
  jbd2: don't clear and reset errors after waiting on writeback
  mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: clean up error handling in write_one_page
  fs: always sync metadata in __generic_file_fsync
  lib: add errseq_t type and infrastructure for handling it
  fs: new infrastructure for writeback error handling and reporting
  mm: tracepoints for writeback error events
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  Documentation: flesh out the section in vfs.txt on storing and
reporting writeback errors
  dax: set errors in mapping when writeback fails
  block: convert to errseq_t based writeback error tracking
  ext4: use errseq_t based error handling for reporting data writeback
errors
  fs: add f_md_wb_err field to struct file for tracking metadata errors
  ext4: add more robust reporting of metadata writeback errors
  ext2: convert to errseq_t based writeback error tracking
  xfs: minimal conversion to errseq_t writeback error reporting
  btrfs: minimal

[PATCH v7 01/22] fs: remove call_fsync helper function

2017-06-16 Thread Jeff Layton

Requested-by: Christoph Hellwig <h...@infradead.org>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/sync.c  | 2 +-
 include/linux/fs.h | 6 --
 ipc/shm.c  | 2 +-
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/sync.c b/fs/sync.c
index 11ba023434b1..2a54c1f22035 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -192,7 +192,7 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t 
end, int datasync)
spin_unlock(>i_lock);
mark_inode_dirty_sync(inode);
}
-   return call_fsync(file, start, end, datasync);
+   return file->f_op->fsync(file, start, end, datasync);
 }
 EXPORT_SYMBOL(vfs_fsync_range);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4929a8f28cc3..1a135274b4f8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1740,12 +1740,6 @@ static inline int call_mmap(struct file *file, struct 
vm_area_struct *vma)
return file->f_op->mmap(file, vma);
 }
 
-static inline int call_fsync(struct file *file, loff_t start, loff_t end,
-int datasync)
-{
-   return file->f_op->fsync(file, start, end, datasync);
-}
-
 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
  unsigned long nr_segs, unsigned long fast_segs,
  struct iovec *fast_pointer,
diff --git a/ipc/shm.c b/ipc/shm.c
index ec5688e98f25..28a444861a8f 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -453,7 +453,7 @@ static int shm_fsync(struct file *file, loff_t start, 
loff_t end, int datasync)
 
if (!sfd->file->f_op->fsync)
return -EINVAL;
-   return call_fsync(sfd->file, start, end, datasync);
+   return sfd->file->f_op->fsync(sfd->file, start, end, datasync);
 }
 
 static long shm_fallocate(struct file *file, int mode, loff_t offset,
-- 
2.13.0

[PATCH v7 07/22] mm: don't TestClearPageError in __filemap_fdatawait_range

2017-06-16 Thread Jeff Layton

The -EIO returned here can end up overriding whatever error is marked in
the address space, and be returned at fsync time, even when there is a
more appropriate error stored in the mapping.

Read errors are also sometimes tracked on a per-page level using
PG_error. Suppose we have a read error on a page, and then that page is
subsequently dirtied by overwriting the whole page. Writeback doesn't
clear PG_error, so we can then end up successfully writing back that
page and still return -EIO on fsync.

Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
return from that is silently discarded. Any subsystem that is relying on
PG_error to report errors during fsync can easily lose writeback errors
due to this. All you need is a stray sync() call to wait for writeback
to complete and you've lost the error.

Since the handling of the PG_error flag is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there are
writeback errors. Change the TestClearPageError call to ClearPageError,
and make __filemap_fdatawait_range a void return function.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 mm/filemap.c | 20 +---
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c349a5d3a34b..21e65c6ef1a0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -388,17 +388,16 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
-   int ret = 0;
 
if (end_byte < start_byte)
-   goto out;
+   return;
 
pagevec_init(, 0);
while ((index <= end) &&
@@ -415,14 +414,11 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
continue;
 
wait_on_page_writeback(page);
-   if (TestClearPageError(page))
-   ret = -EIO;
+   ClearPageError(page);
}
pagevec_release();
cond_resched();
}
-out:
-   return ret;
 }
 
 /**
@@ -442,14 +438,8 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
 int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
 {
-   int ret, ret2;
-
-   ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-   ret2 = filemap_check_errors(mapping);
-   if (!ret)
-   ret = ret2;
-
-   return ret;
+   __filemap_fdatawait_range(mapping, start_byte, end_byte);
+   return filemap_check_errors(mapping);
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
-- 
2.13.0

[PATCH v7 14/22] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-16 Thread Jeff Layton

Let's try to make this extra clear for fs authors.

Cc: Jan Kara <j...@suse.cz>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 Documentation/filesystems/vfs.txt | 43 ---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f42b90687d40..f3702d5c6f2f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,42 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, they expect that error to be
+reported when fsync is called.  After an error has been reported on one
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure should call
+mapping_set_error to record the error in the address_space when it
+occurs.  Then, at the end of their fsync operation, they should call
+filemap_report_wb_err to ensure that the struct file's error cursor
+has advanced to the correct point in the stream of errors emitted by
+the backing device(s).
 
 struct address_space_operations
 ---
@@ -804,7 +839,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +923,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.13.0

[PATCH v7 11/22] fs: new infrastructure for writeback error handling and reporting

2017-06-16 Thread Jeff Layton

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 drivers/dax/device.c |  1 +
 fs/block_dev.c   |  1 +
 fs/file_table.c  |  1 +
 fs/open.c|  3 +++
 include/linux/fs.h   | 53 
 mm/filemap.c | 38 +
 6 files changed, 97 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 006e657dfcb9..12943d19bfc4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
inode->i_mapping = __dax_inode->i_mapping;
inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index bcd8e16a34e1..dc839f8f0ba5 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1746,6 +1746,7 @@ static int blkdev_open(struct inode * inode, struct file 
* filp)
return -ENOMEM;
 
filp->f_mapping = bdev->bd_inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 
return blkdev_get(bdev, filp->f_mode, filp);
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 954d510b765a..72e861a35a7f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
mode,
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
+   file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
if ((mode & FMODE_READ) &&
 likely(fop->read || fop->read_it

[PATCH v7 13/22] mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error

2017-06-16 Thread Jeff Layton

When a writeback error occurs, we want later callers to be able to pick
up that fact when they go to wait on that writeback to complete.
Traditionally, we've used AS_EIO/AS_ENOSPC flags to track that, but
that's problematic since only one "checker" will be informed when an
error occurs.

In later patches, we're going to want to convert many of these callers
to check for errors since a well-defined point in time. For now, ensure
that we can handle both sorts of checks by both setting errors in both
places when there is a writeback failure.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/pagemap.h | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 316a19f6b635..28acc94e0f81 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -28,14 +28,33 @@ enum mapping_flags {
AS_NO_WRITEBACK_TAGS = 5,
 };
 
+/**
+ * mapping_set_error - record a writeback error in the address_space
+ * @mapping - the mapping in which an error should be set
+ * @error - the error to set in the mapping
+ *
+ * When writeback fails in some way, we must record that error so that
+ * userspace can be informed when fsync and the like are called.  We endeavor
+ * to report errors on any file that was open at the time of the error.  Some
+ * internal callers also need to know when writeback errors have occurred.
+ *
+ * When a writeback error occurs, most filesystems will want to call
+ * mapping_set_error to record the error in the mapping so that it can be
+ * reported when the application calls fsync(2).
+ */
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-   if (unlikely(error)) {
-   if (error == -ENOSPC)
-   set_bit(AS_ENOSPC, >flags);
-   else
-   set_bit(AS_EIO, >flags);
-   }
+   if (likely(!error))
+   return;
+
+   /* Record in wb_err for checkers using errseq_t based tracking */
+   filemap_set_wb_err(mapping, error);
+
+   /* Record it in flags for now, for legacy callers */
+   if (error == -ENOSPC)
+   set_bit(AS_ENOSPC, >flags);
+   else
+   set_bit(AS_EIO, >flags);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
-- 
2.13.0

[xfstests PATCH v5 3/5] generic: add a writeback error handling test

2017-06-16 Thread Jeff Layton

I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, it has
the the kernel report them once on every file description that was
open at the time of the error.

This patch adds a test for this new behavior. Basically, open many fds
to the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back.

To do that, I'm adding a new tools/dmerror script that the C program
can use to load the error table from the script. It's also suitable for
setting up, frobbing and tearing down a dmerror device for by-hand testing.

For now, only ext2/3/4 and xfs are whitelisted on this test, since those
filesystems are included in the initial patchset. We can add to that as
we convert filesystems, and eventually make it a more general test.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 .gitignore |   1 +
 common/dmerror |  13 ++-
 doc/auxiliary-programs.txt |  16 
 src/Makefile   |   4 +-
 src/dmerror|  44 +
 src/fsync-err.c| 223 +
 tests/generic/999  |  84 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   1 +
 9 files changed, 382 insertions(+), 7 deletions(-)
 create mode 100755 src/dmerror
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/.gitignore b/.gitignore
index 39664b0a7f53..56e863b2c8dc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -72,6 +72,7 @@
 /src/fs_perms
 /src/fssum
 /src/fstest
+/src/fsync-err
 /src/fsync-tester
 /src/ftrunc
 /src/genhashnames
diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
index 21ef118596b6..bcab453c4335 100644
--- a/doc/auxiliary-programs.txt
+++ b/doc/auxiliary-programs.txt
@@ -16,6 +16,8 @@ note the dependency with:
 Contents:
 
  - af_unix -- Create an AF_UNIX socket
+ - dmerror -- fault injection block device control
+ - fsync-err   -- tests fsync error reporting after failed writeback
  - open_by_handle  -- open_by_handle_at syscall exercise
  - stat_test   -- statx syscall exercise
  - t_dir_type  -- print directory entries and their file type
@@ -30,6 +32,20 @@ af_unix
 
The af_unix program creates an AF_UNIX socket at the given location.
 
+dmerror
+
+   dmerror is a program for creating, destroying and controlling a
+   fault injection device. The device can be set up as initially
+   working and then flip to throwing errors for testing purposes.
+
+fsync-err
+
+   Specialized program for testing how the kernel reports errors that
+   occur during writeback. Works in conjunction with the dmerror script
+   in tools/ to write data to a device, and then force it to fail
+   writeback and test that errors are reported during fsync and cleared
+   afterward.
+
 open_by_handle
 
The open_by_handle program exercises the open_by_handle_at() system
diff --git a/src/Makefile b/src/Makefile
index 6b0e4b022485..2c1b898cebe1 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
-   t_mmap_cow_race t_mmap_fallocate
+   t_mmap_cow_race t_mmap_fallocate fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
@@ -86,7 +86,7 @@ LINKTEST = $(LTLINK) $@.c -o $@ $(CFLAGS) $(LDFLAGS)
 install: default $(addsuffix -install,$

[PATCH v7 22/22] btrfs: minimal conversion to errseq_t writeback error reporting on fsync

2017-06-16 Thread Jeff Layton

Just check and advance the errseq_t in the file before returning.
Internal callers of filemap_* functions are left as-is.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/btrfs/file.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0f102a1b851f..609226beea43 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2017,7 +2017,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_trans_handle *trans;
struct btrfs_log_ctx ctx;
-   int ret = 0;
+   int ret = 0, err;
bool full_sync = 0;
u64 len;
 
@@ -2036,7 +2036,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 */
ret = start_ordered_ops(inode, start, end);
if (ret)
-   return ret;
+   goto out;
 
inode_lock(inode);
atomic_inc(>log_batch);
@@ -2233,6 +2233,9 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
ret = btrfs_end_transaction(trans);
}
 out:
+   err = filemap_report_wb_err(file);
+   if (!ret)
+   ret = err;
return ret > 0 ? -EIO : ret;
 }
 
-- 
2.13.0

[xfstests PATCH v5 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-16 Thread Jeff Layton

For btrfs, we can test how it reports data writeback errors on fsync by
implementing a suggestion from Chris Mason:

Build a filesystem with 2 devices that stripes the data across
both devices, but mirrors metadata across both. Then, make one
of the devices fail and test what it does.

Cc: Chris Mason <c...@fb.com>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 tests/btrfs/999 | 94 +
 tests/btrfs/999.out |  3 ++
 tests/btrfs/group   |  1 +
 3 files changed, 98 insertions(+)
 create mode 100755 tests/btrfs/999
 create mode 100644 tests/btrfs/999.out

diff --git a/tests/btrfs/999 b/tests/btrfs/999
new file mode 100755
index ..903e36526708
--- /dev/null
+++ b/tests/btrfs/999
@@ -0,0 +1,94 @@
+#! /bin/bash
+# FS QA Test No. 999
+#
+# Open a file several times, write to it, fsync on all fds and make sure that
+# they all return 0. Change the device to start throwing errors. Write again
+# on all fds and fsync on all fds. Ensure that we get errors on all of them.
+# Then fsync on all one last time and verify that all return 0.
+#
+#---
+# Copyright (c) 2017, Jeff Layton <jlay...@redhat.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_supported_fs btrfs
+_require_dm_target error
+_require_test_program fsync-err
+_require_test_program dmerror
+
+# bring up dmerror device
+_scratch_unmount
+_dmerror_init
+
+# Replace first device with error-test device
+old_SCRATCH_DEV=$SCRATCH_DEV
+SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
"s#$SCRATCH_DEV#$DMERROR_DEV#"`
+SCRATCH_DEV=$DMERROR_DEV
+
+_require_scratch
+_require_scratch_dev_pool
+
+rm -f $seqres.full
+
+echo "Format and mount"
+
+_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1
+_scratch_mount
+
+# How much do we need to write? We need to hit all of the stripes. btrfs uses
+# a fixed 64k stripesize, so write enough to hit each one
+number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w`
+write_kb=$(($number_of_devices * 64))
+_require_fs_space $SCRATCH_MNT $write_kb
+
+testfile=$SCRATCH_MNT/fsync-err-test
+
+SCRATCH_DEV=$old_SCRATCH_DEV
+$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile
+
+# success, all done
+_dmerror_load_working_table
+
+# fs may be corrupt after this -- attempt to repair it
+_repair_scratch_fs >> $seqres.full
+
+# remove dmerror device
+_dmerror_cleanup
+
+status=0
+exit
diff --git a/tests/btrfs/999.out b/tests/btrfs/999.out
new file mode 100644
index ..2e48492ff6d1
--- /dev/null
+++ b/tests/btrfs/999.out
@@ -0,0 +1,3 @@
+QA output created by 999
+Format and mount
+Test passed!
diff --git a/tests/btrfs/group b/tests/btrfs/group
index be0548796260..c9063f0a4087 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -147,3 +147,4 @@
 142 auto quick
 143 auto quick
 144 auto quick send
+999 auto quick
-- 
2.13.0

[xfstests PATCH v5 4/5] generic: test writeback error handling on dmerror devices

2017-06-16 Thread Jeff Layton

Ensure that we get an error back on all fds when a block device is
open by multiple writers and writeback fails.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 tests/generic/998 | 63 +++
 tests/generic/998.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 66 insertions(+)
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out

diff --git a/tests/generic/998 b/tests/generic/998
new file mode 100755
index ..ef528f18986e
--- /dev/null
+++ b/tests/generic/998
@@ -0,0 +1,63 @@
+#! /bin/bash
+# FS QA Test No. 998
+#
+# Test writeback error handling when writing to block devices via pagecache.
+# See src/fsync-err.c for details of what test actually does.
+#
+#---
+# Copyright (c) 2017, Jeff Layton <jlay...@redhat.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_require_scratch
+_require_dm_target error
+_require_test_program fsync-err
+_require_test_program dmerror
+
+rm -f $seqres.full
+
+_dmerror_init
+
+$here/src/fsync-err -d $here/src/dmerror $DMERROR_DEV
+
+# success, all done
+_dmerror_load_working_table
+_dmerror_cleanup
+_scratch_mkfs > $seqres.full 2>&1
+status=0
+exit
diff --git a/tests/generic/998.out b/tests/generic/998.out
new file mode 100644
index ..658c438820e2
--- /dev/null
+++ b/tests/generic/998.out
@@ -0,0 +1,2 @@
+QA output created by 998
+Test passed!
diff --git a/tests/generic/group b/tests/generic/group
index b56bae8f04f0..9c62ab13ad36 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -442,4 +442,5 @@
 437 auto quick
 438 auto
 439 auto quick punch
+998 blockdev
 999 auto quick
-- 
2.13.0

[PATCH v7 19/22] ext4: add more robust reporting of metadata writeback errors

2017-06-16 Thread Jeff Layton

ext4 uses the blockdev mapping for tracking metadata stored in the
pagecache. Sample its wb_err when opening a file and store that in
the f_md_wb_err field.

Change ext4_sync_file to check for data errors first, and then check the
blockdev mapping for metadata errors afterward.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors when there is a metadata writeback failure.
That's not ideal, but we can possibly improve upon it in the future.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/ext4/dir.c   |  8 ++--
 fs/ext4/file.c  |  5 -
 fs/ext4/fsync.c | 13 +
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 02ce7e7bbdf5..6e505269132c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct 
file * filp)
if (ret < 0)
return ret;
}
-   return dquot_file_open(inode, filp);
+   ret = dquot_file_open(inode, filp);
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 /*
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 03d6259d8662..8ea8ec517da5 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -161,10 +161,23 @@ int ext4_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
ret = err;
}
 out:
+   /*
+* Was there a metadata writeback error since last fsync?
+*
+* FIXME: ext4 tracks metadata with a whole-block device mapping. If
+* there is any sort of metadata writeback error, we'll report an
+* error on all open fds, even ones not associated with this inode.
+*/
+   err = filemap_report_md_wb_err(file,
+   inode->i_sb->s_bdev->bd_inode->i_mapping);
+   if (!ret)
+   ret = err;
+
/* Was there a writeback error of the data since last fsync? */
err = filemap_report_wb_err(file);
if (!ret)
ret = err;
+
trace_ext4_sync_file_exit(inode, ret);
return ret;
 }
-- 
2.13.0

[PATCH v7 15/22] dax: set errors in mapping when writeback fails

2017-06-16 Thread Jeff Layton

Jan Kara's description for this patch is much better than mine, so I'm
quoting it verbatim here:

DAX currently doesn't set errors in the mapping when cache flushing
fails in dax_writeback_mapping_range(). Since this function can get
called only from fsync(2) or sync(2), this is actually as good as it can
currently get since we correctly propagate the error up from
dax_writeback_mapping_range() to filemap_fdatawrite()

However, in the future better writeback error handling will enable us to
properly report these errors on fsync(2) even if there are multiple file
descriptors open against the file or if sync(2) gets called before
fsync(2). So convert DAX to using standard error reporting through the
mapping.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-and-Tested-by: Ross Zwisler <ross.zwis...@linux.intel.com>
---
 fs/dax.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9899f07acf72..c663e8cc2a76 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -856,8 +856,10 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 
ret = dax_writeback_one(bdev, dax_dev, mapping,
indices[i], pvec.pages[i]);
-   if (ret < 0)
+   if (ret < 0) {
+   mapping_set_error(mapping, ret);
goto out;
+   }
}
}
 out:
-- 
2.13.0

[PATCH v7 16/22] block: convert to errseq_t based writeback error tracking

2017-06-16 Thread Jeff Layton

This is a very minimal conversion to errseq_t based error tracking
for raw block device access.

Only real change that is strictly required is that we must
unconditionally call filemap_report_wb_err in blkdev_fsync.
That ensures that the file's errseq_t is always advanced to
the latest value in the mapping.

Note that there are internal callers that call sync_blockdev
and the like that are not affected by this. They'll continue
to use the AS_EIO/AS_ENOSPC flags for error reporting like
they always have for now.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/block_dev.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index dc839f8f0ba5..9e8e13b097ef 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -625,11 +625,11 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t 
end, int datasync)
 {
struct inode *bd_inode = bdev_file_inode(filp);
struct block_device *bdev = I_BDEV(bd_inode);
-   int error;
+   int error, wberr;

error = filemap_write_and_wait_range(filp->f_mapping, start, end);
if (error)
-   return error;
+   goto out;
 
/*
 * There is no need to serialise calls to blkdev_issue_flush with
@@ -640,6 +640,10 @@ int blkdev_fsync(struct file *filp, loff_t start, loff_t 
end, int datasync)
if (error == -EOPNOTSUPP)
error = 0;
 
+out:
+   wberr = filemap_report_wb_err(filp);
+   if (!error)
+   error = wberr;
return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);
-- 
2.13.0

[xfstests PATCH v5 2/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs

2017-06-16 Thread Jeff Layton

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 common/rc | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index 57001b47a8b7..43e160e91360 100644
--- a/common/rc
+++ b/common/rc
@@ -840,7 +840,16 @@ _scratch_mkfs()
mkfs_cmd="$MKFS_BTRFS_PROG"
mkfs_filter="cat"
;;
-   ext2|ext3)
+   ext3)
+   mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
+   mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
+
+   # put journal on separate device?
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $MKFS_OPTIONS $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
+   ;;
+   ext2)
mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
;;
-- 
2.13.0

[PATCH v7 18/22] fs: add f_md_wb_err field to struct file for tracking metadata errors

2017-06-16 Thread Jeff Layton

Some filesystems keep a different mapping for metadata writeback. Add a
second errseq_t to struct file for tracking metadata writeback errors.
Also add a new function for checking a mapping of the caller's choosing
vs. the f_md_wb_err value.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/fs.h |  3 +++
 include/trace/events/filemap.h | 23 ++-
 mm/filemap.c   | 40 +++-
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8980e5ce2063..8f6319981b8b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -872,6 +872,7 @@ struct file {
struct list_headf_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space*f_mapping;
+   errseq_tf_md_wb_err; /* optional metadata wb error 
tracking */
 } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2525,6 +2526,8 @@ extern int filemap_fdatawrite_range(struct address_space 
*mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern int __must_check filemap_report_md_wb_err(struct file *file,
+   struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 2af66920f267..6e0d78c01a2e 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -79,12 +79,11 @@ TRACE_EVENT(filemap_set_wb_err,
 );
 
 TRACE_EVENT(filemap_report_wb_err,
-   TP_PROTO(struct file *file, errseq_t old),
+   TP_PROTO(struct address_space *mapping, errseq_t old, errseq_t 
new),
 
-   TP_ARGS(file, old),
+   TP_ARGS(mapping, old, new),
 
TP_STRUCT__entry(
-   __field(struct file *, file);
__field(unsigned long, i_ino)
__field(dev_t, s_dev)
__field(errseq_t, old)
@@ -92,20 +91,18 @@ TRACE_EVENT(filemap_report_wb_err,
),
 
TP_fast_assign(
-   __entry->file = file;
-   __entry->i_ino = file->f_mapping->host->i_ino;
-   if (file->f_mapping->host->i_sb)
-   __entry->s_dev = 
file->f_mapping->host->i_sb->s_dev;
+   __entry->i_ino = mapping->host->i_ino;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
else
-   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->s_dev = mapping->host->i_rdev;
__entry->old = old;
-   __entry->new = file->f_wb_err;
+   __entry->new = new;
),
 
-   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
-   __entry->file, MAJOR(__entry->s_dev),
-   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
-   __entry->new)
+   TP_printk("dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->old, __entry->new)
 );
 #endif /* _TRACE_FILEMAP_H */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 879623032016..b0aef0a1ec46 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -584,27 +584,49 @@ EXPORT_SYMBOL(__filemap_set_wb_err);
  * value is protected by the f_lock since we must ensure that it reflects
  * the latest value swapped in for this file descriptor.
  */
-int filemap_report_wb_err(struct file *file)
+static int __filemap_report_wb_err(errseq_t *cursor, spinlock_t *lock,
+   struct address_space *mapping)
 {
int err = 0;
-   errseq_t old = READ_ONCE(file->f_wb_err);
-   struct address_space *mapping = file->f_mapping;
+   errseq_t old = READ_ONCE(*cursor);
 
/* Locklessly handle the common case where nothing has changed */
if (errseq_check(>wb_err, old)) {
/* Something changed, must use slow path */
-   spin_lock(>f_lock);
-   old = file->f_wb_err;
-   err = errseq_check_and_advance(>wb_err,
-   >f_wb_err);
-   trace_filemap_report_wb_err(file, old);
-   spin_unlock(>f_lock);
+   spin_lock(lock);
+

[xfstests PATCH v5 0/5] new tests for writeback error reporting behavior

2017-06-16 Thread Jeff Layton

The main changes in this set from the last are:

- add a btrfs/999.out file
- use _supported_fs to whitelist fs' on which the tests should run
- fix the ext3/4 mount option handling when creating journal device
- ensure that dmerror is installed on make install

These tests are intended to test the new writeback error reporting
introduced by this kernel patchset:

[PATCH v7 00/22] fs: enhanced writeback error reporting with errseq_t (pile 
#1)

It adds 3 new xfstests for testing kernel behavior when writeback from
the pagecache fails: one generic filesystem test, one test for raw block
devices and one test for btrfs.

The tests work with dmerror to make data writeback from the pagecache
fail, and then tests how the kernel reports errors afterward.

xfs, ext2/3/4 and btrfs all pass on a kernel with the patchset above. "Bare"
block devices also work correctly.

Jeff Layton (5):
  ext4: allow ext4 to use $SCRATCH_LOGDEV
  ext3: allow it to put journal on a separate device when doing
scratch_mkfs
  generic: add a writeback error handling test
  generic: test writeback error handling on dmerror devices
  btrfs: make a btrfs version of writeback error reporting test

 .gitignore |   1 +
 common/dmerror |  13 ++-
 common/rc  |  14 ++-
 doc/auxiliary-programs.txt |  16 
 src/Makefile   |   4 +-
 src/dmerror|  44 +
 src/fsync-err.c| 223 +
 tests/btrfs/999|  94 +++
 tests/btrfs/999.out|   3 +
 tests/btrfs/group  |   1 +
 tests/generic/998  |  63 +
 tests/generic/998.out  |   2 +
 tests/generic/999  |  84 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   2 +
 15 files changed, 559 insertions(+), 8 deletions(-)
 create mode 100755 src/dmerror
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/btrfs/999
 create mode 100644 tests/btrfs/999.out
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

-- 
2.13.0

[PATCH v7 21/22] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-16 Thread Jeff Layton

Just check and advance the data errseq_t in struct file before
before returning from fsync on normal files. Internal filemap_*
callers are left as-is.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/xfs/xfs_file.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a0958a14..bc3b1575e8db 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -134,7 +134,7 @@ xfs_file_fsync(
struct inode*inode = file->f_mapping->host;
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
-   int error = 0;
+   int error = 0, err2;
int log_flushed = 0;
xfs_lsn_t   lsn = 0;
 
@@ -142,10 +142,12 @@ xfs_file_fsync(
 
error = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (error)
-   return error;
+   goto out;
 
-   if (XFS_FORCED_SHUTDOWN(mp))
-   return -EIO;
+   if (XFS_FORCED_SHUTDOWN(mp)) {
+   error = -EIO;
+   goto out;
+   }
 
xfs_iflags_clear(ip, XFS_ITRUNCATED);
 
@@ -197,6 +199,11 @@ xfs_file_fsync(
mp->m_logdev_targp == mp->m_ddev_targp)
xfs_blkdev_issue_flush(mp->m_ddev_targp);
 
+out:
+   err2 = filemap_report_wb_err(file);
+   if (!error)
+   error = err2;
+
return error;
 }
 
-- 
2.13.0

[xfstests PATCH v5 1/5] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-06-16 Thread Jeff Layton

The writeback error handling test requires that you put the journal on a
separate device. This allows us to use dmerror to simulate data
writeback failure, without affecting the journal.

xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire
up the ext4 code so that it can do the same thing when _scratch_mkfs is
called.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 common/rc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/rc b/common/rc
index ff1b75c9cd25..57001b47a8b7 100644
--- a/common/rc
+++ b/common/rc
@@ -679,6 +679,9 @@ _scratch_mkfs_ext4()
local tmp=`mktemp`
local mkfs_status
 
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $MKFS_OPTIONS $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
 
_scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
1>$tmp.mkfsstd
mkfs_status=$?
-- 
2.13.0

[PATCH v7 17/22] ext4: use errseq_t based error handling for reporting data writeback errors

2017-06-16 Thread Jeff Layton

Add a call to filemap_report_wb_err at the end of ext4_sync_file. This
will ensure that we check and advance the errseq_t in the file, which
allows us to track and report errors on all open fds when they occur.

Note that metadata writeback errors are not yet reported on all fds at
this point. That will be added in a later patch.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/ext4/fsync.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 9d549608fd30..03d6259d8662 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -100,8 +100,10 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
tid_t commit_tid;
bool needs_barrier = false;
 
-   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb
-   return -EIO;
+   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb {
+   ret = -EIO;
+   goto out;
+   }
 
J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -126,7 +128,8 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
 
ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (ret)
-   return ret;
+   goto out;
+
/*
 * data=writeback,ordered:
 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -152,12 +155,16 @@ int ext4_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
needs_barrier = true;
ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
-   issue_flush:
+issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
}
 out:
+   /* Was there a writeback error of the data since last fsync? */
+   err = filemap_report_wb_err(file);
+   if (!ret)
+   ret = err;
trace_ext4_sync_file_exit(inode, ret);
return ret;
 }
-- 
2.13.0

[PATCH v7 20/22] ext2: convert to errseq_t based writeback error tracking

2017-06-16 Thread Jeff Layton

Set the flag to indicate that we want new-style data writeback error
handling.

This means that we need to override the open routines for files and
directories so that we can sample the bdev wb_err at open.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/ext2/dir.c  |  8 
 fs/ext2/file.c | 26 +-
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index e2709695b177..6e476c9929f8 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -713,6 +713,13 @@ int ext2_empty_dir (struct inode * inode)
return 0;
 }
 
+static int ext2_dir_open(struct inode *inode, struct file *file)
+{
+   /* Sample blockdev mapping errseq_t for metadata writeback */
+   file->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return 0;
+}
+
 const struct file_operations ext2_dir_operations = {
.llseek = generic_file_llseek,
.read   = generic_read_dir,
@@ -721,5 +728,6 @@ const struct file_operations ext2_dir_operations = {
 #ifdef CONFIG_COMPAT
.compat_ioctl   = ext2_compat_ioctl,
 #endif
+   .open   = ext2_dir_open,
.fsync  = ext2_fsync,
 };
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ed00e7ae0ef3..fd44539f6fce 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -172,17 +172,23 @@ static int ext2_release_file (struct inode * inode, 
struct file * filp)
 
 int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
-   int ret;
+   int ret, ret2;
struct super_block *sb = file->f_mapping->host->i_sb;
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO) {
+   if (ret == -EIO)
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
-   ret = -EIO;
-   }
+
+   ret2 = filemap_report_wb_err(file);
+   if (ret == 0)
+   ret = ret2;
+
+   ret2 = filemap_report_md_wb_err(file, mapping);
+   if (ret == 0)
+   ret = ret2;
return ret;
 }
 
@@ -204,6 +210,16 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
return generic_file_write_iter(iocb, from);
 }
 
+static int ext2_file_open(struct inode *inode, struct file *file)
+{
+   int ret;
+
+   ret = dquot_file_open(inode, file);
+   if (likely(ret == 0))
+   file->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
+}
+
 const struct file_operations ext2_file_operations = {
.llseek = generic_file_llseek,
.read_iter  = ext2_file_read_iter,
@@ -213,7 +229,7 @@ const struct file_operations ext2_file_operations = {
.compat_ioctl   = ext2_compat_ioctl,
 #endif
.mmap   = ext2_file_mmap,
-   .open   = dquot_file_open,
+   .open   = ext2_file_open,
.release= ext2_release_file,
.fsync  = ext2_fsync,
.get_unmapped_area = thp_get_unmapped_area,
-- 
2.13.0

[PATCH v7 12/22] mm: tracepoints for writeback error events

2017-06-16 Thread Jeff Layton

To enable that, make __errseq_set return the value that it was set to
when we exit the loop. Take heed that that value is not suitable as a
later "since" value, as it will not have been marked seen.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/errseq.h |  2 +-
 include/linux/fs.h |  5 +++-
 include/trace/events/filemap.h | 55 ++
 lib/errseq.c   | 20 ++-
 mm/filemap.c   | 13 +-
 5 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index 0d2555f310cd..9e0d444ac88d 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -5,7 +5,7 @@
 
 typedef u32errseq_t;
 
-void __errseq_set(errseq_t *eseq, int err);
+errseq_t __errseq_set(errseq_t *eseq, int err);
 static inline void errseq_set(errseq_t *eseq, int err)
 {
/* Optimize for the common case of no error */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 09d511771b25..8980e5ce2063 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2525,6 +2525,7 @@ extern int filemap_fdatawrite_range(struct address_space 
*mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
@@ -2544,7 +2545,9 @@ extern int __must_check filemap_report_wb_err(struct file 
*file);
  */
 static inline void filemap_set_wb_err(struct address_space *mapping, int err)
 {
-   errseq_set(>wb_err, err);
+   /* Fastpath for common case of no error */
+   if (unlikely(err))
+   __filemap_set_wb_err(mapping, err);
 }
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 42febb6bc1d5..2af66920f267 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 
@@ -52,6 +53,60 @@ DEFINE_EVENT(mm_filemap_op_page_cache, 
mm_filemap_add_to_page_cache,
TP_ARGS(page)
);
 
+TRACE_EVENT(filemap_set_wb_err,
+   TP_PROTO(struct address_space *mapping, errseq_t eseq),
+
+   TP_ARGS(mapping, eseq),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, errseq)
+   ),
+
+   TP_fast_assign(
+   __entry->i_ino = mapping->host->i_ino;
+   __entry->errseq = eseq;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = mapping->host->i_rdev;
+   ),
+
+   TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->errseq)
+);
+
+TRACE_EVENT(filemap_report_wb_err,
+   TP_PROTO(struct file *file, errseq_t old),
+
+   TP_ARGS(file, old),
+
+   TP_STRUCT__entry(
+   __field(struct file *, file);
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, old)
+   __field(errseq_t, new)
+   ),
+
+   TP_fast_assign(
+   __entry->file = file;
+   __entry->i_ino = file->f_mapping->host->i_ino;
+   if (file->f_mapping->host->i_sb)
+   __entry->s_dev = 
file->f_mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->old = old;
+   __entry->new = file->f_wb_err;
+   ),
+
+   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   __entry->file, MAJOR(__entry->s_dev),
+   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
+   __entry->new)
+);
 #endif /* _TRACE_FILEMAP_H */
 
 /* This part must be outside protection */
diff --git a/lib/errseq.c b/lib/errseq.c
index d129c0611c1f..009972d3000c 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -52,10 +52,14 @@
  *
  * Most callers will want to use the errseq_set inline wrapper to efficiently
  * handle the common case where err is 0.
+ *
+ * We do return an errseq_t here, primarily for deb

[PATCH v7 08/22] mm: clean up error handling in write_one_page

2017-06-16 Thread Jeff Layton

Don't try to check PageError since that's potentially racy and not
necessarily going to be set after writepage errors out.

Instead, check the mapping for an error after writepage returns.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 mm/page-writeback.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 36c62fda96bc..64b75bd996a4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2371,14 +2371,13 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
  *
  * The page must be locked by the caller and will be unlocked upon return.
  *
- * write_one_page() returns a negative error code if I/O failed. Note that
- * the address_space is not marked for error. The caller must do this if
- * needed.
+ * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
+ * function returns.
  */
 int write_one_page(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   int ret = 0;
+   int ret = 0, ret2;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 1,
@@ -2391,15 +2390,15 @@ int write_one_page(struct page *page)
if (clear_page_dirty_for_io(page)) {
get_page(page);
ret = mapping->a_ops->writepage(page, );
-   if (ret == 0) {
+   if (ret == 0)
wait_on_page_writeback(page);
-   if (PageError(page))
-   ret = -EIO;
-   }
put_page(page);
} else {
unlock_page(page);
}
+
+   if (!ret)
+   ret = filemap_check_errors(mapping);
return ret;
 }
 EXPORT_SYMBOL(write_one_page);
-- 
2.13.0

[PATCH v7 06/22] mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails

2017-06-16 Thread Jeff Layton

filemap_write_and_wait{_range} will return an error if writeback
initiation fails, but won't clear errors in the address_space. This is
particularly problematic on DAX, as it's effectively synchronous. Ensure
that we clear the AS_EIO/AS_ENOSPC flags when filemap_fdatawrite returns
an error.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 mm/filemap.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 37f286df7c95..c349a5d3a34b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -517,6 +517,9 @@ int filemap_write_and_wait(struct address_space *mapping)
int err2 = filemap_fdatawait(mapping);
if (!err)
err = err2;
+   } else {
+   /* Clear any previously stored errors */
+   filemap_check_errors(mapping);
}
} else {
err = filemap_check_errors(mapping);
@@ -551,6 +554,9 @@ int filemap_write_and_wait_range(struct address_space 
*mapping,
lstart, lend);
if (!err)
err = err2;
+   } else {
+   /* Clear any previously stored errors */
+   filemap_check_errors(mapping);
}
} else {
err = filemap_check_errors(mapping);
-- 
2.13.0

[PATCH v7 05/22] jbd2: don't clear and reset errors after waiting on writeback

2017-06-16 Thread Jeff Layton

Resetting this flag is almost certainly racy, and will be problematic
with some coming changes.

Make filemap_fdatawait_keep_errors return int, but not clear the flag(s).
Have jbd2 call it instead of filemap_fdatawait and don't attempt to
re-set the error flag if it fails.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/jbd2/commit.c   | 15 +++
 include/linux/fs.h |  2 +-
 mm/filemap.c   | 16 ++--
 3 files changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b6b194ec1b4f..502110540598 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -263,18 +263,9 @@ static int journal_finish_inode_data_buffers(journal_t 
*journal,
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(>j_list_lock);
-   err = filemap_fdatawait(jinode->i_vfs_inode->i_mapping);
-   if (err) {
-   /*
-* Because AS_EIO is cleared by
-* filemap_fdatawait_range(), set it again so
-* that user process can get -EIO from fsync().
-*/
-   mapping_set_error(jinode->i_vfs_inode->i_mapping, -EIO);
-
-   if (!ret)
-   ret = err;
-   }
+   err = 
filemap_fdatawait_keep_errors(jinode->i_vfs_inode->i_mapping);
+   if (!ret)
+   ret = err;
spin_lock(>j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
smp_mb();
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a135274b4f8..1b1233a1db1e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2509,7 +2509,7 @@ extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
-extern void filemap_fdatawait_keep_errors(struct address_space *);
+extern int filemap_fdatawait_keep_errors(struct address_space *);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
diff --git a/mm/filemap.c b/mm/filemap.c
index b9e870600572..37f286df7c95 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -311,6 +311,16 @@ int filemap_check_errors(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_check_errors);
 
+static int filemap_check_and_keep_errors(struct address_space *mapping)
+{
+   /* Check for outstanding write errors */
+   if (test_bit(AS_EIO, >flags))
+   return -EIO;
+   if (test_bit(AS_ENOSPC, >flags))
+   return -ENOSPC;
+   return 0;
+}
+
 /**
  * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
  * @mapping:   address space structure to write
@@ -455,15 +465,17 @@ EXPORT_SYMBOL(filemap_fdatawait_range);
  * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
  * fsfreeze(8)
  */
-void filemap_fdatawait_keep_errors(struct address_space *mapping)
+int filemap_fdatawait_keep_errors(struct address_space *mapping)
 {
loff_t i_size = i_size_read(mapping->host);
 
if (i_size == 0)
-   return;
+   return 0;
 
__filemap_fdatawait_range(mapping, 0, i_size - 1);
+   return filemap_check_and_keep_errors(mapping);
 }
+EXPORT_SYMBOL(filemap_fdatawait_keep_errors);
 
 /**
  * filemap_fdatawait - wait for all under-writeback pages to complete
-- 
2.13.0

[PATCH v7 02/22] buffer: use mapping_set_error instead of setting the flag

2017-06-16 Thread Jeff Layton

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Matthew Wilcox <mawil...@microsoft.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 44172d11efae..7b4f4bfde91e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -482,7 +482,7 @@ static void __remove_assoc_queue(struct buffer_head *bh)
list_del_init(>b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
if (buffer_write_io_error(bh))
-   set_bit(AS_EIO, >b_assoc_map->flags);
+   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
-- 
2.13.0

[PATCH v7 03/22] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-06-16 Thread Jeff Layton

ext2 currently does a test+clear of the AS_EIO flag, which is
is problematic for some coming changes.

What we really need to do instead is call filemap_check_errors
in __generic_file_fsync after syncing out the buffers. That
will be sufficient for this case, and help other callers detect
these errors properly as well.

With that, we don't need to twiddle it in ext2.

Suggested-by: Jan Kara <j...@suse.cz>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Matthew Wilcox <mawil...@microsoft.com>
---
 fs/ext2/file.c | 2 +-
 fs/libfs.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b21891a6bfca..ed00e7ae0ef3 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, 
int datasync)
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO || test_and_clear_bit(AS_EIO, >flags)) {
+   if (ret == -EIO) {
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
diff --git a/fs/libfs.c b/fs/libfs.c
index a04395334bb1..1dec90819366 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, 
loff_t end,
 
 out:
inode_unlock(inode);
-   return ret;
+   err = filemap_check_errors(inode->i_mapping);
+   return ret ? ret : err;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.13.0

[PATCH v7 04/22] buffer: set errors in mapping at the time that the error occurs

2017-06-16 Thread Jeff Layton

I noticed on xfs that I could still sometimes get back an error on fsync
on a fd that was opened after the error condition had been cleared.

The problem is that the buffer code sets the write_io_error flag and
then later checks that flag to set the error in the mapping. That flag
perisists for quite a while however. If the file is later opened with
O_TRUNC, the buffers will then be invalidated and the mapping's error
set such that a subsequent fsync will return error. I think this is
incorrect, as there was no writeback between the open and fsync.

Add a new mark_buffer_write_io_error operation that sets the flag and
the error in the mapping at the same time. Replace all calls to
set_buffer_write_io_error with mark_buffer_write_io_error, and remove
the places that check this flag in order to set the error in the
mapping.

This sets the error in the mapping earlier, at the time that it's first
detected.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/buffer.c | 20 +---
 fs/gfs2/lops.c  |  2 +-
 include/linux/buffer_head.h |  1 +
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7b4f4bfde91e..4d5d03b42e11 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -178,7 +178,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int 
uptodate)
set_buffer_uptodate(bh);
} else {
buffer_io_error(bh, ", lost sync page write");
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -352,8 +352,7 @@ void end_buffer_async_write(struct buffer_head *bh, int 
uptodate)
set_buffer_uptodate(bh);
} else {
buffer_io_error(bh, ", lost async page write");
-   mapping_set_error(page->mapping, -EIO);
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
}
@@ -481,8 +480,6 @@ static void __remove_assoc_queue(struct buffer_head *bh)
 {
list_del_init(>b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
-   if (buffer_write_io_error(bh))
-   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
@@ -1181,6 +1178,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
+void mark_buffer_write_io_error(struct buffer_head *bh)
+{
+   set_buffer_write_io_error(bh);
+   /* FIXME: do we need to set this in both places? */
+   if (bh->b_page && bh->b_page->mapping)
+   mapping_set_error(bh->b_page->mapping, -EIO);
+   if (bh->b_assoc_map)
+   mapping_set_error(bh->b_assoc_map, -EIO);
+}
+EXPORT_SYMBOL(mark_buffer_write_io_error);
+
 /*
  * Decrement a buffer_head's reference count.  If all buffers against a page
  * have zero reference count, are clean and unlocked, and if the page is clean
@@ -3266,8 +3274,6 @@ drop_buffers(struct page *page, struct buffer_head 
**buffers_to_free)
 
bh = head;
do {
-   if (buffer_write_io_error(bh) && page->mapping)
-   mapping_set_error(page->mapping, -EIO);
if (buffer_busy(bh))
goto failed;
bh = bh->b_this_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 885d36e7a29f..1a9c2c08c1a1 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -182,7 +182,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, 
struct bio_vec *bvec,
bh = bh->b_this_page;
do {
if (error)
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
unlock_buffer(bh);
next = bh->b_this_page;
size -= bh->b_size;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index bd029e52ef5e..e0abeba3ced7 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -149,6 +149,7 @@ void buffer_check_dirty_writeback(struct page *page,
  */
 
 void mark_buffer_dirty(struct buffer_head *bh);
+void mark_buffer_write_io_error(struct buffer_head *bh);
 void init_buffer(struct buffer_head *, bh_end_io_t *, void *);
 void touch_buffer(struct buffer_head *bh);
 void set_bh_page(struct buffer_head *bh,
-- 
2.13.0

Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-15 Thread Jeff Layton

On Thu, 2017-06-15 at 07:57 -0700, Christoph Hellwig wrote:
> On Thu, Jun 15, 2017 at 06:42:12AM -0400, Jeff Layton wrote:
> > Correct.
> > 
> > But if there is a data writeback error, should we report an error on all
> > open fds at that time (like we will for fsync)?
> 
> We should in theory, but I don't see how to properly do it.  In addition
> sync_file_range just can't be used for data integrity to start with, so
> I don't think it's worth it.  At some point we should add a proper
> fsync_range syscall, though.

filemap_report_wb_err will always return 0 if the inode never has
mapping_set_error called on it. So, I think we should be able to do it
there once we get all of the fs' converted over. That'll have to happen
at the end of the series however.

-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-15 Thread Jeff Layton

On Thu, 2017-06-15 at 01:22 -0700, Christoph Hellwig wrote:
> On Wed, Jun 14, 2017 at 01:24:43PM -0400, Jeff Layton wrote:
> > In this smaller set, it's only really used for DAX.
> 
> DAX only is implemented by three filesystems, please just fix them
> up in one go.
> 

Ok.

> > sync_file_range: ->fsync isn't called directly there, and I think we
> > probably want similar semantics to fsync() for it
> 
> sync_file_range is only supposed to sync data, so it should not call
> ->fsync.
> 

Correct.

But if there is a data writeback error, should we report an error on all
open fds at that time (like we will for fsync)?

I think we probably do want to do that, but like you say...there is no
file op for sync_file_range. It'll need some way to figure out what sort
of error tracking is in play.

> > JBD2: will try to re-set the error after clearing it with
> > filemap_fdatawait. That's problematic with the new infrastructure so we
> > need some way to avoid it.
> 
> JBD2 only has two users, please fix them up in one go.

I came up with a fix yesterday that makes the flag unnecessary there.

Thanks,
-- 
Jeff Layton <jlay...@redhat.com>

Re: [xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-14 Thread Jeff Layton

On Tue, 2017-06-13 at 16:40 +0800, Eryu Guan wrote:
> On Mon, Jun 12, 2017 at 08:42:13AM -0400, Jeff Layton wrote:
> > Make a new btrfs/999 test that works the way Chris Mason suggested:
> > 
> > Build a filesystem with 2 devices that stripes the data across
> > both devices, but mirrors metadata across both. Then, make one
> > of the devices fail and see how fsync is handled.
> > 
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > ---
> >  tests/btrfs/999   | 93 
> > +++
> 
> Missing btrfs/999.out file
> 
> >  tests/btrfs/group |  1 +
> >  2 files changed, 94 insertions(+)
> >  create mode 100755 tests/btrfs/999
> > 
> > diff --git a/tests/btrfs/999 b/tests/btrfs/999
> > new file mode 100755
> > index ..84031cc0d913
> > --- /dev/null
> > +++ b/tests/btrfs/999
> > @@ -0,0 +1,93 @@
> > +#! /bin/bash
> > +# FS QA Test No. 999
> > +#
> > +# Open a file several times, write to it, fsync on all fds and make sure 
> > that
> > +# they all return 0. Change the device to start throwing errors. Write 
> > again
> > +# on all fds and fsync on all fds. Ensure that we get errors on all of 
> > them.
> > +# Then fsync on all one last time and verify that all return 0.
> > +#
> > +#---
> > +# Copyright (c) 2017, Jeff Layton <jlay...@redhat.com>
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#---
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +rm -rf $tmp.* $testdir
> > +_dmerror_cleanup
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/dmerror
> > +
> > +# real QA test starts here
> > +_supported_os Linux
> > +_require_dm_target error
> > +_require_test_program fsync-err
> > +_require_test_program dmerror
> > +
> > +# bring up dmerror device
> > +_scratch_unmount
> > +_dmerror_init
> > +
> > +# Replace first device with error-test device
> > +old_SCRATCH_DEV=$SCRATCH_DEV
> > +SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
> > "s#$SCRATCH_DEV#$DMERROR_DEV#"`
> > +SCRATCH_DEV=$DMERROR_DEV
> > +
> > +_require_scratch
> > +_require_scratch_dev_pool
> 
> Need "_require_scratch_dev_pool_equal_size" too, since test creates
> raid1 profile for metadata.
> 
> Thanks,
> Eryu
> 

Is this really needed?

I've been running this test on btrfs with devices that are not of equal
size, and it seems to work just fine. The test doesn't write a lot of
data (just a few megs at most), so I don't think we'll run out of space
unless you have some really small devices in there.

> > +
> > +rm -f $seqres.full
> > +
> > +echo "Format and mount"
> > +
> > +_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1
> > +_scratch_mount
> > +
> > +# How much do we need to write? We need to hit all of the stripes. btrfs 
> > uses
> > +# a fixed 64k stripesize, so write enough to hit each one
> > +number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w`
> > +write_kb=$(($number_of_devices * 64))
> > +_require_fs_space $SCRATCH_MNT $write_kb
> > +
> > +testfile=$SCRATCH_MNT/fsync-err-test
> > +
> > +SCRATCH_DEV=$old_SCRATCH_DEV
> > +$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile
> > +
> > +# success, all done
> > +_dmerror_load_working_table
> > +
> > +# fs may be corrupt after this -- attempt to repair it
> > +_repair_scratch_fs >> $seqres.full
> > +
> > +# remove dmerror device
> > +_dmerror_cleanup
> > +
> > +status=0
> > +exit
> > diff --git a/tests/btrfs/group b/tests/btrfs/group
> > index 6f19619e877c..8dbdfbfe29fd 100644
> > --- a/tests/btrfs/group
> > +++ b/tests/btrfs/group
> > @@ -145,3 +145,4 @@
> >  141 auto quick
> >  142 auto quick
> >  143 auto quick
> > +999 auto quick
> > -- 
> > 2.13.0
> > 

-- 
Jeff Layton <jlay...@redhat.com>

Re: [xfstests PATCH v4 0/5] new tests for writeback error reporting behavior

2017-06-13 Thread Jeff Layton

On Tue, 2017-06-13 at 16:32 +0800, Eryu Guan wrote:
> On Mon, Jun 12, 2017 at 08:42:08AM -0400, Jeff Layton wrote:
> > v4: respin set based on Eryu's comments
> > 
> > These tests are companion tests to the patchset I recently posted with
> > the cover letter:
> > 
> > [PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t 
> > (pile #1)
> > 
> > This set just adds 3 new xfstests to test writeback behavior. One generic
> > filesystem test, one test for raw block devices, and one test for btrfs.
> > The tests work with dmerror to ensure that writeback fails, and then
> > tests how the kernel reports errors afterward.
> > 
> > xfs, ext2/3/4 and btrfs all pass on a kernel with the patchset above.
> 
> xfs, ext3/4 passed generic/999, btrfs passed btrfs/999 (with some local
> modifications, see reply to btrfs/999), but ext2 (using ext4 driver) and
> btrfs failed generic/999 on my host. (See test log at the end of mail.)
> 
> In the ext2 case, this test requires an external log device to run, but
> ext2 has no journal at all, I wonder if we should _notrun on ext2.
> 

Technically it doesn't _require_ an external log device, it just won't
pass on ext3/4 and xfs without one. Not sure how to convey that in a
_require directive here.

If you remove _require_logdev, it should pass on ext2 under the ext4.ko
driver as well.

ext2.ko still has issues though as the block device ends up getting
flagged with errors twice.

> btrfs doesn't support external log device either, it should not run this
> generic test either.
> 

Agreed. We just want it to run btrfs/999

> I think _require_logdev() should be updated too, to do a check on
> $FSTYP, only allows filesystems that have external log device support to
> continue to run.
> 

Ok.

> > 
> > The one comment I couldn't really address from earlier review is that
> > we don't have a great way for xfstests to tell what sort of error
> > reporting behavior it should expect from the running kernel. That
> > makes it difficult to tell whether failure is expected during a given
> > run.
> > 
> > Maybe that's OK though and we should just let unconverted filesystems
> > fail this test?
> 
> If there's really no good way to tell if current fs supports this new
> behavior, I think this is fine, strictly speaking, it's not a new
> feature anyway.
> 
> Thanks,
> Eryu
> 
> P.S. ext2 and btrfs failure in generic/999 run (I renumbered it to
> generic/441)
> 
> [root@ibm-x3550m3-05 xfstests]# ./check generic/441 generic/442
> FSTYP -- ext2
> PLATFORM  -- Linux/x86_64 ibm-x3550m3-05 4.12.0-rc4.jlayton+
> MKFS_OPTIONS  -- /dev/sdc2
> MOUNT_OPTIONS -- -o acl,user_xattr -o context=system_u:object_r:root_t:s0 
> /dev/sdc2 /mnt/testarea/scratch
> 
> generic/441 4s ... - output mismatch (see 
> /root/xfstests/results//generic/441.out.bad)
> --- tests/generic/441.out   2017-06-13 15:52:09.928413126 +0800
> +++ /root/xfstests/results//generic/441.out.bad 2017-06-13 
> 16:21:41.414112226 +0800
> @@ -1,3 +1,3 @@
>  QA output created by 441
>  Format and mount
> -Test passed!
> +Third fsync on fd[0] failed: Input/output error
> ...
> (Run 'diff -u tests/generic/441.out 
> /root/xfstests/results//generic/441.out.bad'  to see the entire diff)
> 

That means that it returned EIO on fsync when the error should have been
cleared. I'll see if I can reproduce it on my setup.

> [root@ibm-x3550m3-05 xfstests]# ./check generic/441 generic/442
> FSTYP -- btrfs
> PLATFORM  -- Linux/x86_64 ibm-x3550m3-05 4.12.0-rc4.jlayton+
> MKFS_OPTIONS  -- /dev/sdc2
> MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/sdc2 
> /mnt/testarea/scratch
> 
> generic/441 4s ... - output mismatch (see 
> /root/xfstests/results//generic/441.out.bad)
> --- tests/generic/441.out   2017-06-13 15:52:09.928413126 +0800
> +++ /root/xfstests/results//generic/441.out.bad 2017-06-13 
> 16:25:17.483273992 +0800
> @@ -1,3 +1,3 @@
>  QA output created by 441
>  Format and mount
> -Test passed!
> +Third fsync on fd[0] failed: Read-only file system
> ...
> (Run 'diff -u tests/generic/441.out 
> /root/xfstests/results//generic/441.out.bad'  to see the entire diff)

That's expected on btrfs. It needs btrfs/999 test to do this correctly.

> > 
> > Jeff Layton (5):
> >   generic: add a writeback error handling test
> >   ext4: allow ext4 to use $SCRATCH_LOGDEV
> >   generic: test writeback error handling on dmerror devices
> >   ext3: allow it to put journal on a separate device when doing
> >

Re: [PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-13 Thread Jeff Layton

On Mon, 2017-06-12 at 21:30 -0700, Darrick J. Wong wrote:
> On Mon, Jun 12, 2017 at 08:23:15AM -0400, Jeff Layton wrote:
> > Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t
> > based error reporting. Internal filemap_* calls are left as-is for now.
> > 
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > ---
> >  fs/xfs/xfs_super.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 455a575f101d..28d3be187025 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
> > .name   = "xfs",
> > .mount  = xfs_fs_mount,
> > .kill_sb= kill_block_super,
> > -   .fs_flags   = FS_REQUIRES_DEV,
> > +   .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,
> 
> Huh?  Why are there two patches with the same subject line?  And this
> same bit of code too?  Or ... 11/13, 11/20?  What's going on here?
> 
> 
> 
> --D

Oh my -- sorry about that. I ended up with two different interleaved
patchsets. The /20 series is the one I meant to send.

Just ignore these for now though, as I'll be sending a v7 (at least) to
address HCH's comments.
-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-13 Thread Jeff Layton

On Mon, 2017-06-12 at 05:45 -0700, Christoph Hellwig wrote:
> On Mon, Jun 12, 2017 at 08:23:06AM -0400, Jeff Layton wrote:
> > Add a new FS_WB_ERRSEQ flag to the fstype. Later patches will set and
> > key off of that to decide what behavior should be used.
> 
> Please invert this so that only file systems that keep the old semantics
> need a flag.

That's definitely what I want for the endgame here. My plan was to add
this flag for now, and then eventually reverse it (or drop it) once all
or most filesystems are converted.

We can do it that way from the get-go if you like. It'll mean tossing in
 a patch add this flag to all filesystems that have an fsync operation
and that use the pagecache, and then gradually remove it from them as we
convert them.

Which method do you prefer?
-- 
Jeff Layton <jlay...@redhat.com>

[xfstests PATCH v4 5/5] btrfs: make a btrfs version of writeback error reporting test

2017-06-12 Thread Jeff Layton

Make a new btrfs/999 test that works the way Chris Mason suggested:

Build a filesystem with 2 devices that stripes the data across
both devices, but mirrors metadata across both. Then, make one
of the devices fail and see how fsync is handled.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 tests/btrfs/999   | 93 +++
 tests/btrfs/group |  1 +
 2 files changed, 94 insertions(+)
 create mode 100755 tests/btrfs/999

diff --git a/tests/btrfs/999 b/tests/btrfs/999
new file mode 100755
index ..84031cc0d913
--- /dev/null
+++ b/tests/btrfs/999
@@ -0,0 +1,93 @@
+#! /bin/bash
+# FS QA Test No. 999
+#
+# Open a file several times, write to it, fsync on all fds and make sure that
+# they all return 0. Change the device to start throwing errors. Write again
+# on all fds and fsync on all fds. Ensure that we get errors on all of them.
+# Then fsync on all one last time and verify that all return 0.
+#
+#---
+# Copyright (c) 2017, Jeff Layton <jlay...@redhat.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_require_dm_target error
+_require_test_program fsync-err
+_require_test_program dmerror
+
+# bring up dmerror device
+_scratch_unmount
+_dmerror_init
+
+# Replace first device with error-test device
+old_SCRATCH_DEV=$SCRATCH_DEV
+SCRATCH_DEV_POOL=`echo $SCRATCH_DEV_POOL | perl -pe 
"s#$SCRATCH_DEV#$DMERROR_DEV#"`
+SCRATCH_DEV=$DMERROR_DEV
+
+_require_scratch
+_require_scratch_dev_pool
+
+rm -f $seqres.full
+
+echo "Format and mount"
+
+_scratch_pool_mkfs "-d raid0 -m raid1" > $seqres.full 2>&1
+_scratch_mount
+
+# How much do we need to write? We need to hit all of the stripes. btrfs uses
+# a fixed 64k stripesize, so write enough to hit each one
+number_of_devices=`echo $SCRATCH_DEV_POOL | wc -w`
+write_kb=$(($number_of_devices * 64))
+_require_fs_space $SCRATCH_MNT $write_kb
+
+testfile=$SCRATCH_MNT/fsync-err-test
+
+SCRATCH_DEV=$old_SCRATCH_DEV
+$here/src/fsync-err -b $(($write_kb * 1024)) -d $here/src/dmerror $testfile
+
+# success, all done
+_dmerror_load_working_table
+
+# fs may be corrupt after this -- attempt to repair it
+_repair_scratch_fs >> $seqres.full
+
+# remove dmerror device
+_dmerror_cleanup
+
+status=0
+exit
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 6f19619e877c..8dbdfbfe29fd 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -145,3 +145,4 @@
 141 auto quick
 142 auto quick
 143 auto quick
+999 auto quick
-- 
2.13.0

[xfstests PATCH v4 1/5] generic: add a writeback error handling test

2017-06-12 Thread Jeff Layton

I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, I aim
to make the kernel report them once on every file description.

This patch adds a test for the new behavior. Basically, open many fds
to the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back.

To do that, I'm adding a new tools/dmerror script that the C program
can use to load the error table. For now, that's all it can do, but
we can fill it out with other commands as necessary.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 .gitignore |   1 +
 common/dmerror |  13 ++-
 doc/auxiliary-programs.txt |  16 
 src/Makefile   |   2 +-
 src/dmerror|  44 +
 src/fsync-err.c| 223 +
 tests/generic/999  |  77 
 tests/generic/999.out  |   3 +
 tests/generic/group|   1 +
 9 files changed, 374 insertions(+), 6 deletions(-)
 create mode 100755 src/dmerror
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

diff --git a/.gitignore b/.gitignore
index 39664b0a7f53..56e863b2c8dc 100644
--- a/.gitignore
+++ b/.gitignore
@@ -72,6 +72,7 @@
 /src/fs_perms
 /src/fssum
 /src/fstest
+/src/fsync-err
 /src/fsync-tester
 /src/ftrunc
 /src/genhashnames
diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
index 21ef118596b6..bcab453c4335 100644
--- a/doc/auxiliary-programs.txt
+++ b/doc/auxiliary-programs.txt
@@ -16,6 +16,8 @@ note the dependency with:
 Contents:
 
  - af_unix -- Create an AF_UNIX socket
+ - dmerror -- fault injection block device control
+ - fsync-err   -- tests fsync error reporting after failed writeback
  - open_by_handle  -- open_by_handle_at syscall exercise
  - stat_test   -- statx syscall exercise
  - t_dir_type  -- print directory entries and their file type
@@ -30,6 +32,20 @@ af_unix
 
The af_unix program creates an AF_UNIX socket at the given location.
 
+dmerror
+
+   dmerror is a program for creating, destroying and controlling a
+   fault injection device. The device can be set up as initially
+   working and then flip to throwing errors for testing purposes.
+
+fsync-err
+
+   Specialized program for testing how the kernel reports errors that
+   occur during writeback. Works in conjunction with the dmerror script
+   in tools/ to write data to a device, and then force it to fail
+   writeback and test that errors are reported during fsync and cleared
+   afterward.
+
 open_by_handle
 
The open_by_handle program exercises the open_by_handle_at() system
diff --git a/src/Makefile b/src/Makefile
index 6b0e4b022485..abd0fff34a64 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
-   t_mmap_cow_race t_mmap_fallocate
+   t_mmap_cow_race t_mmap_fallocate fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/dmerror b/src/dmerror
new file mode 100755
index ..4aaf682ee5f9
--- /dev/null
+++ b/src/dmerror
@@ -0,0 +1,44 @@
+#!/bin/bash
+#-------
+# Copyright (c) 2017, Jeff Layton <jlay...@redhat.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under

[xfstests PATCH v4 2/5] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-06-12 Thread Jeff Layton

The writeback error handling test requires that you put the journal on a
separate device. This allows us to use dmerror to simulate data
writeback failure, without affecting the journal.

xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire
up the ext4 code so that it can do the same thing when _scratch_mkfs is
called.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 common/rc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/rc b/common/rc
index 87e6ff08b18d..08807ac7c22a 100644
--- a/common/rc
+++ b/common/rc
@@ -676,6 +676,9 @@ _scratch_mkfs_ext4()
local tmp=`mktemp`
local mkfs_status
 
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd $MKFS_OPTIONS -J device=$SCRATCH_LOGDEV"
 
_scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
1>$tmp.mkfsstd
mkfs_status=$?
-- 
2.13.0

[xfstests PATCH v4 3/5] generic: test writeback error handling on dmerror devices

2017-06-12 Thread Jeff Layton

Ensure that we get an error back on all fds when a block device is
open by multiple writers and writeback fails.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 tests/generic/998 | 64 +++
 tests/generic/998.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 67 insertions(+)
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out

diff --git a/tests/generic/998 b/tests/generic/998
new file mode 100755
index ..4e8379988252
--- /dev/null
+++ b/tests/generic/998
@@ -0,0 +1,64 @@
+#! /bin/bash
+# FS QA Test No. 998
+#
+# Test writeback error handling when writing to block devices via pagecache.
+# See src/fsync-err.c for details of what test actually does.
+#
+#---
+# Copyright (c) 2017, Jeff Layton <jlay...@redhat.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+cd /
+rm -rf $tmp.* $testdir
+_dmerror_cleanup
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmerror
+
+# real QA test starts here
+_supported_os Linux
+_require_scratch
+_require_logdev
+_require_dm_target error
+_require_test_program fsync-err
+_require_test_program dmerror
+
+rm -f $seqres.full
+
+_dmerror_init
+
+$here/src/fsync-err -d $here/src/dmerror $DMERROR_DEV
+
+# success, all done
+_dmerror_load_working_table
+_dmerror_cleanup
+_scratch_mkfs > $seqres.full 2>&1
+status=0
+exit
diff --git a/tests/generic/998.out b/tests/generic/998.out
new file mode 100644
index ..658c438820e2
--- /dev/null
+++ b/tests/generic/998.out
@@ -0,0 +1,2 @@
+QA output created by 998
+Test passed!
diff --git a/tests/generic/group b/tests/generic/group
index b56bae8f04f0..9c62ab13ad36 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -442,4 +442,5 @@
 437 auto quick
 438 auto
 439 auto quick punch
+998 blockdev
 999 auto quick
-- 
2.13.0

[xfstests PATCH v4 4/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs

2017-06-12 Thread Jeff Layton

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 common/rc | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index 08807ac7c22a..46b890cbff6a 100644
--- a/common/rc
+++ b/common/rc
@@ -832,7 +832,16 @@ _scratch_mkfs()
mkfs_cmd="$MKFS_BTRFS_PROG"
mkfs_filter="cat"
;;
-   ext2|ext3)
+   ext3)
+   mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
+   mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
+
+   # put journal on separate device?
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd $MKFS_OPTIONS -J device=$SCRATCH_LOGDEV"
+   ;;
+   ext2)
mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
;;
-- 
2.13.0

[xfstests PATCH v4 0/5] new tests for writeback error reporting behavior

2017-06-12 Thread Jeff Layton

v4: respin set based on Eryu's comments

These tests are companion tests to the patchset I recently posted with
the cover letter:

[PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t (pile 
#1)

This set just adds 3 new xfstests to test writeback behavior. One generic
filesystem test, one test for raw block devices, and one test for btrfs.
The tests work with dmerror to ensure that writeback fails, and then
tests how the kernel reports errors afterward.

xfs, ext2/3/4 and btrfs all pass on a kernel with the patchset above.

The one comment I couldn't really address from earlier review is that
we don't have a great way for xfstests to tell what sort of error
reporting behavior it should expect from the running kernel. That
makes it difficult to tell whether failure is expected during a given
run.

Maybe that's OK though and we should just let unconverted filesystems
fail this test?

Jeff Layton (5):
  generic: add a writeback error handling test
  ext4: allow ext4 to use $SCRATCH_LOGDEV
  generic: test writeback error handling on dmerror devices
  ext3: allow it to put journal on a separate device when doing
scratch_mkfs
  btrfs: make a btrfs version of writeback error reporting test

 .gitignore |   1 +
 common/dmerror |  13 ++-
 common/rc  |  14 ++-
 doc/auxiliary-programs.txt |  16 
 src/Makefile   |   2 +-
 src/dmerror|  44 +
 src/fsync-err.c| 223 +
 tests/btrfs/999|  93 +++
 tests/btrfs/group  |   1 +
 tests/generic/998  |  64 +
 tests/generic/998.out  |   2 +
 tests/generic/999  |  77 
 tests/generic/999.out  |   3 +
 tests/generic/group|   2 +
 14 files changed, 548 insertions(+), 7 deletions(-)
 create mode 100755 src/dmerror
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/btrfs/999
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out

-- 
2.13.0

[PATCH v6 10/20] mm: tracepoints for writeback error events

2017-06-12 Thread Jeff Layton

To enable that, make __errseq_set return the value that it was set to
when we exit the loop. Take heed that that value is not suitable as a
later "since" value, as it will not have been marked seen.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/errseq.h |  2 +-
 include/linux/fs.h |  5 +++-
 include/trace/events/filemap.h | 55 ++
 lib/errseq.c   | 20 ++-
 mm/filemap.c   | 13 +-
 5 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
index 0d2555f310cd..9e0d444ac88d 100644
--- a/include/linux/errseq.h
+++ b/include/linux/errseq.h
@@ -5,7 +5,7 @@
 
 typedef u32errseq_t;
 
-void __errseq_set(errseq_t *eseq, int err);
+errseq_t __errseq_set(errseq_t *eseq, int err);
 static inline void errseq_set(errseq_t *eseq, int err)
 {
/* Optimize for the common case of no error */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e57321b7ee2a..6cd87887430b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2530,6 +2530,7 @@ extern int filemap_fdatawrite_range(struct address_space 
*mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
  * filemap_set_wb_err - set a writeback error on an address_space
@@ -2549,7 +2550,9 @@ extern int __must_check filemap_report_wb_err(struct file 
*file);
  */
 static inline void filemap_set_wb_err(struct address_space *mapping, int err)
 {
-   errseq_set(>wb_err, err);
+   /* Fastpath for common case of no error */
+   if (unlikely(err))
+   __filemap_set_wb_err(mapping, err);
 }
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 42febb6bc1d5..2af66920f267 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
 
@@ -52,6 +53,60 @@ DEFINE_EVENT(mm_filemap_op_page_cache, 
mm_filemap_add_to_page_cache,
TP_ARGS(page)
);
 
+TRACE_EVENT(filemap_set_wb_err,
+   TP_PROTO(struct address_space *mapping, errseq_t eseq),
+
+   TP_ARGS(mapping, eseq),
+
+   TP_STRUCT__entry(
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, errseq)
+   ),
+
+   TP_fast_assign(
+   __entry->i_ino = mapping->host->i_ino;
+   __entry->errseq = eseq;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = mapping->host->i_rdev;
+   ),
+
+   TP_printk("dev=%d:%d ino=0x%lx errseq=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->errseq)
+);
+
+TRACE_EVENT(filemap_report_wb_err,
+   TP_PROTO(struct file *file, errseq_t old),
+
+   TP_ARGS(file, old),
+
+   TP_STRUCT__entry(
+   __field(struct file *, file);
+   __field(unsigned long, i_ino)
+   __field(dev_t, s_dev)
+   __field(errseq_t, old)
+   __field(errseq_t, new)
+   ),
+
+   TP_fast_assign(
+   __entry->file = file;
+   __entry->i_ino = file->f_mapping->host->i_ino;
+   if (file->f_mapping->host->i_sb)
+   __entry->s_dev = 
file->f_mapping->host->i_sb->s_dev;
+   else
+   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->old = old;
+   __entry->new = file->f_wb_err;
+   ),
+
+   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   __entry->file, MAJOR(__entry->s_dev),
+   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
+   __entry->new)
+);
 #endif /* _TRACE_FILEMAP_H */
 
 /* This part must be outside protection */
diff --git a/lib/errseq.c b/lib/errseq.c
index d129c0611c1f..009972d3000c 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -52,10 +52,14 @@
  *
  * Most callers will want to use the errseq_set inline wrapper to efficiently
  * handle the common case where err is 0.
+ *
+ * We do return an errseq_t here, primarily for deb

[PATCH v6 02/20] buffer: use mapping_set_error instead of setting the flag

2017-06-12 Thread Jeff Layton

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Matthew Wilcox <mawil...@microsoft.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 fs/buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 161be58c5cb0..4be8b914a222 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -482,7 +482,7 @@ static void __remove_assoc_queue(struct buffer_head *bh)
list_del_init(>b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
if (buffer_write_io_error(bh))
-   set_bit(AS_EIO, >b_assoc_map->flags);
+   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
-- 
2.13.0

[PATCH v6 05/20] mm: don't TestClearPageError in __filemap_fdatawait_range

2017-06-12 Thread Jeff Layton

The -EIO returned here can end up overriding whatever error is marked in
the address space, and be returned at fsync time, even when there is a
more appropriate error stored in the mapping.

Read errors are also sometimes tracked on a per-page level using
PG_error. Suppose we have a read error on a page, and then that page is
subsequently dirtied by overwriting the whole page. Writeback doesn't
clear PG_error, so we can then end up successfully writing back that
page and still return -EIO on fsync.

Worse yet, PG_error is cleared during a sync() syscall, but the -EIO
return from that is silently discarded. Any subsystem that is relying on
PG_error to report errors during fsync can easily lose writeback errors
due to this. All you need is a stray sync() call to wait for writeback
to complete and you've lost the error.

Since the handling of the PG_error flag is somewhat inconsistent across
subsystems, let's just rely on marking the address space when there are
writeback errors. Change the TestClearPageError call to ClearPageError,
and make __filemap_fdatawait_range a void return function.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 mm/filemap.c | 20 +---
 1 file changed, 5 insertions(+), 15 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6f1be573a5e6..1de71753de28 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -376,17 +376,16 @@ int filemap_flush(struct address_space *mapping)
 }
 EXPORT_SYMBOL(filemap_flush);
 
-static int __filemap_fdatawait_range(struct address_space *mapping,
+static void __filemap_fdatawait_range(struct address_space *mapping,
 loff_t start_byte, loff_t end_byte)
 {
pgoff_t index = start_byte >> PAGE_SHIFT;
pgoff_t end = end_byte >> PAGE_SHIFT;
struct pagevec pvec;
int nr_pages;
-   int ret = 0;
 
if (end_byte < start_byte)
-   goto out;
+   return;
 
pagevec_init(, 0);
while ((index <= end) &&
@@ -403,14 +402,11 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
continue;
 
wait_on_page_writeback(page);
-   if (TestClearPageError(page))
-   ret = -EIO;
+   ClearPageError(page);
}
pagevec_release();
cond_resched();
}
-out:
-   return ret;
 }
 
 /**
@@ -430,14 +426,8 @@ static int __filemap_fdatawait_range(struct address_space 
*mapping,
 int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
loff_t end_byte)
 {
-   int ret, ret2;
-
-   ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-   ret2 = filemap_check_errors(mapping);
-   if (!ret)
-   ret = ret2;
-
-   return ret;
+   __filemap_fdatawait_range(mapping, start_byte, end_byte);
+   return filemap_check_errors(mapping);
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
-- 
2.13.0

[PATCH v6 01/20] mm: fix mapping_set_error call in me_pagecache_dirty

2017-06-12 Thread Jeff Layton

The error code should be negative. Since this ends up in the default
case anyway, this is harmless, but it's less confusing to negate it.
Also, later patches will require a negative error code here.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Ross Zwisler <ross.zwis...@linux.intel.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Matthew Wilcox <mawil...@microsoft.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 342fac9ba89b..55bc61791fe1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -684,7 +684,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long 
pfn)
 * the first EIO, but we're not worse than other parts
 * of the kernel.
 */
-   mapping_set_error(mapping, EIO);
+   mapping_set_error(mapping, -EIO);
}
 
return me_pagecache_clean(p, pfn);
-- 
2.13.0

[PATCH v6 03/20] fs: check for writeback errors after syncing out buffers in generic_file_fsync

2017-06-12 Thread Jeff Layton

ext2 currently does a test+clear of the AS_EIO flag, which is
is problematic for some coming changes.

What we really need to do instead is call filemap_check_errors
in __generic_file_fsync after syncing out the buffers. That
will be sufficient for this case, and help other callers detect
these errors properly as well.

With that, we don't need to twiddle it in ext2.

Suggested-by: Jan Kara <j...@suse.cz>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Matthew Wilcox <mawil...@microsoft.com>
---
 fs/ext2/file.c | 2 +-
 fs/libfs.c | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b21891a6bfca..ed00e7ae0ef3 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -177,7 +177,7 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, 
int datasync)
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
 
ret = generic_file_fsync(file, start, end, datasync);
-   if (ret == -EIO || test_and_clear_bit(AS_EIO, >flags)) {
+   if (ret == -EIO) {
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
   "detected IO error when writing metadata buffers");
diff --git a/fs/libfs.c b/fs/libfs.c
index a04395334bb1..1dec90819366 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -991,7 +991,8 @@ int __generic_file_fsync(struct file *file, loff_t start, 
loff_t end,
 
 out:
inode_unlock(inode);
-   return ret;
+   err = filemap_check_errors(inode->i_mapping);
+   return ret ? ret : err;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.13.0

[PATCH v6 04/20] buffer: set errors in mapping at the time that the error occurs

2017-06-12 Thread Jeff Layton

I noticed on xfs that I could still sometimes get back an error on fsync
on a fd that was opened after the error condition had been cleared.

The problem is that the buffer code sets the write_io_error flag and
then later checks that flag to set the error in the mapping. That flag
perisists for quite a while however. If the file is later opened with
O_TRUNC, the buffers will then be invalidated and the mapping's error
set such that a subsequent fsync will return error. I think this is
incorrect, as there was no writeback between the open and fsync.

Add a new mark_buffer_write_io_error operation that sets the flag and
the error in the mapping at the same time. Replace all calls to
set_buffer_write_io_error with mark_buffer_write_io_error, and remove
the places that check this flag in order to set the error in the
mapping.

This sets the error in the mapping earlier, at the time that it's first
detected.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 fs/buffer.c | 20 +---
 fs/gfs2/lops.c  |  2 +-
 include/linux/buffer_head.h |  1 +
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4be8b914a222..b946149e8214 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -178,7 +178,7 @@ void end_buffer_write_sync(struct buffer_head *bh, int 
uptodate)
set_buffer_uptodate(bh);
} else {
buffer_io_error(bh, ", lost sync page write");
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
}
unlock_buffer(bh);
@@ -352,8 +352,7 @@ void end_buffer_async_write(struct buffer_head *bh, int 
uptodate)
set_buffer_uptodate(bh);
} else {
buffer_io_error(bh, ", lost async page write");
-   mapping_set_error(page->mapping, -EIO);
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
}
@@ -481,8 +480,6 @@ static void __remove_assoc_queue(struct buffer_head *bh)
 {
list_del_init(>b_assoc_buffers);
WARN_ON(!bh->b_assoc_map);
-   if (buffer_write_io_error(bh))
-   mapping_set_error(bh->b_assoc_map, -EIO);
bh->b_assoc_map = NULL;
 }
 
@@ -1181,6 +1178,17 @@ void mark_buffer_dirty(struct buffer_head *bh)
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
+void mark_buffer_write_io_error(struct buffer_head *bh)
+{
+   set_buffer_write_io_error(bh);
+   /* FIXME: do we need to set this in both places? */
+   if (bh->b_page && bh->b_page->mapping)
+   mapping_set_error(bh->b_page->mapping, -EIO);
+   if (bh->b_assoc_map)
+   mapping_set_error(bh->b_assoc_map, -EIO);
+}
+EXPORT_SYMBOL(mark_buffer_write_io_error);
+
 /*
  * Decrement a buffer_head's reference count.  If all buffers against a page
  * have zero reference count, are clean and unlocked, and if the page is clean
@@ -3279,8 +3287,6 @@ drop_buffers(struct page *page, struct buffer_head 
**buffers_to_free)
 
bh = head;
do {
-   if (buffer_write_io_error(bh) && page->mapping)
-   mapping_set_error(page->mapping, -EIO);
if (buffer_busy(bh))
goto failed;
bh = bh->b_this_page;
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index b1f9144b42c7..cd7857ab1a6a 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -182,7 +182,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp, 
struct bio_vec *bvec,
bh = bh->b_this_page;
do {
if (error)
-   set_buffer_write_io_error(bh);
+   mark_buffer_write_io_error(bh);
unlock_buffer(bh);
next = bh->b_this_page;
size -= bh->b_size;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index bd029e52ef5e..e0abeba3ced7 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -149,6 +149,7 @@ void buffer_check_dirty_writeback(struct page *page,
  */
 
 void mark_buffer_dirty(struct buffer_head *bh);
+void mark_buffer_write_io_error(struct buffer_head *bh);
 void init_buffer(struct buffer_head *, bh_end_io_t *, void *);
 void touch_buffer(struct buffer_head *bh);
 void set_bh_page(struct buffer_head *bh,
-- 
2.13.0

[PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-12 Thread Jeff Layton

Now that we have new infrastructure for handling writeback errors using
errseq_t, we need to convert the existing code to use it. We could
attempt to retrofit the old interfaces on top of the new, but there is
a conceptual disconnect here in the case of internal callers that
invoke filemap_fdatawait and the like.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can report the
latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

This means that we need to touch each filesystem that calls
filemap_check_errors in some fashion and ensure that we establish sane
"since" values for those callers. But...some code is shared between
filesystems and needs to be able to handle both error tracking schemes.

Add a new FS_WB_ERRSEQ flag to the fstype. Later patches will set and
key off of that to decide what behavior should be used.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/fs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6cd87887430b..17ba6284ab14 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2023,6 +2023,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA2
 #define FS_HAS_SUBTYPE 4
 #define FS_USERNS_MOUNT8   /* Can be mounted by userns 
root */
+#define FS_WB_ERRSEQ   16  /* errseq_t writeback err tracking */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
-- 
2.13.0

[PATCH v6 17/20] fs: add f_md_wb_err field to struct file for tracking metadata errors

2017-06-12 Thread Jeff Layton

Some filesystems keep a different mapping for metadata writeback. Add a
second errseq_t to struct file for tracking metadata writeback errors.
Also add a new function for checking a mapping of the caller's choosing
vs. the f_md_wb_err value.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/fs.h |  3 +++
 include/trace/events/filemap.h | 23 ++-
 mm/filemap.c   | 40 +++-
 3 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index ef3feeec80b2..e366835c93b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -871,6 +871,7 @@ struct file {
struct list_headf_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
struct address_space*f_mapping;
+   errseq_tf_md_wb_err; /* optional metadata wb error 
tracking */
 } __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
 
 struct file_handle {
@@ -2525,6 +2526,8 @@ extern int filemap_fdatawrite_range(struct address_space 
*mapping,
 extern int filemap_check_errors(struct address_space *mapping);
 
 extern int __must_check filemap_report_wb_err(struct file *file);
+extern int __must_check filemap_report_md_wb_err(struct file *file,
+   struct address_space *mapping);
 extern void __filemap_set_wb_err(struct address_space *mapping, int err);
 
 /**
diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 2af66920f267..6e0d78c01a2e 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -79,12 +79,11 @@ TRACE_EVENT(filemap_set_wb_err,
 );
 
 TRACE_EVENT(filemap_report_wb_err,
-   TP_PROTO(struct file *file, errseq_t old),
+   TP_PROTO(struct address_space *mapping, errseq_t old, errseq_t 
new),
 
-   TP_ARGS(file, old),
+   TP_ARGS(mapping, old, new),
 
TP_STRUCT__entry(
-   __field(struct file *, file);
__field(unsigned long, i_ino)
__field(dev_t, s_dev)
__field(errseq_t, old)
@@ -92,20 +91,18 @@ TRACE_EVENT(filemap_report_wb_err,
),
 
TP_fast_assign(
-   __entry->file = file;
-   __entry->i_ino = file->f_mapping->host->i_ino;
-   if (file->f_mapping->host->i_sb)
-   __entry->s_dev = 
file->f_mapping->host->i_sb->s_dev;
+   __entry->i_ino = mapping->host->i_ino;
+   if (mapping->host->i_sb)
+   __entry->s_dev = mapping->host->i_sb->s_dev;
else
-   __entry->s_dev = file->f_mapping->host->i_rdev;
+   __entry->s_dev = mapping->host->i_rdev;
__entry->old = old;
-   __entry->new = file->f_wb_err;
+   __entry->new = new;
),
 
-   TP_printk("file=%p dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
-   __entry->file, MAJOR(__entry->s_dev),
-   MINOR(__entry->s_dev), __entry->i_ino, __entry->old,
-   __entry->new)
+   TP_printk("dev=%d:%d ino=0x%lx old=0x%x new=0x%x",
+   MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
+   __entry->i_ino, __entry->old, __entry->new)
 );
 #endif /* _TRACE_FILEMAP_H */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index c5e19ea0bf12..ef0ff6b87759 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -564,27 +564,49 @@ EXPORT_SYMBOL(__filemap_set_wb_err);
  * value is protected by the f_lock since we must ensure that it reflects
  * the latest value swapped in for this file descriptor.
  */
-int filemap_report_wb_err(struct file *file)
+static int __filemap_report_wb_err(errseq_t *cursor, spinlock_t *lock,
+   struct address_space *mapping)
 {
int err = 0;
-   errseq_t old = READ_ONCE(file->f_wb_err);
-   struct address_space *mapping = file->f_mapping;
+   errseq_t old = READ_ONCE(*cursor);
 
/* Locklessly handle the common case where nothing has changed */
if (errseq_check(>wb_err, old)) {
/* Something changed, must use slow path */
-   spin_lock(>f_lock);
-   old = file->f_wb_err;
-   err = errseq_check_and_advance(>wb_err,
-   >f_wb_err);
-   trace_filemap_report_wb_err(file, old);
-   spin_unlock(>f_lock);
+   spin_lock(lock);
+

[PATCH v6 09/20] fs: new infrastructure for writeback error handling and reporting

2017-06-12 Thread Jeff Layton

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 drivers/dax/device.c |  1 +
 fs/block_dev.c   |  1 +
 fs/file_table.c  |  1 +
 fs/open.c|  3 +++
 include/linux/fs.h   | 53 
 mm/filemap.c | 38 +
 6 files changed, 97 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 006e657dfcb9..12943d19bfc4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
inode->i_mapping = __dax_inode->i_mapping;
inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..4d62fe771587 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct file 
* filp)
return -ENOMEM;
 
filp->f_mapping = bdev->bd_inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 
return blkdev_get(bdev, filp->f_mode, filp);
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 954d510b765a..72e861a35a7f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
mode,
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
+   file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
if ((mode & FMODE_READ) &&
 likely(fop->read || fop->read_iter))
mode |= FMOD

[PATCH v6 11/20] mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error

2017-06-12 Thread Jeff Layton

When a writeback error occurs, we want later callers to be able to pick
up that fact when they go to wait on that writeback to complete.
Traditionally, we've used AS_EIO/AS_ENOSPC flags to track that, but
that's problematic since only one "checker" will be informed when an
error occurs.

In later patches, we're going to want to convert many of these callers
to check for errors since a well-defined point in time. For now, ensure
that we can handle both sorts of checks by both setting errors in both
places when there is a writeback failure.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/pagemap.h | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 316a19f6b635..28acc94e0f81 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -28,14 +28,33 @@ enum mapping_flags {
AS_NO_WRITEBACK_TAGS = 5,
 };
 
+/**
+ * mapping_set_error - record a writeback error in the address_space
+ * @mapping - the mapping in which an error should be set
+ * @error - the error to set in the mapping
+ *
+ * When writeback fails in some way, we must record that error so that
+ * userspace can be informed when fsync and the like are called.  We endeavor
+ * to report errors on any file that was open at the time of the error.  Some
+ * internal callers also need to know when writeback errors have occurred.
+ *
+ * When a writeback error occurs, most filesystems will want to call
+ * mapping_set_error to record the error in the mapping so that it can be
+ * reported when the application calls fsync(2).
+ */
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-   if (unlikely(error)) {
-   if (error == -ENOSPC)
-   set_bit(AS_ENOSPC, >flags);
-   else
-   set_bit(AS_EIO, >flags);
-   }
+   if (likely(!error))
+   return;
+
+   /* Record in wb_err for checkers using errseq_t based tracking */
+   filemap_set_wb_err(mapping, error);
+
+   /* Record it in flags for now, for legacy callers */
+   if (error == -ENOSPC)
+   set_bit(AS_ENOSPC, >flags);
+   else
+   set_bit(AS_EIO, >flags);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
-- 
2.13.0

[PATCH v6 08/20] lib: add errseq_t type and infrastructure for handling it

2017-06-12 Thread Jeff Layton

An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: NeilBrown <ne...@suse.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 include/linux/errseq.h |  19 +
 lib/Makefile   |   2 +-
 lib/errseq.c   | 200 +
 3 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index ..0d2555f310cd
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+
+/* See lib/errseq.c for more info */
+
+typedef u32errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+   /* Optimize for the common case of no error */
+   if (unlikely(err))
+   __errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 0166fbc0fa81..519782d9ca3f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o 
random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o refcount.o usercopy.o
+once.o refcount.o usercopy.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index ..d129c0611c1f
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,200 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since a previous
+ * point where it was sampled.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value. That value
+ * can later be used to tell whether any new errors have occurred since that
+ * sampling was done.
+ *
+ * Note that there is a risk of collisions if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT   ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN(1 << ERRSEQ_SHIFT)
+
+/* The lowest bit of the counter */
+#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always

[PATCH v6 12/13] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-12 Thread Jeff Layton

Just check and advance the data errseq_t in struct file before
before returning from fsync on normal files. Internal filemap_*
callers are left as-is.

We also set the FS_WB_ERRSEQ flag just for completeness sake.
Not much is really using it at this point.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/xfs/xfs_file.c  | 15 +++
 fs/xfs/xfs_super.c |  2 +-
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 5fb5a0958a14..bc3b1575e8db 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -134,7 +134,7 @@ xfs_file_fsync(
struct inode*inode = file->f_mapping->host;
struct xfs_inode*ip = XFS_I(inode);
struct xfs_mount*mp = ip->i_mount;
-   int error = 0;
+   int error = 0, err2;
int log_flushed = 0;
xfs_lsn_t   lsn = 0;
 
@@ -142,10 +142,12 @@ xfs_file_fsync(
 
error = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (error)
-   return error;
+   goto out;
 
-   if (XFS_FORCED_SHUTDOWN(mp))
-   return -EIO;
+   if (XFS_FORCED_SHUTDOWN(mp)) {
+   error = -EIO;
+   goto out;
+   }
 
xfs_iflags_clear(ip, XFS_ITRUNCATED);
 
@@ -197,6 +199,11 @@ xfs_file_fsync(
mp->m_logdev_targp == mp->m_ddev_targp)
xfs_blkdev_issue_flush(mp->m_ddev_targp);
 
+out:
+   err2 = filemap_report_wb_err(file);
+   if (!error)
+   error = err2;
+
return error;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575f101d..28d3be187025 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
.name   = "xfs",
.mount  = xfs_fs_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.13.0

[PATCH v6 14/20] dax: set errors in mapping when writeback fails

2017-06-12 Thread Jeff Layton

Jan Kara's description for this patch is much better than mine, so I'm
quoting it verbatim here:

-8<-
DAX currently doesn't set errors in the mapping when cache flushing
fails in dax_writeback_mapping_range(). Since this function can get
called only from fsync(2) or sync(2), this is actually as good as it can
currently get since we correctly propagate the error up from
dax_writeback_mapping_range() to filemap_fdatawrite()

However, in the future better writeback error handling will enable us to
properly report these errors on fsync(2) even if there are multiple file
descriptors open against the file or if sync(2) gets called before
fsync(2). So convert DAX to using standard error reporting through the
mapping.
-8<-

For now, only do this when the FS_WB_ERRSEQ flag is set. The
AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when
writeback initiation fails, only when we discover an error after waiting
on writeback to complete, so we only want to do this with errseq_t based
error handling to prevent seeing duplicate errors on fsync.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-and-Tested-by: Ross Zwisler <ross.zwis...@linux.intel.com>
---
 fs/dax.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 2a6889b3585f..ba3b17eefcfc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -856,8 +856,24 @@ int dax_writeback_mapping_range(struct address_space 
*mapping,
 
ret = dax_writeback_one(bdev, dax_dev, mapping,
indices[i], pvec.pages[i]);
-   if (ret < 0)
+   if (ret < 0) {
+   /*
+* For fs' that use errseq_t based error
+* tracking, we must call mapping_set_error
+* here to ensure that fsync on all open fds
+* get back an error. Doing this with the old
+* wb error tracking infrastructure is
+* problematic though, as DAX writeback is
+* synchronous, and the error flags are not
+* cleared when initiation fails, only when
+* it fails after the write has been submitted
+* to the backing store.
+*/
+   if (mapping->host->i_sb->s_type->fs_flags &
+   FS_WB_ERRSEQ)
+   mapping_set_error(mapping, ret);
goto out;
+   }
}
}
 out:
-- 
2.13.0

[PATCH v6 16/20] block: convert to errseq_t based writeback error tracking

2017-06-12 Thread Jeff Layton

This is a very minimal conversion to errseq_t based error tracking
for raw block device access.

Only real change that is strictly required is that we must ensure that
filemap_report_wb_err is unconditionally called after fsync, which is
now done if FS_WB_ERRSEQ is set in fs_flags. That ensures that the
errseq_t in the file is advanced to the latest value in the mapping.

Note that there are internal callers that call sync_blockdev and the
like that are not affected by this. They'll continue to use the
AS_EIO/AS_ENOSPC flags for error reporting like they always have for
now.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/block_dev.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4d62fe771587..4bf865cc99de 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -801,6 +801,7 @@ static struct file_system_type bd_type = {
.name   = "bdev",
.mount  = bd_mount,
.kill_sb= kill_anon_super,
+   .fs_flags   = FS_WB_ERRSEQ,
 };
 
 struct super_block *blockdev_superblock __read_mostly;
-- 
2.13.0

[PATCH v6 13/20] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-12 Thread Jeff Layton

Let's try to make this extra clear for fs authors.

Also, although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara <j...@suse.cz>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 Documentation/filesystems/vfs.txt | 48 ---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f42b90687d40..0f6415c26385 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,47 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure need to do two things:
+
+1) call mapping_set_error to record the error in the address_space when
+one occurs.
+
+2) set FS_WB_ERRSEQ in the fs_flags field in the file_system_type to
+indicate to other subsystems that the filesystem wants to use errseq_t
+based error reporting for writeback.
+
+The flag may go away in the future or moved to an opt-out flag once
+the majority of filesystems are converted to use errseq_t based reporting.
 
 struct address_space_operations
 ---
@@ -804,7 +844,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +928,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.13.0

[PATCH v6 18/20] ext4: use errseq_t based error handling for reporting data writeback errors

2017-06-12 Thread Jeff Layton

Add the FS_WB_ERRSEQ flag to indicate to other subsystems that errseq_t
based error reporting for data writeback is in effect, and to opt-in to
reporting those errors in call_fsync.

ext4 uses the blockdev mapping for tracking metadata stored in the
pagecache. Sample its wb_err when opening a file and store that in
the f_md_wb_err field. Ensure that we check and advance the metadata
before returning in fsync.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors on this fs when there is a metadata writeback
failure.  That's not ideal, but we can possibly improve upon it in the
future.

ext4 also has several calls to filemap_fdatawait and
filemap_write_and_wait. Those will also be changed in a later patch to
versions that use errseq_t based reporting.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/ext4/dir.c   |  8 ++--
 fs/ext4/file.c  |  5 -
 fs/ext4/fsync.c | 23 +++
 fs/ext4/super.c |  6 +++---
 4 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 02ce7e7bbdf5..6e505269132c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct 
file * filp)
if (ret < 0)
return ret;
}
-   return dquot_file_open(inode, filp);
+   ret = dquot_file_open(inode, filp);
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 /*
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 9d549608fd30..09e28b7fd3f6 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -100,8 +100,10 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
tid_t commit_tid;
bool needs_barrier = false;
 
-   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb
-   return -EIO;
+   if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb {
+   ret = -EIO;
+   goto out;
+   }
 
J_ASSERT(ext4_journal_current_handle() == NULL);
 
@@ -126,7 +128,8 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
 
ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
if (ret)
-   return ret;
+   goto out;
+
/*
 * data=writeback,ordered:
 *  The caller's filemap_fdatawrite()/wait will sync the data.
@@ -152,12 +155,24 @@ int ext4_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
needs_barrier = true;
ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
-   issue_flush:
+issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
}
 out:
+   /*
+* Was there a metadata writeback error since last fsync?
+*
+* FIXME: ext4 tracks metadata with a whole-block device mapping. So,
+* if there is any sort of metadata writeback error, we'll report an
+* error on all open fds, even ones not associated with the inode
+*/
+   err = filemap_report_md_wb_err(file,
+   inode->i_sb->s_bdev->bd_inode->i_mapping);
+   if (!ret)
+   ret = err;
+
trace_ext4_sync_file_exit(inode, ret);
return ret;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d37c81f327e7..e98791ebee6f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -119,7 +119,7 @@ static struct file_system_type ext2_fs_type = {
.name   = "ext2",
.mount  = ext4_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV|FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("ext2");
 MODULE_ALIAS("ext2");
@@ -134,7 +134,7 @@ static struct file_system_type ext3_fs_type = {
.name   = "ext3",
.mount

[PATCH v6 13/13] btrfs: minimal conversion to errseq_t writeback error reporting on fsync

2017-06-12 Thread Jeff Layton

Internal callers of filemap_* functions are left as-is.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/btrfs/file.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index da1096eb1a40..4632f16bc49c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2011,7 +2011,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_trans_handle *trans;
struct btrfs_log_ctx ctx;
-   int ret = 0;
+   int ret = 0, err;
bool full_sync = 0;
u64 len;
 
@@ -2030,7 +2030,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 */
ret = start_ordered_ops(inode, start, end);
if (ret)
-   return ret;
+   goto out;
 
inode_lock(inode);
atomic_inc(>log_batch);
@@ -2227,6 +2227,9 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
ret = btrfs_end_transaction(trans);
}
 out:
+   err = filemap_report_wb_err(file);
+   if (!ret)
+   ret = err;
return ret > 0 ? -EIO : ret;
 }
 
-- 
2.13.0

[PATCH v6 15/20] fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set

2017-06-12 Thread Jeff Layton

Allow filesystems to opt-in to a final check of wb_err if FS_WB_ERRSEQ
is set. Technically, we could just plumb these calls into all of the
fsync operations, but I think this means less code, changes and churn.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/fs.h | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17ba6284ab14..ef3feeec80b2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1742,12 +1742,6 @@ static inline int call_mmap(struct file *file, struct 
vm_area_struct *vma)
return file->f_op->mmap(file, vma);
 }
 
-static inline int call_fsync(struct file *file, loff_t start, loff_t end,
-int datasync)
-{
-   return file->f_op->fsync(file, start, end, datasync);
-}
-
 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
  unsigned long nr_segs, unsigned long fast_segs,
  struct iovec *fast_pointer,
@@ -2583,6 +2577,20 @@ static inline errseq_t filemap_sample_wb_err(struct 
address_space *mapping)
return errseq_sample(>wb_err);
 }
 
+static inline int call_fsync(struct file *file, loff_t start, loff_t end,
+int datasync)
+{
+   int ret;
+
+   ret = file->f_op->fsync(file, start, end, datasync);
+   if (file->f_mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ) {
+   int ret2 = filemap_report_wb_err(file);
+   if (!ret)
+   ret = ret2;
+   }
+   return ret;
+}
+
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
-- 
2.13.0

[PATCH v6 20/20] btrfs: minimal conversion to errseq_t writeback error reporting on fsync

2017-06-12 Thread Jeff Layton

Set the FS_WB_ERRSEQ flag to opt-in to errseq_t based reporting.
Internal call to filemap_* functions are left as-is.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/btrfs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f1cdd5058f1..c99af09cd3e7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2184,7 +2184,7 @@ static struct file_system_type btrfs_fs_type = {
.name   = "btrfs",
.mount  = btrfs_mount,
.kill_sb= btrfs_kill_super,
-   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA,
+   .fs_flags   = FS_REQUIRES_DEV | FS_BINARY_MOUNTDATA | FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("btrfs");
 
-- 
2.13.0

[PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-12 Thread Jeff Layton

Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t
based error reporting. Internal filemap_* calls are left as-is for now.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/xfs/xfs_super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 455a575f101d..28d3be187025 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
.name   = "xfs",
.mount  = xfs_fs_mount,
.kill_sb= kill_block_super,
-   .fs_flags   = FS_REQUIRES_DEV,
+   .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,
 };
 MODULE_ALIAS_FS("xfs");
 
-- 
2.13.0

[PATCH v6 11/13] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-12 Thread Jeff Layton

I waxed a little loquacious here, but I figured that more detail was
better, and writeback error handling is so hard to get right.

Although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara <j...@suse.cz>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 Documentation/filesystems/vfs.txt | 50 ---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f42b90687d40..c3efdd833a3d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,49 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure should call
+filemap_set_wb_err to record the error in the address_space when it
+occurs.  Then, at the end of their fsync operation, they should call
+filemap_report_wb_err to ensure that the struct file's error cursor
+has advanced to the correct point in the stream of errors emitted by
+the backing device(s).
+
+Older kernels used a different method for tracking errors, based on flags
+in the address_space. We're currently switching everything over to use
+the infrastructure based on errseq_t values. During the transition,
+filesystem authors will want to also ensure their file_system_type has
+FS_WB_ERRSEQ set in fs_flags to ensure that shared infrastructure is
+aware of the model in use.
 
 struct address_space_operations
 ---
@@ -804,7 +846,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +930,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.13.0

[PATCH v6 10/13] ext4: add more robust reporting of metadata writeback errors

2017-06-12 Thread Jeff Layton

ext4 uses the blockdev mapping for tracking metadata stored in the
pagecache. Sample its wb_err when opening a file and store that in
the f_md_wb_err field.

Change ext4_sync_file to check for data errors first, and then check the
blockdev mapping for metadata errors afterward.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors when there is a metadata writeback failure.
That's not ideal, but we can possibly improve upon it in the future.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/ext4/dir.c   |  8 ++--
 fs/ext4/file.c  |  5 -
 fs/ext4/fsync.c | 13 +
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 831fd6beebf0..fe0d6e01c4b7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -435,7 +435,10 @@ static int ext4_file_open(struct inode * inode, struct 
file * filp)
if (ret < 0)
return ret;
}
-   return dquot_file_open(inode, filp);
+   ret = dquot_file_open(inode, filp);
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 /*
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 03d6259d8662..36363a6730d7 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -165,6 +165,19 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t 
end, int datasync)
err = filemap_report_wb_err(file);
if (!ret)
ret = err;
+
+   /*
+* Was there a metadata writeback error since last fsync?
+*
+* FIXME: ext4 tracks metadata with a whole-block device mapping. So,
+* if there is any sort of metadata writeback error, we'll report an
+* error on all open fds, even ones not associated with the inode
+*/
+   err = filemap_report_md_wb_err(file,
+   inode->i_sb->s_bdev->bd_inode->i_mapping);
+   if (!ret)
+   ret = err;
+
trace_ext4_sync_file_exit(inode, ret);
return ret;
 }
-- 
2.13.0

[PATCH v6 07/20] mm: clean up error handling in write_one_page

2017-06-12 Thread Jeff Layton

Don't try to check PageError since that's potentially racy and not
necessarily going to be set after writepage errors out.

Instead, check the mapping for an error after writepage returns.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 mm/page-writeback.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b901fe52b153..e369e8ea2a29 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2371,14 +2371,13 @@ int do_writepages(struct address_space *mapping, struct 
writeback_control *wbc)
  *
  * The page must be locked by the caller and will be unlocked upon return.
  *
- * write_one_page() returns a negative error code if I/O failed. Note that
- * the address_space is not marked for error. The caller must do this if
- * needed.
+ * Note that the mapping's AS_EIO/AS_ENOSPC flags will be cleared when this
+ * function returns.
  */
 int write_one_page(struct page *page)
 {
struct address_space *mapping = page->mapping;
-   int ret = 0;
+   int ret = 0, ret2;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = 1,
@@ -2391,15 +2390,15 @@ int write_one_page(struct page *page)
if (clear_page_dirty_for_io(page)) {
get_page(page);
ret = mapping->a_ops->writepage(page, );
-   if (ret == 0) {
+   if (ret == 0)
wait_on_page_writeback(page);
-   if (PageError(page))
-   ret = -EIO;
-   }
put_page(page);
} else {
unlock_page(page);
}
+
+   if (!ret)
+   ret = filemap_check_errors(mapping);
return ret;
 }
 EXPORT_SYMBOL(write_one_page);
-- 
2.13.0

[PATCH v6 00/20] fs: enhanced writeback error reporting with errseq_t (pile #1)

2017-06-12 Thread Jeff Layton

v6:
===
This is the sixth posting of the patchset to revamp the way writeback
errors are tracked and reported.

This is a smaller set than the last one. The main difference from the
last set is that this one just adds errseq_t based error reporting for
the purposes of fsync, while leaving the internal callers of filemap_*
functions and the like largely untouched.

Some of these patches have been posted separately, but I'm re-posting
them here to make it clear that they're prerequisites to the later
patches in the series.

Background:
===
The basic problem is that we have (for a very long time) tracked and
reported writeback errors based on two flags in the address_space:
AS_EIO and AS_ENOSPC. Those flags are cleared when they are checked,
so only the first caller to check them is able to consume them.

That model is quite unreliable though, for several related reasons:

* only the first fsync caller on the inode will see the error. In a
  world of containerized setups, that's no longer viable. Applications
  need to know that their writes are safely stored, and they can
  currently miss seeing errors that they should be aware of when
  they're not.

* there are a lot of internal callers to filemap_fdatawait* and
  filemap_write_and_wait* that clear the error flags but then never
  report them to userland in any fashion.

* Some internal callers report writeback errors, but can do so at
  non-sensical times. For instance, we might want to truncate a file,
  which triggers a pagecache flush. If that writeback fails, we might
  report that error to the truncate caller, but a subsequent fsync
  will likely not see it.

* Some internal callers try to reset the error flags after clearing
  them, but that's racy. Another task could check the flags between
  those two events.

Solution:
=
This patchset adds a new datatype called an errseq_t that represents a
sequence of errors. It's a u32, with a field for a POSIX-flavor error
and a counter, managed with atomics. We can sample that value at a
particular point in time, and can later tell whether there have been any
errors since that point.

That allows us to provide traditional check-and-clear fsync semantics
on every open file description in a lightweight fashion. fsync callers
no longer need to coordinate between one another in order to ensure
that errors at fsync time are handled correctly.

Strategy:
=
The aim with this set is to do the minimum possible to support for
reliable reporting of errors on fsync, without substantially changing
the internals of the filesystems themselves.

Most of the internal calls to filemap_fdatawait are left alone, so all
of the internal error error checking is done using the traditional flag
based checks. The only real difference here is more reliable reporting
of errors at fsync.

I think that we probably will want to eventually convert all of the
internal callers to use errseq_t based reporting too, but that can be
done in an incremental fashion in follow-on patchsets.

Testing:

I've primarily been testing this with a couple of new xfstests that I
will post separately. These tests use dm-error fault injection to flip
the underlying block device to start throwing I/O errors, and then test
the behavior of the filesystem layer on top of that.

Jeff Layton (20):
  mm: fix mapping_set_error call in me_pagecache_dirty
  buffer: use mapping_set_error instead of setting the flag
  fs: check for writeback errors after syncing out buffers in
generic_file_fsync
  buffer: set errors in mapping at the time that the error occurs
  mm: don't TestClearPageError in __filemap_fdatawait_range
  mm: drop "wait" parameter from write_one_page
  mm: clean up error handling in write_one_page
  lib: add errseq_t type and infrastructure for handling it
  fs: new infrastructure for writeback error handling and reporting
  mm: tracepoints for writeback error events
  mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error
  fs: add a new fstype flag to indicate how writeback errors are tracked
  Documentation: flesh out the section in vfs.txt on storing and
reporting writeback errors
  dax: set errors in mapping when writeback fails
  fs: have call_fsync call filemap_report_wb_err if FS_WB_ERRSEQ is set
  block: convert to errseq_t based writeback error tracking
  fs: add f_md_wb_err field to struct file for tracking metadata errors
  ext4: use errseq_t based error handling for reporting data writeback
errors
  xfs: minimal conversion to errseq_t writeback error reporting
  btrfs: minimal conversion to errseq_t writeback error reporting on
fsync

 Documentation/filesystems/vfs.txt |  48 -
 drivers/dax/device.c  |   1 +
 fs/block_dev.c|   2 +
 fs/btrfs/super.c  |   2 +-
 fs/buffer.c   |  20 ++--
 fs/dax.c  |  18 +++-
 fs/exofs/dir.c|   2 +-
 fs/ext2/dir.c

Re: [xfstests PATCH v3 5/5] btrfs: allow it to use $SCRATCH_LOGDEV

2017-06-08 Thread Jeff Layton

On Tue, 2017-06-06 at 17:19 +0800, Eryu Guan wrote:
> On Wed, May 31, 2017 at 09:08:20AM -0400, Jeff Layton wrote:
> > With btrfs, we can't really put the log on a separate device. What we
> > can do however is mirror the metadata across two devices and make the
> > data striped across all devices. When we turn on dmerror then the
> > metadata can fall back to using the other mirror while the data errors
> > out.
> > 
> > Note that the current incarnation of btrfs has a fixed 64k stripe
> > width. If that ever changes or becomes settable, we may need to adjust
> > the amount of data that the test program writes.
> > 
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > ---
> >  common/rc | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/common/rc b/common/rc
> > index 83765aacfb06..078270451b53 100644
> > --- a/common/rc
> > +++ b/common/rc
> > @@ -830,6 +830,8 @@ _scratch_mkfs()
> > ;;
> > btrfs)
> > mkfs_cmd="$MKFS_BTRFS_PROG"
> > +   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
> > +   mkfs_cmd="$mkfs_cmd -d raid0 -m raid1 $SCRATCH_LOGDEV"
> 
> I don't think this is the correct way to do it. If btrfs doesn't support
> external log device, then this test doesn't fit btrfs, or we need other
> ways to test btrfs.
> 
> One of the problems of this hack is that raid1 requires all devices are
> in the same size, we have a _require_scratch_dev_pool_equal_size() rule
> to check on it, but this hack doesn't do the proper check and test fails
> if SCRATCH_LOGDEV is smaller or bigger in size.
> 
> If btrfs "-d raid0 -m raid1" is capable to do this writeback error test,
> perhaps you can write a new btrfs test and mkfs with "-d raid0 -m raid1"
> explicitly. e.g.
> 
> ...
> _require_scratch_dev_pool 2
> _require_scratch_dev_pool_equal_size
> ...
> _scratch_mkfs "-d raid0 -m raid1"
> ...
> 
> Thanks,
> Eryu


Yeah, that's probably the right way to do this. It looks like btrfs also
has $SCRATCH_DEV_POOL, and we can probably base it on that. I'll look at
reworking it.

-- 
Jeff Layton <jlay...@redhat.com>

Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Jeff Layton

On Tue, 2017-06-06 at 10:17 -0700, Darrick J. Wong wrote:
> On Tue, Jun 06, 2017 at 08:23:25PM +0800, Eryu Guan wrote:
> > On Tue, Jun 06, 2017 at 06:15:57AM -0400, Jeff Layton wrote:
> > > On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> > > > On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > > > > I'm working on a set of kernel patches to change how writeback errors
> > > > > are handled and reported in the kernel. Instead of reporting a
> > > > > writeback error to only the first fsync caller on the file, I aim
> > > > > to make the kernel report them once on every file description.
> > > > > 
> > > > > This patch adds a test for the new behavior. Basically, open many fds
> > > > > to the same file, turn on dm_error, write to each of the fds, and then
> > > > > fsync them all to ensure that they all get an error back.
> > > > > 
> > > > > To do that, I'm adding a new tools/dmerror script that the C program
> > > > > can use to load the error table. For now, that's all it can do, but
> > > > > we can fill it out with other commands as necessary.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > > > 
> > > > Thanks for the new tests! And sorry for the late review..
> > > > 
> > > > It's testing a new behavior on error reporting on writeback, I'm not
> > > > sure if we can call it a new feature or it fixed a bug? But it's more
> > > > like a behavior change, I'm not sure how to categorize it.
> > > > 
> > > > Because if it's testing a new feature, we usually let test do proper
> > > > detection of current test environment (based on actual behavior not
> > > > kernel version) and _notrun on filesystems that don't have this feature
> > > > yet, instead of failing the test; if it's testing a bug fix, we could
> > > > leave the test fail on unfixed filesystems, this also serves as a
> > > > reminder that there's bug to fix.
> > > > 
> > > 
> > > Thanks for the review! I'm not sure how to categorize this either. Since
> > > the plan is to convert all the filesystems piecemeal, maybe we should
> > > just consider it a new feature.
> > 
> > Then we need a new _require rule to properly detect for the 'feature'
> > support. I'm not sure if this is doable, but something like
> > _require_statx, _require_seek_data_hole would be good.
> > 
> > > 
> > > > I pulled your test kernel tree, and test passed on EXT4 but failed on
> > > > other local filesystems (XFS, btrfs). I assume that's expected.
> > > > 
> > > > Besides this kinda high-level question, some minor comments inline.
> > > > 
> > > 
> > > Yes, ext4 should pass on my latest kernel tree, but everything else
> > > should fail. 

Oh, and I should mention that ext2/3 also pass when mounted using ext4
driver. Legacy ext2 driver sort of works, but it reports a few too many
errors because of the way the ext2_error macro works. That shouldn't be
too hard to fix, I just need some guidance on that one.

I had xfs and btrfs working with an earlier iteration of the patches,
but now that we're converting a fs at a time, it's a little more work to
get there. It shouldn't be too hard to do though. I'll probably re-post
in a few days, and will try to take a stab at XFS and btrfs conversion
too.

> > 
> > With the new _require rule, test should _notrun on XFS and btrfs then.
> 
> Frankly I personally prefer that upstream XFS fails until someone fixes it. :)
> (But that's just my opinion.)
> 
> That said, I'm not 100% sure what's required of XFS to play nicely with
> this new mechanism -- glancing at the ext* patches it looks like we'd
> need to set a fs flag and possibly change some or all of the "write
> cached dirty buffers out to disk" calls to their _since variants?

Yeah, that's pretty much the size of it.

In fact, the latter part (changing to the _since variants) is somewhat
optional. We can have the errseq_t based tracking coexist with the
AS_EIO/AS_ENOSPC flags. It's weird but I don't see a real downside to
preserving them until we've got more of this converted over.

In the latest branch I'm working on, I'm breaking up those changes into
different patches so it should be a little clearer for other fs
maintainers to see what I'm doing and why. Stay tuned...

> Metadata writeback errors are handled by retrying writes and/or shutting
> down the fs, so I think the f_md_wb_error case is already covered.

Thanks. I think we do need f_md_wb_err for ext2/4 though, IIUC?

> 
> That said, I agree that it's useful to detect that the kernel simply
> lacks any of the new wb error reporting at all, so therefore we can skip
> the tests.
> 

Suggestions on ways to implement such a check would be welcome. Maybe a
file in /sys or in debugfs?

-- 
Jeff Layton <jlay...@redhat.com>

Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Jeff Layton

On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > I'm working on a set of kernel patches to change how writeback errors
> > are handled and reported in the kernel. Instead of reporting a
> > writeback error to only the first fsync caller on the file, I aim
> > to make the kernel report them once on every file description.
> > 
> > This patch adds a test for the new behavior. Basically, open many fds
> > to the same file, turn on dm_error, write to each of the fds, and then
> > fsync them all to ensure that they all get an error back.
> > 
> > To do that, I'm adding a new tools/dmerror script that the C program
> > can use to load the error table. For now, that's all it can do, but
> > we can fill it out with other commands as necessary.
> > 
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> 
> Thanks for the new tests! And sorry for the late review..
> 
> It's testing a new behavior on error reporting on writeback, I'm not
> sure if we can call it a new feature or it fixed a bug? But it's more
> like a behavior change, I'm not sure how to categorize it.
> 
> Because if it's testing a new feature, we usually let test do proper
> detection of current test environment (based on actual behavior not
> kernel version) and _notrun on filesystems that don't have this feature
> yet, instead of failing the test; if it's testing a bug fix, we could
> leave the test fail on unfixed filesystems, this also serves as a
> reminder that there's bug to fix.
> 

Thanks for the review! I'm not sure how to categorize this either. Since
the plan is to convert all the filesystems piecemeal, maybe we should
just consider it a new feature.

> I pulled your test kernel tree, and test passed on EXT4 but failed on
> other local filesystems (XFS, btrfs). I assume that's expected.
> 
> Besides this kinda high-level question, some minor comments inline.
> 

Yes, ext4 should pass on my latest kernel tree, but everything else
should fail. 

> > ---
> >  common/dmerror |  13 ++--
> >  doc/auxiliary-programs.txt |   8 +++
> >  src/Makefile   |   2 +-
> >  src/fsync-err.c| 161 
> > +
> 
> New binary needs an entry in .gitignore file.
> 

OK, thanks. Will fix.

> >  tests/generic/999  |  76 +
> >  tests/generic/999.out  |   3 +
> >  tests/generic/group|   1 +
> >  tools/dmerror  |  44 +
> 
> This file is used by the test, then it should be in src/ directory and
> be installed along with other executable files on "make install".
> Because files under tools/ are not installed. Most people will run tests
> in the root dir of xfstests and this is not a problem, but there're
> still cases people do "make && make install" and run fstests from
> /var/lib/xfstests (default installation target).
> 

Ok, no problem. I'll move it. I wasn't sure here since dmerror is a
shell script, and most of the stuff in src/ is stuff that needs to be
built.
 
> >  8 files changed, 302 insertions(+), 6 deletions(-)
> >  create mode 100644 src/fsync-err.c
> >  create mode 100755 tests/generic/999
> >  create mode 100644 tests/generic/999.out
> >  create mode 100755 tools/dmerror
> > 
> > diff --git a/common/dmerror b/common/dmerror
> > index d46c5d0b7266..238baa213b1f 100644
> > --- a/common/dmerror
> > +++ b/common/dmerror
> > @@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
> > _notrun "Cannot run tests with DAX on dmerror devices"
> >  fi
> >  
> > -_dmerror_init()
> > +_dmerror_setup()
> >  {
> > local dm_backing_dev=$SCRATCH_DEV
> >  
> > -   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> > -
> > local blk_dev_size=`blockdev --getsz $dm_backing_dev`
> >  
> > DMERROR_DEV='/dev/mapper/error-test'
> >  
> > DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
> >  
> > +   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
> > +}
> > +
> > +_dmerror_init()
> > +{
> > +   _dmerror_setup
> > +   $DMSETUP_PROG remove error-test > /dev/null 2>&1
> > $DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
> > _fatal "failed to create dm linear device"
> > -
> > -   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
> >  }
> >  
> >  _dmerror_mount()
> > diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-pro

Re: [PATCH v5 08/17] dax: set errors in mapping when writeback fails

2017-06-05 Thread Jeff Layton

On Mon, 2017-06-05 at 19:01 -0600, Ross Zwisler wrote:
> On Wed, May 31, 2017 at 08:45:31AM -0400, Jeff Layton wrote:
> > Jan's description for this patch is much better than mine, so I'm
> > quoting it verbatim here:
> > 
> > -8<-
> > DAX currently doesn't set errors in the mapping when cache flushing
> > fails in dax_writeback_mapping_range(). Since this function can get
> > called only from fsync(2) or sync(2), this is actually as good as it can
> > currently get since we correctly propagate the error up from
> > dax_writeback_mapping_range() to filemap_fdatawrite()
> > 
> > However, in the future better writeback error handling will enable us to
> > properly report these errors on fsync(2) even if there are multiple file
> > descriptors open against the file or if sync(2) gets called before
> > fsync(2). So convert DAX to using standard error reporting through the
> > mapping.
> > -8<-
> > 
> > For now, only do this when the FS_WB_ERRSEQ flag is set. The
> > AS_EIO/AS_ENOSPC flags are not currently cleared in the older code when
> > writeback initiation fails, only when we discover an error after waiting
> > on writeback to complete, so we only want to do this with errseq_t based
> > error handling to prevent seeing duplicate errors on fsync.
> > 
> > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > Reviewed-by: Jan Kara <j...@suse.cz>
> > Reviewed-by: Christoph Hellwig <h...@lst.de>
> > Reviewed-and-Tested-by: Ross Zwisler <ross.zwis...@linux.intel.com>
> 
> Re-tested this version of the series with some injected DAX errors, and it
> looks good.

Excellent! Thanks very much for helping test it.

-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton

On Wed, 2017-05-31 at 14:37 -0700, Andrew Morton wrote:
> On Wed, 31 May 2017 17:31:49 -0400 Jeff Layton <jlay...@redhat.com> wrote:
> 
> > On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> > > On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton <jlay...@redhat.com> wrote:
> > > 
> > > > This is v5 of the patchset to improve how we're tracking and reporting
> > > > errors that occur during pagecache writeback.
> > > 
> > > I'm curious to know how you've been testing this?
> > >  Is that testing
> > > strong enough for us to be confident that all nature of I/O errors
> > > will be reported to userspace?
> > > 
> > 
> > That's a tall order. This is a difficult thing to test as these sorts of
> > errors are pretty rare by nature.
> > 
> > I have an xfstest that I posted just after this set that demonstrates
> > that it works correctly, at least on ext2/3/4 when run by the ext4
> > driver (ext2 legacy driver reports too many errors currently). I had
> > btrfs and xfs working on that test too in an earlier incarnation of this
> > set, so I think we can fix this in them as well without too much
> > difficulty.
> > 
> > I'm happy to run other tests if someone wants to suggest them.
> > 
> > Now, all that said, I don't think this will make things any worse than
> > they are today as far as reporting errors properly to userland goes.
> > It's rather easy for an incidental synchronous writeback request from an
> > internal caller to clear the AS_* flags today. This will at least ensure
> > that we're reporting errors since a well-defined point in time when you
> > call fsync.
> 
> Were you using error injection of some form?  If so, how was that all
> set up?
> 

Yes, it uses dm-error for fault injection.

The test basically does:

1) set up a dm-error device in a working configuration

2) build a scratch filesystem on it, with the log on a different device
in some fashion so metadata writeback will still succeed.

3) open the same file several times

4) flip dm-error device to non-working mode

5) write to each fd

6) fsync each fd

...do you get back an error on each fsync?

It then does a bit more to make sure they're cleared afterward as you'd
expect. That works for most block device based filesystems. I also have
a second xfstest that opens a block device and does the same basic
thing. That also works correctly with this patch series.

I still need to come up with a way to simulate errors on other fs'
though. We may need to plumb in some kernel-level fault injection on
some fs' to do that correctly. Suggestions welcome there.

With this series though, the idea is to convert one filesystem at a
time, so I think that should help mitigate some of the risk.

-- 
Jeff Layton <jlay...@redhat.com>

Re: [PATCH v5 00/17] fs: introduce new writeback error reporting and convert ext2 and ext4 to use it

2017-05-31 Thread Jeff Layton

On Wed, 2017-05-31 at 13:27 -0700, Andrew Morton wrote:
> On Wed, 31 May 2017 08:45:23 -0400 Jeff Layton <jlay...@redhat.com> wrote:
> 
> > This is v5 of the patchset to improve how we're tracking and reporting
> > errors that occur during pagecache writeback.
> 
> I'm curious to know how you've been testing this?

>  Is that testing
> strong enough for us to be confident that all nature of I/O errors
> will be reported to userspace?
> 

That's a tall order. This is a difficult thing to test as these sorts of
errors are pretty rare by nature.

I have an xfstest that I posted just after this set that demonstrates
that it works correctly, at least on ext2/3/4 when run by the ext4
driver (ext2 legacy driver reports too many errors currently). I had
btrfs and xfs working on that test too in an earlier incarnation of this
set, so I think we can fix this in them as well without too much
difficulty.

I'm happy to run other tests if someone wants to suggest them.

Now, all that said, I don't think this will make things any worse than
they are today as far as reporting errors properly to userland goes.
It's rather easy for an incidental synchronous writeback request from an
internal caller to clear the AS_* flags today. This will at least ensure
that we're reporting errors since a well-defined point in time when you
call fsync.
-- 
Jeff Layton <jlay...@redhat.com>

[xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-05-31 Thread Jeff Layton

I'm working on a set of kernel patches to change how writeback errors
are handled and reported in the kernel. Instead of reporting a
writeback error to only the first fsync caller on the file, I aim
to make the kernel report them once on every file description.

This patch adds a test for the new behavior. Basically, open many fds
to the same file, turn on dm_error, write to each of the fds, and then
fsync them all to ensure that they all get an error back.

To do that, I'm adding a new tools/dmerror script that the C program
can use to load the error table. For now, that's all it can do, but
we can fill it out with other commands as necessary.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 common/dmerror |  13 ++--
 doc/auxiliary-programs.txt |   8 +++
 src/Makefile   |   2 +-
 src/fsync-err.c| 161 +
 tests/generic/999  |  76 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   1 +
 tools/dmerror  |  44 +
 8 files changed, 302 insertions(+), 6 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

diff --git a/common/dmerror b/common/dmerror
index d46c5d0b7266..238baa213b1f 100644
--- a/common/dmerror
+++ b/common/dmerror
@@ -23,22 +23,25 @@ if [ $? -eq 0 ]; then
_notrun "Cannot run tests with DAX on dmerror devices"
 fi
 
-_dmerror_init()
+_dmerror_setup()
 {
local dm_backing_dev=$SCRATCH_DEV
 
-   $DMSETUP_PROG remove error-test > /dev/null 2>&1
-
local blk_dev_size=`blockdev --getsz $dm_backing_dev`
 
DMERROR_DEV='/dev/mapper/error-test'
 
DMLINEAR_TABLE="0 $blk_dev_size linear $dm_backing_dev 0"
 
+   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
+}
+
+_dmerror_init()
+{
+   _dmerror_setup
+   $DMSETUP_PROG remove error-test > /dev/null 2>&1
$DMSETUP_PROG create error-test --table "$DMLINEAR_TABLE" || \
_fatal "failed to create dm linear device"
-
-   DMERROR_TABLE="0 $blk_dev_size error $dm_backing_dev 0"
 }
 
 _dmerror_mount()
diff --git a/doc/auxiliary-programs.txt b/doc/auxiliary-programs.txt
index 21ef118596b6..191ac0596511 100644
--- a/doc/auxiliary-programs.txt
+++ b/doc/auxiliary-programs.txt
@@ -16,6 +16,7 @@ note the dependency with:
 Contents:
 
  - af_unix -- Create an AF_UNIX socket
+ - fsync-err   -- tests fsync error reporting after failed writeback
  - open_by_handle  -- open_by_handle_at syscall exercise
  - stat_test   -- statx syscall exercise
  - t_dir_type  -- print directory entries and their file type
@@ -30,6 +31,13 @@ af_unix
 
The af_unix program creates an AF_UNIX socket at the given location.
 
+fsync-err
+   Specialized program for testing how the kernel reports errors that
+   occur during writeback. Works in conjunction with the dmerror script
+   in tools/ to write data to a device, and then force it to fail
+   writeback and test that errors are reported during fsync and cleared
+   afterward.
+
 open_by_handle
 
The open_by_handle program exercises the open_by_handle_at() system
diff --git a/src/Makefile b/src/Makefile
index 4ec01975f8f7..b79c4d84d31b 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -13,7 +13,7 @@ TARGETS = dirstress fill fill2 getpagesize holes lstat64 \
multi_open_unlink dmiperf unwritten_sync genhashnames t_holes \
t_mmap_writev t_truncate_cmtime dirhash_collide t_rename_overwrite \
holetest t_truncate_self t_mmap_dio af_unix t_mmap_stale_pmd \
-   t_mmap_cow_race
+   t_mmap_cow_race fsync-err
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
diff --git a/src/fsync-err.c b/src/fsync-err.c
new file mode 100644
index ..cbeb37fb1790
--- /dev/null
+++ b/src/fsync-err.c
@@ -0,0 +1,161 @@
+/*
+ * fsync-err.c: test whether writeback errors are reported to all open fds
+ * and properly cleared as expected after being seen once on each
+ *
+ * Copyright (c) 2017: Jeff Layton <jlay...@redhat.com>
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * btrfs has a fixed stripewidth of 64k, so we need to write enough data to
+ * ensure that we hit both stripes.
+ *
+ * FIXME: have the test script pass in the length?
+ */
+#define BUFSIZE (65 * 1024)
+
+/* FIXME: should this be tunable */
+#define NUM_FDS10
+
+static void usage() {
+   fprintf(stderr, "Usage: fsync-err \n");
+}
+
+int main(int argc, char **argv)
+{
+   int fd[NUM_FDS], ret, i;
+   char *fname, *buf;
+
+   if (argc < 1) {
+

[xfstests PATCH v3 0/5] add a test for reporting writeback errors across all fds on fsync

2017-05-31 Thread Jeff Layton

This patchset is a companion to the Linux kernel patch series I recently
posted with the cover letter:

[PATCH v5 00/17] fs: introduce new writeback error reporting and convert 
ext2 and ext4 to use it

That patchset adds a new userland-visible change to report errors on
all open file descriptions when there is an error on fsync, not just
the first one to race in.

Note that this set contains a patch to emulate $SCRATCH_LOGDEV on btrfs,
but the kernel patches for that are not quite ready yet. The test did
pass on btrfs in an earlier incarnation of the set, however.

Jeff Layton (5):
  generic: add a writeback error handling test
  ext4: allow ext4 to use $SCRATCH_LOGDEV
  generic: test writeback error handling on dmerror devices
  ext3: allow it to put journal on a separate device when doing
scratch_mkfs
  btrfs: allow it to use $SCRATCH_LOGDEV

 common/dmerror |  13 ++--
 common/rc  |  16 -
 doc/auxiliary-programs.txt |   8 +++
 src/Makefile   |   2 +-
 src/fsync-err.c| 161 +
 tests/generic/998  |  64 ++
 tests/generic/998.out  |   2 +
 tests/generic/999  |  76 +
 tests/generic/999.out  |   3 +
 tests/generic/group|   2 +
 tools/dmerror  |  44 +
 11 files changed, 384 insertions(+), 7 deletions(-)
 create mode 100644 src/fsync-err.c
 create mode 100755 tests/generic/998
 create mode 100644 tests/generic/998.out
 create mode 100755 tests/generic/999
 create mode 100644 tests/generic/999.out
 create mode 100755 tools/dmerror

-- 
2.9.4

[xfstests PATCH v3 2/5] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-05-31 Thread Jeff Layton

The writeback error handling test requires that you put the journal on a
separate device. This allows us to use dmerror to simulate data
writeback failure, without affecting the journal.

xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire
up the ext4 code so that it can do the same thing when _scratch_mkfs is
called.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 common/rc | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/common/rc b/common/rc
index 743df427c047..391d36f373cd 100644
--- a/common/rc
+++ b/common/rc
@@ -676,6 +676,9 @@ _scratch_mkfs_ext4()
local tmp=`mktemp`
local mkfs_status
 
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
 
_scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
1>$tmp.mkfsstd
mkfs_status=$?
-- 
2.9.4

[xfstests PATCH v3 5/5] btrfs: allow it to use $SCRATCH_LOGDEV

2017-05-31 Thread Jeff Layton

With btrfs, we can't really put the log on a separate device. What we
can do however is mirror the metadata across two devices and make the
data striped across all devices. When we turn on dmerror then the
metadata can fall back to using the other mirror while the data errors
out.

Note that the current incarnation of btrfs has a fixed 64k stripe
width. If that ever changes or becomes settable, we may need to adjust
the amount of data that the test program writes.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 common/rc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/rc b/common/rc
index 83765aacfb06..078270451b53 100644
--- a/common/rc
+++ b/common/rc
@@ -830,6 +830,8 @@ _scratch_mkfs()
;;
btrfs)
mkfs_cmd="$MKFS_BTRFS_PROG"
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   mkfs_cmd="$mkfs_cmd -d raid0 -m raid1 $SCRATCH_LOGDEV"
mkfs_filter="cat"
;;
ext3)
-- 
2.9.4

[xfstests PATCH v3 4/5] ext3: allow it to put journal on a separate device when doing scratch_mkfs

2017-05-31 Thread Jeff Layton

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 common/rc | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index 391d36f373cd..83765aacfb06 100644
--- a/common/rc
+++ b/common/rc
@@ -832,7 +832,16 @@ _scratch_mkfs()
mkfs_cmd="$MKFS_BTRFS_PROG"
mkfs_filter="cat"
;;
-   ext2|ext3)
+   ext3)
+   mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
+   mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
+
+   # put journal on separate device?
+   [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
+   $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
+   mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
+   ;;
+   ext2)
mkfs_cmd="$MKFS_PROG -t $FSTYP -- -F"
mkfs_filter="grep -v -e ^Warning: -e \"^mke2fs \""
;;
-- 
2.9.4

[PATCH v5 02/17] fs: new infrastructure for writeback error handling and reporting

2017-05-31 Thread Jeff Layton

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: Jan Kara <j...@suse.cz>
---
 drivers/dax/device.c |  1 +
 fs/block_dev.c   |  1 +
 fs/file_table.c  |  1 +
 fs/open.c|  3 +++
 include/linux/fs.h   | 53 
 mm/filemap.c | 38 +
 6 files changed, 97 insertions(+)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 006e657dfcb9..12943d19bfc4 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -499,6 +499,7 @@ static int dax_open(struct inode *inode, struct file *filp)
inode->i_mapping = __dax_inode->i_mapping;
inode->i_mapping->host = __dax_inode;
filp->f_mapping = inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
filp->private_data = dev_dax;
inode->i_flags = S_DAX;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 51959936..4d62fe771587 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1743,6 +1743,7 @@ static int blkdev_open(struct inode * inode, struct file 
* filp)
return -ENOMEM;
 
filp->f_mapping = bdev->bd_inode->i_mapping;
+   filp->f_wb_err = filemap_sample_wb_err(filp->f_mapping);
 
return blkdev_get(bdev, filp->f_mode, filp);
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 954d510b765a..72e861a35a7f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(const struct path *path, fmode_t 
mode,
file->f_path = *path;
file->f_inode = path->dentry->d_inode;
file->f_mapping = path->dentry->d_inode->i_mapping;
+   file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
if ((mode & FMODE_READ) &&
 likely(fop->read || fop->read_iter))
mode |= FMOD

[PATCH v5 01/17] lib: add errseq_t type and infrastructure for handling it

2017-05-31 Thread Jeff Layton

An errseq_t is a way of recording errors in one place, and allowing any
number of "subscribers" to tell whether an error has been set again
since a previous time.

It's implemented as an unsigned 32-bit value that is managed with atomic
operations. The low order bits are designated to hold an error code
(max size of MAX_ERRNO). The upper bits are used as a counter.

The API works with consumers sampling an errseq_t value at a particular
point in time. Later, that value can be used to tell whether new errors
have been set since that time.

Note that there is a 1 in 512k risk of collisions here if new errors
are being recorded frequently, since we have so few bits to use as a
counter. To mitigate this, one bit is used as a flag to tell whether the
value has been sampled since a new value was recorded. That allows
us to avoid bumping the counter if no one has sampled it since it
was last bumped.

Later patches will build on this infrastructure to change how writeback
errors are tracked in the kernel.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
Reviewed-by: NeilBrown <ne...@suse.com>
---
 include/linux/errseq.h |  19 +
 lib/Makefile   |   2 +-
 lib/errseq.c   | 200 +
 3 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/errseq.h
 create mode 100644 lib/errseq.c

diff --git a/include/linux/errseq.h b/include/linux/errseq.h
new file mode 100644
index ..0d2555f310cd
--- /dev/null
+++ b/include/linux/errseq.h
@@ -0,0 +1,19 @@
+#ifndef _LINUX_ERRSEQ_H
+#define _LINUX_ERRSEQ_H
+
+/* See lib/errseq.c for more info */
+
+typedef u32errseq_t;
+
+void __errseq_set(errseq_t *eseq, int err);
+static inline void errseq_set(errseq_t *eseq, int err)
+{
+   /* Optimize for the common case of no error */
+   if (unlikely(err))
+   __errseq_set(eseq, err);
+}
+
+errseq_t errseq_sample(errseq_t *eseq);
+int errseq_check(errseq_t *eseq, errseq_t since);
+int errseq_check_and_advance(errseq_t *eseq, errseq_t *since);
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 0166fbc0fa81..519782d9ca3f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -41,7 +41,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o 
random32.o \
 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-once.o refcount.o usercopy.o
+once.o refcount.o usercopy.o errseq.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/errseq.c b/lib/errseq.c
new file mode 100644
index ..d129c0611c1f
--- /dev/null
+++ b/lib/errseq.c
@@ -0,0 +1,200 @@
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * An errseq_t is a way of recording errors in one place, and allowing any
+ * number of "subscribers" to tell whether it has changed since a previous
+ * point where it was sampled.
+ *
+ * It's implemented as an unsigned 32-bit value. The low order bits are
+ * designated to hold an error code (between 0 and -MAX_ERRNO). The upper bits
+ * are used as a counter. This is done with atomics instead of locking so that
+ * these functions can be called from any context.
+ *
+ * The general idea is for consumers to sample an errseq_t value. That value
+ * can later be used to tell whether any new errors have occurred since that
+ * sampling was done.
+ *
+ * Note that there is a risk of collisions if new errors are being recorded
+ * frequently, since we have so few bits to use as a counter.
+ *
+ * To mitigate this, one bit is used as a flag to tell whether the value has
+ * been sampled since a new value was recorded. That allows us to avoid bumping
+ * the counter if no one has sampled it since the last time an error was
+ * recorded.
+ *
+ * A new errseq_t should always be zeroed out.  A errseq_t value of all zeroes
+ * is the special (but common) case where there has never been an error. An all
+ * zero value thus serves as the "epoch" if one wishes to know whether there
+ * has ever been an error set since it was first initialized.
+ */
+
+/* The low bits are designated for error code (max of MAX_ERRNO) */
+#define ERRSEQ_SHIFT   ilog2(MAX_ERRNO + 1)
+
+/* This bit is used as a flag to indicate whether the value has been seen */
+#define ERRSEQ_SEEN(1 << ERRSEQ_SHIFT)
+
+/* The lowest bit of the counter */
+#define ERRSEQ_CTR_INC (1 << (ERRSEQ_SHIFT + 1))
+
+/**
+ * __errseq_set - set a errseq_t for later reporting
+ * @eseq: errseq_t field that should be set
+ * @err: error to set
+ *
+ * This function sets the error in *eseq, and increments the sequence counter
+ * if the last sequence was sampled at some point in the past.
+ *
+ * Any error set will always overwrite an existing error.
+ *
+ * Most

[PATCH v5 04/17] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-05-31 Thread Jeff Layton

Now that we have new infrastructure for handling writeback errors using
errseq_t, we need to convert the existing code to use it. We could
attempt to retrofit the old interfaces on top of the new, but there is
a conceptual disconnect here in the case of internal callers that
invoke filemap_fdatawait and the like.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can report the
latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

This means that we need to touch each filesystem that calls
filemap_check_errors in some fashion and ensure that we establish sane
"since" values for those callers. But...some code is shared between
filesystems and needs to be able to handle both error tracking schemes.

Add a new FS_WB_ERRSEQ flag to the fstype. When mapping_set_error is
called, set mapping->wb_err if it's set, along with setting the
"legacy" AS_EIO/AS_ENOSPC flags. When calling filemap_report_wb_err,
always clear the legacy flags out as well.

This should allow subsystems to use the new errseq_t based error
reporting while simultaneously allowing the traditional semantics of
AS_EIO/AS_ENOSPC flags.

Eventually, this flag should be removed once everything is converted
to errseq_t based error tracking.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 include/linux/fs.h  |  1 +
 include/linux/pagemap.h | 32 ++--
 mm/filemap.c|  7 +++
 3 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 293cbc7f3520..2f3bcf4eb73b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2021,6 +2021,7 @@ struct file_system_type {
 #define FS_BINARY_MOUNTDATA2
 #define FS_HAS_SUBTYPE 4
 #define FS_USERNS_MOUNT8   /* Can be mounted by userns 
root */
+#define FS_WB_ERRSEQ   16  /* errseq_t writeback err tracking */
 #define FS_RENAME_DOES_D_MOVE  32768   /* FS will handle d_move() during 
rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
   const char *, void *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 316a19f6b635..1dbc2dd6fdd2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -28,14 +28,34 @@ enum mapping_flags {
AS_NO_WRITEBACK_TAGS = 5,
 };
 
+/**
+ * mapping_set_error - record a writeback error in the address_space
+ * @mapping - the mapping in which an error should be set
+ * @error - the error to set in the mapping
+ *
+ * When writeback fails in some way, we must record that error so that
+ * userspace can be informed when fsync and the like are called.  We endeavor
+ * to report errors on any file that was open at the time of the error.  Some
+ * internal callers also need to know when writeback errors have occurred.
+ *
+ * When a writeback error occurs, most filesystems will want to call
+ * mapping_set_error to record the error in the mapping so that it can be
+ * reported when the application calls fsync(2).
+ */
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-   if (unlikely(error)) {
-   if (error == -ENOSPC)
-   set_bit(AS_ENOSPC, >flags);
-   else
-   set_bit(AS_EIO, >flags);
-   }
+   if (likely(!error))
+   return;
+
+   /* Record it in wb_err if fs is using errseq_t based error tracking */
+   if (mapping->host->i_sb->s_type->fs_flags & FS_WB_ERRSEQ)
+   filemap_set_wb_err(mapping, error);
+
+   /* Unconditionally record it in flags for now, for legacy callers */
+   if (error == -ENOSPC)
+   set_bit(AS_ENOSPC, >flags);
+   else
+   set_bit(AS_EIO, >flags);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index c5e19ea0bf12..97dc28f853fc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -580,6 +580,13 @@ int filemap_report_wb_err(struct file *file)
trace_filemap_report_wb_err(file, old);
spin_unlock(>f_lock);
}
+
+   /* Now clear the AS_* flags if any are set */
+   if (test_bit(AS_ENOSPC, >flags))
+   clear_bit(AS_ENOSPC, >flags);
+   if (test_bit(AS_EIO, >flags))
+   clear_bit(AS_EIO, >flags);
+
return err;
 }
 EXPORT_SYMBOL(filemap_report_wb_err);
-- 
2.9.4

[PATCH v5 05/17] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-05-31 Thread Jeff Layton

I waxed a little loquacious here, but I figured that more detail was
better, and writeback error handling is so hard to get right.

Although I think we'll eventually remove it once the transition is
complete, I've gone ahead and documented the FS_WB_ERRSEQ flag as well.

Cc: Jan Kara <j...@suse.cz>
Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 Documentation/filesystems/vfs.txt | 50 ---
 1 file changed, 47 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt 
b/Documentation/filesystems/vfs.txt
index f42b90687d40..c3efdd833a3d 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -576,7 +576,49 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
actually
 written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
-Writeback makes use of a writeback_control structure...
+Writeback makes use of a writeback_control structure to direct the
+operations.  This gives the the writepage and writepages operations some
+information about the nature of and reason for the writeback request,
+and the constraints under which it is being done.  It is also used to
+return information back to the caller about the result of a writepage or
+writepages request.
+
+Handling errors during writeback
+
+Most applications that utilize the pagecache will periodically call
+fsync to ensure that data written has made it to the backing store.
+When there is an error during writeback, expect that error to be
+reported when fsync is called.  After an error has been reported to
+fsync, subsequent fsync calls on the same file descriptor should return
+0, unless further writeback errors have occurred since the previous
+fsync.
+
+Ideally, the kernel would report an error only on file descriptions on
+which writes were done that subsequently failed to be written back.  The
+generic pagecache infrastructure does not track the file descriptions
+that have dirtied each individual page however, so determining which
+file descriptors should get back an error is not possible.
+
+Instead, the generic writeback error tracking infrastructure in the
+kernel settles for reporting errors to fsync on all file descriptions
+that were open at the time that the error occurred.  In a situation with
+multiple writers, all of them will get back an error on a subsequent fsync,
+even if all of the writes done through that particular file descriptor
+succeeded (or even if there were no writes on that file descriptor at all).
+
+Filesystems that wish to use this infrastructure should call
+filemap_set_wb_err to record the error in the address_space when it
+occurs.  Then, at the end of their fsync operation, they should call
+filemap_report_wb_err to ensure that the struct file's error cursor
+has advanced to the correct point in the stream of errors emitted by
+the backing device(s).
+
+Older kernels used a different method for tracking errors, based on flags
+in the address_space. We're currently switching everything over to use
+the infrastructure based on errseq_t values. During the transition,
+filesystem authors will want to also ensure their file_system_type has
+FS_WB_ERRSEQ set in fs_flags to ensure that shared infrastructure is
+aware of the model in use.
 
 struct address_space_operations
 ---
@@ -804,7 +846,8 @@ struct address_space_operations {
 The File Object
 ===
 
-A file object represents a file opened by a process.
+A file object represents a file opened by a process. This is also known
+as an "open file description" in POSIX parlance.
 
 
 struct file_operations
@@ -887,7 +930,8 @@ otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call
+  fsync: called by the fsync(2) system call. Also see the section above
+entitled "Handling errors during writeback".
 
   fasync: called by the fcntl(2) system call when asynchronous
(non-blocking) mode is enabled for a file
-- 
2.9.4

[PATCH v5 14/17] ext4: convert to errseq_t based error tracking

2017-05-31 Thread Jeff Layton

Sample the block device inode's errseq_t when opening a file, so we can
catch metadata writeback errors at fsync time. Change ext4_sync_file to
check for data errors first, and then check the blockdev for metadata
errors afterward.

There are also several internal callers of filemap_write_and_wait_* that
check the error code afterward. Convert them to the "_since" variants,
using the file->f_wb_err value as the "since" value. This means passing
file pointers to several functions instead of inode pointers.

Note that because metadata writeback errors are only tracked on a
per-device level, this does mean that we'll end up reporting an error on
all open file descriptors when there is a metadata writeback failure.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/ext4/dir.c |  8 ++--
 fs/ext4/ext4.h|  8 
 fs/ext4/extents.c | 24 ++--
 fs/ext4/file.c|  5 -
 fs/ext4/fsync.c   | 23 ++-
 fs/ext4/inode.c   | 19 ---
 fs/ext4/ioctl.c   |  9 +
 fs/ext4/super.c   |  9 +
 8 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e8b365000d73..6bbb19510f74 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -611,9 +611,13 @@ static int ext4_dx_readdir(struct file *file, struct 
dir_context *ctx)
 
 static int ext4_dir_open(struct inode * inode, struct file * filp)
 {
+   int ret = 0;
+
if (ext4_encrypted_inode(inode))
-   return fscrypt_get_encryption_info(inode) ? -EACCES : 0;
-   return 0;
+   ret = fscrypt_get_encryption_info(inode) ? -EACCES : 0;
+   if (!ret)
+   filp->f_md_wb_err = 
filemap_sample_wb_err(inode->i_sb->s_bdev->bd_inode->i_mapping);
+   return ret;
 }
 
 static int ext4_release_dir(struct inode *inode, struct file *filp)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8e8046104f4d..e3ab27db43d0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2468,12 +2468,12 @@ extern void ext4_clear_inode(struct inode *);
 extern int  ext4_file_getattr(const struct path *, struct kstat *, u32, 
unsigned int);
 extern int  ext4_sync_inode(handle_t *, struct inode *);
 extern void ext4_dirty_inode(struct inode *, int);
-extern int ext4_change_inode_journal_flag(struct inode *, int);
+extern int ext4_change_inode_journal_flag(struct file *, int);
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
-extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
+extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
 extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int 
nblocks);
 extern void ext4_set_inode_flags(struct inode *);
 extern int ext4_alloc_da_blocks(struct inode *inode);
@@ -3143,8 +3143,8 @@ extern ext4_lblk_t ext4_ext_next_allocated_block(struct 
ext4_ext_path *path);
 extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
__u64 start, __u64 len);
 extern int ext4_ext_precache(struct inode *inode);
-extern int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len);
-extern int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len);
+extern int ext4_collapse_range(struct file *file, loff_t offset, loff_t len);
+extern int ext4_insert_range(struct file *file, loff_t offset, loff_t len);
 extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
struct inode *inode2, ext4_lblk_t lblk1,
 ext4_lblk_t lblk2,  ext4_lblk_t count,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2a97dff87b96..7e108fda9ae9 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4934,17 +4934,17 @@ long ext4_fallocate(struct file *file, int mode, loff_t 
offset, loff_t len)
return -EOPNOTSUPP;
 
if (mode & FALLOC_FL_PUNCH_HOLE)
-   return ext4_punch_hole(inode, offset, len);
+   return ext4_punch_hole(file, offset, len);
 
ret = ext4_convert_inline_data(inode);
if (ret)
return ret;
 
if (mode & FALLOC_FL_COLLAPSE_RANGE)
-   return ext4_collapse_range(inode, offset, len);
+   return ext4_collapse_range(file, offset, len);
 
if (mode & FALLOC_FL_INSERT_RANGE)
-   return ext4_insert_range(inode, offset, len);
+   return ext4_insert_range(file, offset, len);
 
if (mode & FALLOC_FL_ZERO_RANGE)
return ext4_zero_range(file, offset, len, mode);
@@ -5444,14 +5444,16 @@ ext4_ext_shift_extents(struct inode *inode, handle_t 
*handle,
  * This implements the fallocate's collapse range functionality for ext4
  * Returns: 0 and non-zero

[PATCH v5 12/17] fs: allow __generic_file_fsync to support both flavors of error reporting

2017-05-31 Thread Jeff Layton

For now, we add a FS_WB_ERRSEQ check to know how to handle it.

Signed-off-by: Jeff Layton <jlay...@redhat.com>
---
 fs/libfs.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 1dec90819366..2ae58a252718 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -971,10 +971,18 @@ int __generic_file_fsync(struct file *file, loff_t start, 
loff_t end,
 int datasync)
 {
struct inode *inode = file->f_mapping->host;
-   int err;
-   int ret;
-
-   err = filemap_write_and_wait_range(inode->i_mapping, start, end);
+   int err, ret;
+   bool use_errseq = inode->i_sb->s_type->fs_flags & FS_WB_ERRSEQ;
+   errseq_t since;
+
+   if (use_errseq) {
+   since = READ_ONCE(file->f_wb_err);
+   err = filemap_write_and_wait_range_since(inode->i_mapping,
+   start, end, since);
+   } else {
+   err = filemap_write_and_wait_range(inode->i_mapping,
+   start, end);
+   }
if (err)
return err;
 
@@ -988,11 +996,15 @@ int __generic_file_fsync(struct file *file, loff_t start, 
loff_t end,
err = sync_inode_metadata(inode, 1);
if (ret == 0)
ret = err;
-
 out:
inode_unlock(inode);
-   err = filemap_check_errors(inode->i_mapping);
-   return ret ? ret : err;
+   if (ret == 0) {
+   if (use_errseq)
+   err = filemap_check_wb_err(inode->i_mapping, since);
+   else
+   err = filemap_check_errors(inode->i_mapping);
+   }
+   return ret;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
-- 
2.9.4

1 2 >

1 - 100 of 159 matches

Mail list logo