subject:"\[RFC PATCH 0\/4\] fs\: introduce new writeback error tracking infrastructure and convert ext4 to use it"

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-10 Thread Jeff Layton

On Mon, 2017-04-10 at 09:15 +1000, NeilBrown wrote:
> On Fri, Apr 07 2017, Jeff Layton wrote:
> 
> > 
> > > ... and then there's no need to update if it's the same errno and nobody's
> > > seen it:
> > > 
> > >   if (old == new)
> > >   break;
> > > 
> > 
> > No, we can't do this. The thing could have just been updated by a task
> > that is setting the "seen" bit. We don't want to lose the error here. We
> > always have to do the cmpxchg on the set_wb_error side, I think.
> 
> I don't follow your logic.
> If (old == new) then there was a moment since this function started when
> performing the cmpxchg() would not have changed the contents of memory.
> So let's pretend it did actually happen at that moment, and not change
> memory.
> 
> If we race with a task setting the "seen" bit, then it will have seen
> the error *after* the new error, that this thread is reporting, actually
> happened.  So the result is still correct.
> 

Ok, that does make sense. I'll plan to do that.

There's also a bug in the last patch that I sent. We need to mark the
SEEN bit when we sample the value at open time, so we need a
filemap_sample_wb_error function to grab the current wb_err_t and mark
it SEEN if necessary.

That also gives us a way to handle something like filemap_write_and_wait
(which doesn't take a struct file). We can sample the wb_err_t prior to
starting writeback, and then return an error if anything failed after
that point.

I think that's probably close enough to how the current code works that
we can use it to make drop-in replacements for filemap_write_and_wait*
which should keep us from having to change so much existing code here.

filemap_check_errors would need to take a previously-sampled wb_err_t
argument, but only the lowest-level callers of that and
filemap_fdatawait* would need to deal with them directly.
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-10 Thread Jeff Layton

On Mon, 2017-04-10 at 09:15 +1000, NeilBrown wrote:
> On Fri, Apr 07 2017, Jeff Layton wrote:
> 
> > 
> > > ... and then there's no need to update if it's the same errno and nobody's
> > > seen it:
> > > 
> > >   if (old == new)
> > >   break;
> > > 
> > 
> > No, we can't do this. The thing could have just been updated by a task
> > that is setting the "seen" bit. We don't want to lose the error here. We
> > always have to do the cmpxchg on the set_wb_error side, I think.
> 
> I don't follow your logic.
> If (old == new) then there was a moment since this function started when
> performing the cmpxchg() would not have changed the contents of memory.
> So let's pretend it did actually happen at that moment, and not change
> memory.
> 
> If we race with a task setting the "seen" bit, then it will have seen
> the error *after* the new error, that this thread is reporting, actually
> happened.  So the result is still correct.
> 

Ok, that does make sense. I'll plan to do that.

There's also a bug in the last patch that I sent. We need to mark the
SEEN bit when we sample the value at open time, so we need a
filemap_sample_wb_error function to grab the current wb_err_t and mark
it SEEN if necessary.

That also gives us a way to handle something like filemap_write_and_wait
(which doesn't take a struct file). We can sample the wb_err_t prior to
starting writeback, and then return an error if anything failed after
that point.

I think that's probably close enough to how the current code works that
we can use it to make drop-in replacements for filemap_write_and_wait*
which should keep us from having to change so much existing code here.

filemap_check_errors would need to take a previously-sampled wb_err_t
argument, but only the lowest-level callers of that and
filemap_fdatawait* would need to deal with them directly.
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-09 Thread NeilBrown

On Fri, Apr 07 2017, Jeff Layton wrote:

>
>> ... and then there's no need to update if it's the same errno and nobody's
>> seen it:
>> 
>>  if (old == new)
>>  break;
>> 
>
> No, we can't do this. The thing could have just been updated by a task
> that is setting the "seen" bit. We don't want to lose the error here. We
> always have to do the cmpxchg on the set_wb_error side, I think.

I don't follow your logic.
If (old == new) then there was a moment since this function started when
performing the cmpxchg() would not have changed the contents of memory.
So let's pretend it did actually happen at that moment, and not change
memory.

If we race with a task setting the "seen" bit, then it will have seen
the error *after* the new error, that this thread is reporting, actually
happened.  So the result is still correct.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-09 Thread NeilBrown

On Fri, Apr 07 2017, Jeff Layton wrote:

>
>> ... and then there's no need to update if it's the same errno and nobody's
>> seen it:
>> 
>>  if (old == new)
>>  break;
>> 
>
> No, we can't do this. The thing could have just been updated by a task
> that is setting the "seen" bit. We don't want to lose the error here. We
> always have to do the cmpxchg on the set_wb_error side, I think.

I don't follow your logic.
If (old == new) then there was a moment since this function started when
performing the cmpxchg() would not have changed the contents of memory.
So let's pretend it did actually happen at that moment, and not change
memory.

If we race with a task setting the "seen" bit, then it will have seen
the error *after* the new error, that this thread is reporting, actually
happened.  So the result is still correct.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-07 Thread Jeff Layton

On Thu, 2017-04-06 at 13:05 -0700, Matthew Wilcox wrote:
> On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote:
> > @@ -868,6 +869,7 @@ struct file {
> > struct list_headf_tfile_llink;
> >  #endif /* #ifdef CONFIG_EPOLL */
> > struct address_space*f_mapping;
> > +   u32 f_wb_err;
> >  } __attribute__((aligned(4))); /* lest something weird decides that 2 
> > is OK */
> >  
> 
> I think we can squeeze that in next to f_flags?
> 

Sure, will do. I meant to look at pahole output and see if there are
existing holes.

> > +/**
> > + * filemap_set_wb_error - set the wb error in the mapping for later 
> > reporting
> > + * @mapping: mapping in which the error should be set
> > + * @err: error to set. must be negative value but not less than -MAX_ERRNO
> 
> Do we want to have users call filemap_set_wb_error(mapping, EIO)
> or filemap_set_wb_error(mapping, -EIO)?  Either way, we can assert
> that it's in the correct range (oh look, we have at least one user of
> mapping_set_error calling it with a positive errno ...)
> 

Yeah, I sent a patch for that a while back but I don't think anyone
picked it up. Luckily that caller is harmless since EIO just ends up in
the default case and gets turned into -EIO.

> I've been playing with positive or negative errnos for the xarray, and
> positive looks better to me, although there's a definite advantage to
> being able to just call filemap_set_wb_error(mapping, result).
> 

That's my main rationale. We generally use negative error codes in the
kernel, so let's do what's easiest for most callsites. I say negative
error codes here.


> #define XAS_ERROR(errno)((struct xa_node *)((errno << 1) | 1))
> 
> static inline int xas_error(const struct xa_state *xas)
> {
> unsigned long v = (unsigned long)xas->xa_node;
> return (v & 1) ? -(v >> 1) : 0;
> }
> 
> static inline void xas_set_err(struct xa_state *xas, unsigned long err)
> {
> XA_BUG_ON(err > MAX_ERRNO);
> xas->xa_node = XAS_ERROR(err);
> }
> 
> > +   /*
> > +* Ensure the error code actually fits where we want it to go. If it
> > +* doesn't then just throw a warning and don't record anything.
> > +*/
> > +   if (unlikely(err > 0 || err < -MAX_ERRNO)) {
> > +   WARN(1, "err=%d\n", err);
> > +   return;
> > +   }
> 
> Cute trick to make this more succinct:
> 
>   if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err)
>   return;
> or even ...
> 
>   if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err)
>   return;
> 

Nice. I always forget that WARN has a return. Will fix.

> > +   /* Clear out error bits and set new error */
> > +   new = (old & ~MAX_ERRNO) | -err;
> > +
> > +   /* Only increment if someone has looked at it */
> > +   if (old & WB_ERR_SEEN) {
> > +   new += WB_ERR_CTR_INC;
> > +   new &= ~WB_ERR_SEEN;
> > +   }
> 
> Although we always want to clear out the SEEN bit if we're updating ... so
> 
>   new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err;
> 
>   /* Only increment if someone has looked at it */
>   if (old & WB_ERR_SEEN)
>   new += WB_ERR_CTR_INC;
> 

Sure, that is more succinct.

> ... and then there's no need to update if it's the same errno and nobody's
> seen it:
> 
>   if (old == new)
>   break;
> 

No, we can't do this. The thing could have just been updated by a task
that is setting the "seen" bit. We don't want to lose the error here. We
always have to do the cmpxchg on the set_wb_error side, I think.

> [...]
> 
> > +   /*
> > +* We always store values with the "seen" bit set, so if this
> > +* matches what we already have, then we can call it done.
> > +* There is nothing to update so just return 0.
> > +*/
> > +   if (old == file->f_wb_err)
> > +   break;
> > +
> > +   /* set flag and try to swap it into place */
> > +   new = old | WB_ERR_SEEN;
> 
> Again, I think we should avoid the cmpxchg with:
> 
>   if (old == new)
>   break;
> 

Yeah, we may be able to do this one. I had myself convinced otherwise
yesterday, but I think you may be right.

> > +   cur = cmpxchg(>wb_err, old, new);
> > +
> > +   /*
> > +* We can quit now if we successfully swapped in the new value
> > +* or someone else beat us to it with the same value that we
> > +* were planning to store.
> > +*/
> > +   if (likely(cur == old || cur == new)) {
> > +   file->f_wb_err = new;
> > +   err = -(new & MAX_ERRNO);
> > +   break;
> > +   }
> > +
> > +   /* Raced with an update, try again */
> > +   old = cur;
> 
> Well ... should we?  We're

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-07 Thread Jeff Layton

On Thu, 2017-04-06 at 13:05 -0700, Matthew Wilcox wrote:
> On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote:
> > @@ -868,6 +869,7 @@ struct file {
> > struct list_headf_tfile_llink;
> >  #endif /* #ifdef CONFIG_EPOLL */
> > struct address_space*f_mapping;
> > +   u32 f_wb_err;
> >  } __attribute__((aligned(4))); /* lest something weird decides that 2 
> > is OK */
> >  
> 
> I think we can squeeze that in next to f_flags?
> 

Sure, will do. I meant to look at pahole output and see if there are
existing holes.

> > +/**
> > + * filemap_set_wb_error - set the wb error in the mapping for later 
> > reporting
> > + * @mapping: mapping in which the error should be set
> > + * @err: error to set. must be negative value but not less than -MAX_ERRNO
> 
> Do we want to have users call filemap_set_wb_error(mapping, EIO)
> or filemap_set_wb_error(mapping, -EIO)?  Either way, we can assert
> that it's in the correct range (oh look, we have at least one user of
> mapping_set_error calling it with a positive errno ...)
> 

Yeah, I sent a patch for that a while back but I don't think anyone
picked it up. Luckily that caller is harmless since EIO just ends up in
the default case and gets turned into -EIO.

> I've been playing with positive or negative errnos for the xarray, and
> positive looks better to me, although there's a definite advantage to
> being able to just call filemap_set_wb_error(mapping, result).
> 

That's my main rationale. We generally use negative error codes in the
kernel, so let's do what's easiest for most callsites. I say negative
error codes here.


> #define XAS_ERROR(errno)((struct xa_node *)((errno << 1) | 1))
> 
> static inline int xas_error(const struct xa_state *xas)
> {
> unsigned long v = (unsigned long)xas->xa_node;
> return (v & 1) ? -(v >> 1) : 0;
> }
> 
> static inline void xas_set_err(struct xa_state *xas, unsigned long err)
> {
> XA_BUG_ON(err > MAX_ERRNO);
> xas->xa_node = XAS_ERROR(err);
> }
> 
> > +   /*
> > +* Ensure the error code actually fits where we want it to go. If it
> > +* doesn't then just throw a warning and don't record anything.
> > +*/
> > +   if (unlikely(err > 0 || err < -MAX_ERRNO)) {
> > +   WARN(1, "err=%d\n", err);
> > +   return;
> > +   }
> 
> Cute trick to make this more succinct:
> 
>   if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err)
>   return;
> or even ...
> 
>   if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err)
>   return;
> 

Nice. I always forget that WARN has a return. Will fix.

> > +   /* Clear out error bits and set new error */
> > +   new = (old & ~MAX_ERRNO) | -err;
> > +
> > +   /* Only increment if someone has looked at it */
> > +   if (old & WB_ERR_SEEN) {
> > +   new += WB_ERR_CTR_INC;
> > +   new &= ~WB_ERR_SEEN;
> > +   }
> 
> Although we always want to clear out the SEEN bit if we're updating ... so
> 
>   new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err;
> 
>   /* Only increment if someone has looked at it */
>   if (old & WB_ERR_SEEN)
>   new += WB_ERR_CTR_INC;
> 

Sure, that is more succinct.

> ... and then there's no need to update if it's the same errno and nobody's
> seen it:
> 
>   if (old == new)
>   break;
> 

No, we can't do this. The thing could have just been updated by a task
that is setting the "seen" bit. We don't want to lose the error here. We
always have to do the cmpxchg on the set_wb_error side, I think.

> [...]
> 
> > +   /*
> > +* We always store values with the "seen" bit set, so if this
> > +* matches what we already have, then we can call it done.
> > +* There is nothing to update so just return 0.
> > +*/
> > +   if (old == file->f_wb_err)
> > +   break;
> > +
> > +   /* set flag and try to swap it into place */
> > +   new = old | WB_ERR_SEEN;
> 
> Again, I think we should avoid the cmpxchg with:
> 
>   if (old == new)
>   break;
> 

Yeah, we may be able to do this one. I had myself convinced otherwise
yesterday, but I think you may be right.

> > +   cur = cmpxchg(>wb_err, old, new);
> > +
> > +   /*
> > +* We can quit now if we successfully swapped in the new value
> > +* or someone else beat us to it with the same value that we
> > +* were planning to store.
> > +*/
> > +   if (likely(cur == old || cur == new)) {
> > +   file->f_wb_err = new;
> > +   err = -(new & MAX_ERRNO);
> > +   break;
> > +   }
> > +
> > +   /* Raced with an update, try again */
> > +   old = cur;
> 
> Well ... should we?  We're

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread NeilBrown

On Thu, Apr 06 2017, Jeff Layton wrote:

>
> I tried to avoid updating things unnecesssarily. I could use some
> guidance on how to specify the constants in terms of MAX_ERRNO as well.

ilog2() defined in include/linux/log2.h

And you have MAX_ERROR in one comment, instead of MAX_ERRNO :-)

>  
> -  flush: called by the close(2) system call to flush a file
> +  flush: called by the close(2) system call to flush a file. Writeback
> + errors not previously reported via fsync should be reported
> + here as you would for fsync.

"could", not "should".  I think it is agreed that this is a good
idea, is it?

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7251f7bb45e8..f33857113ff4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -394,6 +394,7 @@ struct address_space {
>   gfp_t   gfp_mask;   /* implicit gfp mask for 
> allocations */
>   struct list_headprivate_list;   /* ditto */
>   void*private_data;  /* ditto */
> + u32 wb_err;

I would rather this was a wb_err_t or similar, and that the functions
which implement it take a pointer to a wb_err_t.
Then the 'error' in 'struct nfs_open_context' could become a 'wb_err_t',
and nfs could use these functions to do error tracking the way it wants
to.

Thanks - looking good.

NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread NeilBrown

On Thu, Apr 06 2017, Jeff Layton wrote:

>
> I tried to avoid updating things unnecesssarily. I could use some
> guidance on how to specify the constants in terms of MAX_ERRNO as well.

ilog2() defined in include/linux/log2.h

And you have MAX_ERROR in one comment, instead of MAX_ERRNO :-)

>  
> -  flush: called by the close(2) system call to flush a file
> +  flush: called by the close(2) system call to flush a file. Writeback
> + errors not previously reported via fsync should be reported
> + here as you would for fsync.

"could", not "should".  I think it is agreed that this is a good
idea, is it?

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7251f7bb45e8..f33857113ff4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -394,6 +394,7 @@ struct address_space {
>   gfp_t   gfp_mask;   /* implicit gfp mask for 
> allocations */
>   struct list_headprivate_list;   /* ditto */
>   void*private_data;  /* ditto */
> + u32 wb_err;

I would rather this was a wb_err_t or similar, and that the functions
which implement it take a pointer to a wb_err_t.
Then the 'error' in 'struct nfs_open_context' could become a 'wb_err_t',
and nfs could use these functions to do error tracking the way it wants
to.

Thanks - looking good.

NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread NeilBrown

On Thu, Apr 06 2017, Matthew Wilcox wrote:

> On Thu, Apr 06, 2017 at 03:12:42PM +1000, NeilBrown wrote:
>> On Wed, Apr 05 2017, Matthew Wilcox wrote:
>> 
>> > On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
>> >> If you are concerned about space in 'struct address_space', just prune
>> >> some wastage.
>> >
>> > I'm trying to (via wlists).  still buggy though.
>> 
>> Cool.
>> (I wonder what a wlist is weighted list?)
>
> A homonym ;-)  Waitlists.  Yet Another Variation of a doubly linked list.
> This variation has only a single head pointer (like the hlist), allows
> O(1) deletion from any point in the list (but requires that you know the
> head of the list).  The tail of the list is pointed to by head->prev.
>
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/wlist
>
> It also needs updating to the latest tree; WW locks got revamped.

So nothing points to the head, meaning you need both the node and the
head to delete something.  But the head is smaller.
I can see how that would be useful.
Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread NeilBrown

On Thu, Apr 06 2017, Matthew Wilcox wrote:

> On Thu, Apr 06, 2017 at 03:12:42PM +1000, NeilBrown wrote:
>> On Wed, Apr 05 2017, Matthew Wilcox wrote:
>> 
>> > On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
>> >> If you are concerned about space in 'struct address_space', just prune
>> >> some wastage.
>> >
>> > I'm trying to (via wlists).  still buggy though.
>> 
>> Cool.
>> (I wonder what a wlist is weighted list?)
>
> A homonym ;-)  Waitlists.  Yet Another Variation of a doubly linked list.
> This variation has only a single head pointer (like the hlist), allows
> O(1) deletion from any point in the list (but requires that you know the
> head of the list).  The tail of the list is pointed to by head->prev.
>
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/wlist
>
> It also needs updating to the latest tree; WW locks got revamped.

So nothing points to the head, meaning you need both the node and the
head to delete something.  But the head is smaller.
I can see how that would be useful.
Thanks,
NeilBrown


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Matthew Wilcox

On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote:
> @@ -868,6 +869,7 @@ struct file {
>   struct list_headf_tfile_llink;
>  #endif /* #ifdef CONFIG_EPOLL */
>   struct address_space*f_mapping;
> + u32 f_wb_err;
>  } __attribute__((aligned(4)));   /* lest something weird decides that 2 
> is OK */
>  

I think we can squeeze that in next to f_flags?

> +/**
> + * filemap_set_wb_error - set the wb error in the mapping for later reporting
> + * @mapping: mapping in which the error should be set
> + * @err: error to set. must be negative value but not less than -MAX_ERRNO

Do we want to have users call filemap_set_wb_error(mapping, EIO)
or filemap_set_wb_error(mapping, -EIO)?  Either way, we can assert
that it's in the correct range (oh look, we have at least one user of
mapping_set_error calling it with a positive errno ...)

I've been playing with positive or negative errnos for the xarray, and
positive looks better to me, although there's a definite advantage to
being able to just call filemap_set_wb_error(mapping, result).

#define XAS_ERROR(errno)((struct xa_node *)((errno << 1) | 1))

static inline int xas_error(const struct xa_state *xas)
{
unsigned long v = (unsigned long)xas->xa_node;
return (v & 1) ? -(v >> 1) : 0;
}

static inline void xas_set_err(struct xa_state *xas, unsigned long err)
{
XA_BUG_ON(err > MAX_ERRNO);
xas->xa_node = XAS_ERROR(err);
}

> + /*
> +  * Ensure the error code actually fits where we want it to go. If it
> +  * doesn't then just throw a warning and don't record anything.
> +  */
> + if (unlikely(err > 0 || err < -MAX_ERRNO)) {
> + WARN(1, "err=%d\n", err);
> + return;
> + }

Cute trick to make this more succinct:

if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err)
return;
or even ...

if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err)
return;

> + /* Clear out error bits and set new error */
> + new = (old & ~MAX_ERRNO) | -err;
> +
> + /* Only increment if someone has looked at it */
> + if (old & WB_ERR_SEEN) {
> + new += WB_ERR_CTR_INC;
> + new &= ~WB_ERR_SEEN;
> + }

Although we always want to clear out the SEEN bit if we're updating ... so

new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err;

/* Only increment if someone has looked at it */
if (old & WB_ERR_SEEN)
new += WB_ERR_CTR_INC;

... and then there's no need to update if it's the same errno and nobody's
seen it:

if (old == new)
break;

[...]

> + /*
> +  * We always store values with the "seen" bit set, so if this
> +  * matches what we already have, then we can call it done.
> +  * There is nothing to update so just return 0.
> +  */
> + if (old == file->f_wb_err)
> + break;
> +
> + /* set flag and try to swap it into place */
> + new = old | WB_ERR_SEEN;

Again, I think we should avoid the cmpxchg with:

if (old == new)
break;

> + cur = cmpxchg(>wb_err, old, new);
> +
> + /*
> +  * We can quit now if we successfully swapped in the new value
> +  * or someone else beat us to it with the same value that we
> +  * were planning to store.
> +  */
> + if (likely(cur == old || cur == new)) {
> + file->f_wb_err = new;
> + err = -(new & MAX_ERRNO);
> + break;
> + }
> +
> + /* Raced with an update, try again */
> + old = cur;

Well ... should we?  We're returning an error which is new to this fd anyway.
Do we want to return the most recent error by a nanosecond, or should we
return the previous one and then see this one next time we call fsync()?

I'd lean towards not looping here; not even looking at 'cur'.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Matthew Wilcox

On Thu, Apr 06, 2017 at 03:14:52PM -0400, Jeff Layton wrote:
> @@ -868,6 +869,7 @@ struct file {
>   struct list_headf_tfile_llink;
>  #endif /* #ifdef CONFIG_EPOLL */
>   struct address_space*f_mapping;
> + u32 f_wb_err;
>  } __attribute__((aligned(4)));   /* lest something weird decides that 2 
> is OK */
>  

I think we can squeeze that in next to f_flags?

> +/**
> + * filemap_set_wb_error - set the wb error in the mapping for later reporting
> + * @mapping: mapping in which the error should be set
> + * @err: error to set. must be negative value but not less than -MAX_ERRNO

Do we want to have users call filemap_set_wb_error(mapping, EIO)
or filemap_set_wb_error(mapping, -EIO)?  Either way, we can assert
that it's in the correct range (oh look, we have at least one user of
mapping_set_error calling it with a positive errno ...)

I've been playing with positive or negative errnos for the xarray, and
positive looks better to me, although there's a definite advantage to
being able to just call filemap_set_wb_error(mapping, result).

#define XAS_ERROR(errno)((struct xa_node *)((errno << 1) | 1))

static inline int xas_error(const struct xa_state *xas)
{
unsigned long v = (unsigned long)xas->xa_node;
return (v & 1) ? -(v >> 1) : 0;
}

static inline void xas_set_err(struct xa_state *xas, unsigned long err)
{
XA_BUG_ON(err > MAX_ERRNO);
xas->xa_node = XAS_ERROR(err);
}

> + /*
> +  * Ensure the error code actually fits where we want it to go. If it
> +  * doesn't then just throw a warning and don't record anything.
> +  */
> + if (unlikely(err > 0 || err < -MAX_ERRNO)) {
> + WARN(1, "err=%d\n", err);
> + return;
> + }

Cute trick to make this more succinct:

if (WARN(err > 0 || err < -MAX_ERRNO), "err = %d\n", err)
return;
or even ...

if (WARN((unsigned int)-err > MAX_ERRNO), "err = %d\n", err)
return;

> + /* Clear out error bits and set new error */
> + new = (old & ~MAX_ERRNO) | -err;
> +
> + /* Only increment if someone has looked at it */
> + if (old & WB_ERR_SEEN) {
> + new += WB_ERR_CTR_INC;
> + new &= ~WB_ERR_SEEN;
> + }

Although we always want to clear out the SEEN bit if we're updating ... so

new = (old & ~(MAX_ERRNO | WB_ERR_SEEN) | -err;

/* Only increment if someone has looked at it */
if (old & WB_ERR_SEEN)
new += WB_ERR_CTR_INC;

... and then there's no need to update if it's the same errno and nobody's
seen it:

if (old == new)
break;

[...]

> + /*
> +  * We always store values with the "seen" bit set, so if this
> +  * matches what we already have, then we can call it done.
> +  * There is nothing to update so just return 0.
> +  */
> + if (old == file->f_wb_err)
> + break;
> +
> + /* set flag and try to swap it into place */
> + new = old | WB_ERR_SEEN;

Again, I think we should avoid the cmpxchg with:

if (old == new)
break;

> + cur = cmpxchg(>wb_err, old, new);
> +
> + /*
> +  * We can quit now if we successfully swapped in the new value
> +  * or someone else beat us to it with the same value that we
> +  * were planning to store.
> +  */
> + if (likely(cur == old || cur == new)) {
> + file->f_wb_err = new;
> + err = -(new & MAX_ERRNO);
> + break;
> + }
> +
> + /* Raced with an update, try again */
> + old = cur;

Well ... should we?  We're returning an error which is new to this fd anyway.
Do we want to return the most recent error by a nanosecond, or should we
return the previous one and then see this one next time we call fsync()?

I'd lean towards not looping here; not even looking at 'cur'.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Jeff Layton

On Thu, 2017-04-06 at 10:02 +1000, NeilBrown wrote:
> > 

> On Thu, Apr 06 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > > > That said, I think giving more specific errors where we can is useful.
> > > > When your program is erroring out and writing 'I/O error' to the logs,
> > > > then how much time will your admins burn before they figure out that it
> > > > really failed because the filesystem was full?
> > > 
> > > df is one of the first things I check ... a few years ago, I also learned
> > > to check df -i ... ;-)
> > > 
> > > Anyway, given the decision to simply report the last error lets us do this
> > > implementation:
> > > 
> > > void filemap_set_wb_error(struct address_space *mapping, int err)
> > > {
> > >   struct inode *inode = mapping->host;
> > >   unsigned int wb_err;
> > > 
> > >   if (!err)
> > >   return;
> > >   /*
> > >* This should be called with the error code that we want to return
> > >* on fsync. Thus, it should always be <= 0.
> > >*/
> > >   WARN_ON(err > 0 || err < -MAX_ERRNO);
> > > 
> > >   spin_lock(>i_lock);
> > >   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> > >   WRITE_ONCE(mapping->wb_err, wb_err);
> > >   spin_unlock(>i_lock);
> > > }
> > > 
> > 
> > I like this idea of being able to store arbitrary error codes there.
> > That should be used judiciously of course, but we already allow
> > returning arbitrary errors via the ->fsync op anyway.
> > 
> > I'll plan to incorporate something like that into the next set (with
> > judicious comments and constants).
> > 
> > One question...is the i_lock the right way to protect this? I think we
> > could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> > worried about performance here -- it's just nice to be able to call
> > simple stuff like this without worrying about locking.
> 
> I like the idea of using cmpxchg.
> 
> 
> > 
> > > int filemap_report_wb_error(struct file *file)
> > > {
> > >   struct inode *inode = file_inode(file);
> > >   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> > > 
> > >   if (file->f_wb_err == wb_err)
> > >   return 0;
> > >   return -(wb_err & 4095);
> > > }
> > > 
> > > That only gives us 20 bits of counter, but I think that's enough.
> > 
> > 2^20 is 1048576, which seems a little small to me.
> > 
> > We may end up bumping the counter on every failed I/O. How fast can we
> > generate 1M failed I/Os? :)
> 
> Do we need to count all of those if no-one sees them?
> i.e. use one bit to say "this error hasn't been seen".
> If an error occurs with has the name error code as is currently stored,
> and the bit is set, don't make a change.  Otherwise make the change,
> inc the counter, set the bit.
> When checking for an error, if the bit is set, clear it first.
> Then you can count 500,000 errors-returned-to-some-thread, which is
> probably enough.
> 

Ok, so here's a replacement for patch #1. The other 3 are pretty much
the same. The main changes are:

- 32 bit value:
  - 12 bits for error code
  - 1 bit for "seen" flag
  - 19 bits for the counter
- mapping->wb_err is managed with cmpxchg
- file->f_wb_err is protected with file->f_lock

I tried to avoid updating things unnecesssarily. I could use some
guidance on how to specify the constants in terms of MAX_ERRNO as well.

It seems to work, in very basic by-hand testing.

If this looks reasonable, I may try again to plug this in at a higher
level, so we don't need to change so much filesystem code. IOW:

- make filemap_set_wb_error the new implementation of mapping_set_error
- have vfs_fsync_range call filemap_report_wb_error, and return what it
  returns if it's non-zero
- have filemap_check_error grab the current error code without updating
  the counter or the seen flag

That approach may not work, but I'll see. Anyway, here's the updated
patch. I may need to revise the changelog too.

--8<-

[PATCH] fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, how do I know that
was because writeback failed?

This patch adds a small bit of new infrastructure for setting and
reporting errors during pagecache writeback. While the above was my

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Jeff Layton

On Thu, 2017-04-06 at 10:02 +1000, NeilBrown wrote:
> > 

> On Thu, Apr 06 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > > > That said, I think giving more specific errors where we can is useful.
> > > > When your program is erroring out and writing 'I/O error' to the logs,
> > > > then how much time will your admins burn before they figure out that it
> > > > really failed because the filesystem was full?
> > > 
> > > df is one of the first things I check ... a few years ago, I also learned
> > > to check df -i ... ;-)
> > > 
> > > Anyway, given the decision to simply report the last error lets us do this
> > > implementation:
> > > 
> > > void filemap_set_wb_error(struct address_space *mapping, int err)
> > > {
> > >   struct inode *inode = mapping->host;
> > >   unsigned int wb_err;
> > > 
> > >   if (!err)
> > >   return;
> > >   /*
> > >* This should be called with the error code that we want to return
> > >* on fsync. Thus, it should always be <= 0.
> > >*/
> > >   WARN_ON(err > 0 || err < -MAX_ERRNO);
> > > 
> > >   spin_lock(>i_lock);
> > >   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> > >   WRITE_ONCE(mapping->wb_err, wb_err);
> > >   spin_unlock(>i_lock);
> > > }
> > > 
> > 
> > I like this idea of being able to store arbitrary error codes there.
> > That should be used judiciously of course, but we already allow
> > returning arbitrary errors via the ->fsync op anyway.
> > 
> > I'll plan to incorporate something like that into the next set (with
> > judicious comments and constants).
> > 
> > One question...is the i_lock the right way to protect this? I think we
> > could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> > worried about performance here -- it's just nice to be able to call
> > simple stuff like this without worrying about locking.
> 
> I like the idea of using cmpxchg.
> 
> 
> > 
> > > int filemap_report_wb_error(struct file *file)
> > > {
> > >   struct inode *inode = file_inode(file);
> > >   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> > > 
> > >   if (file->f_wb_err == wb_err)
> > >   return 0;
> > >   return -(wb_err & 4095);
> > > }
> > > 
> > > That only gives us 20 bits of counter, but I think that's enough.
> > 
> > 2^20 is 1048576, which seems a little small to me.
> > 
> > We may end up bumping the counter on every failed I/O. How fast can we
> > generate 1M failed I/Os? :)
> 
> Do we need to count all of those if no-one sees them?
> i.e. use one bit to say "this error hasn't been seen".
> If an error occurs with has the name error code as is currently stored,
> and the bit is set, don't make a change.  Otherwise make the change,
> inc the counter, set the bit.
> When checking for an error, if the bit is set, clear it first.
> Then you can count 500,000 errors-returned-to-some-thread, which is
> probably enough.
> 

Ok, so here's a replacement for patch #1. The other 3 are pretty much
the same. The main changes are:

- 32 bit value:
  - 12 bits for error code
  - 1 bit for "seen" flag
  - 19 bits for the counter
- mapping->wb_err is managed with cmpxchg
- file->f_wb_err is protected with file->f_lock

I tried to avoid updating things unnecesssarily. I could use some
guidance on how to specify the constants in terms of MAX_ERRNO as well.

It seems to work, in very basic by-hand testing.

If this looks reasonable, I may try again to plug this in at a higher
level, so we don't need to change so much filesystem code. IOW:

- make filemap_set_wb_error the new implementation of mapping_set_error
- have vfs_fsync_range call filemap_report_wb_error, and return what it
  returns if it's non-zero
- have filemap_check_error grab the current error code without updating
  the counter or the seen flag

That approach may not work, but I'll see. Anyway, here's the updated
patch. I may need to revise the changelog too.

--8<-

[PATCH] fs: new infrastructure for writeback error handling and reporting

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

It's those non-fsync callers that are problematic. We should be
reporting writeback errors during fsync, but many places in the code
clear out errors before they can be properly reported, or report errors
at nonsensical times. If I get -EIO on a stat() call, how do I know that
was because writeback failed?

This patch adds a small bit of new infrastructure for setting and
reporting errors during pagecache writeback. While the above was my

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Jeff Layton

On Thu, 2017-04-06 at 10:02 +1000, NeilBrown wrote:
> On Thu, Apr 06 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > > > That said, I think giving more specific errors where we can is useful.
> > > > When your program is erroring out and writing 'I/O error' to the logs,
> > > > then how much time will your admins burn before they figure out that it
> > > > really failed because the filesystem was full?
> > > 
> > > df is one of the first things I check ... a few years ago, I also learned
> > > to check df -i ... ;-)
> > > 
> > > Anyway, given the decision to simply report the last error lets us do this
> > > implementation:
> > > 
> > > void filemap_set_wb_error(struct address_space *mapping, int err)
> > > {
> > >   struct inode *inode = mapping->host;
> > >   unsigned int wb_err;
> > > 
> > >   if (!err)
> > >   return;
> > >   /*
> > >* This should be called with the error code that we want to return
> > >* on fsync. Thus, it should always be <= 0.
> > >*/
> > >   WARN_ON(err > 0 || err < -MAX_ERRNO);
> > > 
> > >   spin_lock(>i_lock);
> > >   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> > >   WRITE_ONCE(mapping->wb_err, wb_err);
> > >   spin_unlock(>i_lock);
> > > }
> > > 
> > 
> > I like this idea of being able to store arbitrary error codes there.
> > That should be used judiciously of course, but we already allow
> > returning arbitrary errors via the ->fsync op anyway.
> > 
> > I'll plan to incorporate something like that into the next set (with
> > judicious comments and constants).
> > 
> > One question...is the i_lock the right way to protect this? I think we
> > could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> > worried about performance here -- it's just nice to be able to call
> > simple stuff like this without worrying about locking.
> 
> I like the idea of using cmpxchg.
> 
> 
> > 
> > > int filemap_report_wb_error(struct file *file)
> > > {
> > >   struct inode *inode = file_inode(file);
> > >   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> > > 
> > >   if (file->f_wb_err == wb_err)
> > >   return 0;
> > >   return -(wb_err & 4095);
> > > }
> > > 
> > > That only gives us 20 bits of counter, but I think that's enough.
> > 
> > 2^20 is 1048576, which seems a little small to me.
> > 
> > We may end up bumping the counter on every failed I/O. How fast can we
> > generate 1M failed I/Os? :)
> 
> Do we need to count all of those if no-one sees them?
> i.e. use one bit to say "this error hasn't been seen".
> If an error occurs with has the name error code as is currently stored,
> and the bit is set, don't make a change.  Otherwise make the change,
> inc the counter, set the bit.
> When checking for an error, if the bit is set, clear it first.
> Then you can count 500,000 errors-returned-to-some-thread, which is
> probably enough.
> 

Yeah, that seems like it might be a good idea if we want to stick to a
small value here.

> > 
> > 2^52 however is 4503599627370496 (4Tios or so) ... that might take a
> > little longer to overflow. Is it worth the cost here to ensure that
> > this won't occur?
> > 
> > Actually...we could put this field in the inode instead of the mapping.
> > I know we've traditionally tracked this in the mapping, but is that
> > required here?
> 
> What if the address_space is shared by two inodes?  That is the whole
> point of the i_mapping pointer.  This would make it harder for the
> "other" inode to get the error.
> (Does anyone actually use fs/coda ??
>  Actually, block devices use i_mapping too.
>  If two block device inodes have the same major/minor number, they
>  end up having i_mapping point to the same place)
> 

Ahh ok, that makes sense. I'll plan to keep it part of the mapping.

> If you are concerned about space in 'struct address_space', just prune
> some wastage.
> The "host" field brings no value.  It is only ever assigned in
> inode_init_always():
> 
>   struct address_space *const mapping = >i_data;
> ..
>   mapping->host = inode;
> 
> So you could change all references to use
>container_of(mapping, struct inode, i_data)
> 

That might be a nice cleanup, but I think I'll leave that to be done
separately.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Jeff Layton

On Thu, 2017-04-06 at 10:02 +1000, NeilBrown wrote:
> On Thu, Apr 06 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > > > That said, I think giving more specific errors where we can is useful.
> > > > When your program is erroring out and writing 'I/O error' to the logs,
> > > > then how much time will your admins burn before they figure out that it
> > > > really failed because the filesystem was full?
> > > 
> > > df is one of the first things I check ... a few years ago, I also learned
> > > to check df -i ... ;-)
> > > 
> > > Anyway, given the decision to simply report the last error lets us do this
> > > implementation:
> > > 
> > > void filemap_set_wb_error(struct address_space *mapping, int err)
> > > {
> > >   struct inode *inode = mapping->host;
> > >   unsigned int wb_err;
> > > 
> > >   if (!err)
> > >   return;
> > >   /*
> > >* This should be called with the error code that we want to return
> > >* on fsync. Thus, it should always be <= 0.
> > >*/
> > >   WARN_ON(err > 0 || err < -MAX_ERRNO);
> > > 
> > >   spin_lock(>i_lock);
> > >   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
> > >   WRITE_ONCE(mapping->wb_err, wb_err);
> > >   spin_unlock(>i_lock);
> > > }
> > > 
> > 
> > I like this idea of being able to store arbitrary error codes there.
> > That should be used judiciously of course, but we already allow
> > returning arbitrary errors via the ->fsync op anyway.
> > 
> > I'll plan to incorporate something like that into the next set (with
> > judicious comments and constants).
> > 
> > One question...is the i_lock the right way to protect this? I think we
> > could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> > worried about performance here -- it's just nice to be able to call
> > simple stuff like this without worrying about locking.
> 
> I like the idea of using cmpxchg.
> 
> 
> > 
> > > int filemap_report_wb_error(struct file *file)
> > > {
> > >   struct inode *inode = file_inode(file);
> > >   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> > > 
> > >   if (file->f_wb_err == wb_err)
> > >   return 0;
> > >   return -(wb_err & 4095);
> > > }
> > > 
> > > That only gives us 20 bits of counter, but I think that's enough.
> > 
> > 2^20 is 1048576, which seems a little small to me.
> > 
> > We may end up bumping the counter on every failed I/O. How fast can we
> > generate 1M failed I/Os? :)
> 
> Do we need to count all of those if no-one sees them?
> i.e. use one bit to say "this error hasn't been seen".
> If an error occurs with has the name error code as is currently stored,
> and the bit is set, don't make a change.  Otherwise make the change,
> inc the counter, set the bit.
> When checking for an error, if the bit is set, clear it first.
> Then you can count 500,000 errors-returned-to-some-thread, which is
> probably enough.
> 

Yeah, that seems like it might be a good idea if we want to stick to a
small value here.

> > 
> > 2^52 however is 4503599627370496 (4Tios or so) ... that might take a
> > little longer to overflow. Is it worth the cost here to ensure that
> > this won't occur?
> > 
> > Actually...we could put this field in the inode instead of the mapping.
> > I know we've traditionally tracked this in the mapping, but is that
> > required here?
> 
> What if the address_space is shared by two inodes?  That is the whole
> point of the i_mapping pointer.  This would make it harder for the
> "other" inode to get the error.
> (Does anyone actually use fs/coda ??
>  Actually, block devices use i_mapping too.
>  If two block device inodes have the same major/minor number, they
>  end up having i_mapping point to the same place)
> 

Ahh ok, that makes sense. I'll plan to keep it part of the mapping.

> If you are concerned about space in 'struct address_space', just prune
> some wastage.
> The "host" field brings no value.  It is only ever assigned in
> inode_init_always():
> 
>   struct address_space *const mapping = >i_data;
> ..
>   mapping->host = inode;
> 
> So you could change all references to use
>container_of(mapping, struct inode, i_data)
> 

That might be a nice cleanup, but I think I'll leave that to be done
separately.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Matthew Wilcox

On Thu, Apr 06, 2017 at 03:12:42PM +1000, NeilBrown wrote:
> On Wed, Apr 05 2017, Matthew Wilcox wrote:
> 
> > On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
> >> If you are concerned about space in 'struct address_space', just prune
> >> some wastage.
> >
> > I'm trying to (via wlists).  still buggy though.
> 
> Cool.
> (I wonder what a wlist is weighted list?)

A homonym ;-)  Waitlists.  Yet Another Variation of a doubly linked list.
This variation has only a single head pointer (like the hlist), allows
O(1) deletion from any point in the list (but requires that you know the
head of the list).  The tail of the list is pointed to by head->prev.

http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/wlist

It also needs updating to the latest tree; WW locks got revamped.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-06 Thread Matthew Wilcox

On Thu, Apr 06, 2017 at 03:12:42PM +1000, NeilBrown wrote:
> On Wed, Apr 05 2017, Matthew Wilcox wrote:
> 
> > On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
> >> If you are concerned about space in 'struct address_space', just prune
> >> some wastage.
> >
> > I'm trying to (via wlists).  still buggy though.
> 
> Cool.
> (I wonder what a wlist is weighted list?)

A homonym ;-)  Waitlists.  Yet Another Variation of a doubly linked list.
This variation has only a single head pointer (like the hlist), allows
O(1) deletion from any point in the list (but requires that you know the
head of the list).  The tail of the list is pointed to by head->prev.

http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/wlist

It also needs updating to the latest tree; WW locks got revamped.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Wed, Apr 05 2017, Matthew Wilcox wrote:

> On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
>> If you are concerned about space in 'struct address_space', just prune
>> some wastage.
>
> I'm trying to (via wlists).  still buggy though.

Cool.
(I wonder what a wlist is weighted list?)

>
>> The "host" field brings no value.  It is only ever assigned in
>> inode_init_always():
>> 
>>  struct address_space *const mapping = >i_data;
>> ..
>>  mapping->host = inode;
>> 
>> So you could change all references to use
>>container_of(mapping, struct inode, i_data)
>
> Alas, no:
>
> drivers/dax/dax.c:  inode->i_mapping->host = dax_dev->inode;

inode->i_mapping = dax_dev->inode->i_mapping;
inode->i_mapping->host = dax_dev->inode;
so that second line is equivalent to
dax_dev->inode->i_mapping->host = dax_dev->inode;
so inode->mapping->host leads back to inode.  So this doesn't break the
invariant.

> fs/gfs2/glock.c:mapping->host = s->s_bdev->bd_inode;
> fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;

Hmm.. that's weird.  I cannot quite follow what is happening there.
It creates an address-space for metadata which doesn't have a real
inode, and borrows bits of the bdev inode ... possibly just to be able
to find the blocksize deep in buffer.c or similar.
I suspect that using an 'inode' instead of a 'mapping' would make the
code clearer.

> fs/nilfs2/page.c:   mapping->host = inode;

A nilfs inode is allocated with 2 address spaces,
one for the data and one for btree indexing metadata.
And then there are a couple of extra address spaces for the
global metadata-file (mtd).

I wonder what the ->host pointer is actually used for.
buffer.c uses it:
 - to mark the inode 'dirty' when the page is marked dirty
 - to find the blocksize of the inode, for creating buffer_heads
 - find the size of the mapping (i_size)

I could probably argue that the 'dirty' flag (at least for the data) and
the size really belong in the address_space, not in the inode.
The blocksize, I'm less sure of.

I suspect gfs2 and nilfs2 could be changed to allocate a separate inode
(instead of address_space), or to not make use of the ->host pointer.
It would be more work than I at first thought though.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Wed, Apr 05 2017, Matthew Wilcox wrote:

> On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
>> If you are concerned about space in 'struct address_space', just prune
>> some wastage.
>
> I'm trying to (via wlists).  still buggy though.

Cool.
(I wonder what a wlist is weighted list?)

>
>> The "host" field brings no value.  It is only ever assigned in
>> inode_init_always():
>> 
>>  struct address_space *const mapping = >i_data;
>> ..
>>  mapping->host = inode;
>> 
>> So you could change all references to use
>>container_of(mapping, struct inode, i_data)
>
> Alas, no:
>
> drivers/dax/dax.c:  inode->i_mapping->host = dax_dev->inode;

inode->i_mapping = dax_dev->inode->i_mapping;
inode->i_mapping->host = dax_dev->inode;
so that second line is equivalent to
dax_dev->inode->i_mapping->host = dax_dev->inode;
so inode->mapping->host leads back to inode.  So this doesn't break the
invariant.

> fs/gfs2/glock.c:mapping->host = s->s_bdev->bd_inode;
> fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;

Hmm.. that's weird.  I cannot quite follow what is happening there.
It creates an address-space for metadata which doesn't have a real
inode, and borrows bits of the bdev inode ... possibly just to be able
to find the blocksize deep in buffer.c or similar.
I suspect that using an 'inode' instead of a 'mapping' would make the
code clearer.

> fs/nilfs2/page.c:   mapping->host = inode;

A nilfs inode is allocated with 2 address spaces,
one for the data and one for btree indexing metadata.
And then there are a couple of extra address spaces for the
global metadata-file (mtd).

I wonder what the ->host pointer is actually used for.
buffer.c uses it:
 - to mark the inode 'dirty' when the page is marked dirty
 - to find the blocksize of the inode, for creating buffer_heads
 - find the size of the mapping (i_size)

I could probably argue that the 'dirty' flag (at least for the data) and
the size really belong in the address_space, not in the inode.
The blocksize, I'm less sure of.

I suspect gfs2 and nilfs2 could be changed to allocate a separate inode
(instead of address_space), or to not make use of the ->host pointer.
It would be more work than I at first thought though.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Matthew Wilcox

On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
> If you are concerned about space in 'struct address_space', just prune
> some wastage.

I'm trying to (via wlists).  still buggy though.

> The "host" field brings no value.  It is only ever assigned in
> inode_init_always():
> 
>   struct address_space *const mapping = >i_data;
> ..
>   mapping->host = inode;
> 
> So you could change all references to use
>container_of(mapping, struct inode, i_data)

Alas, no:

drivers/dax/dax.c:  inode->i_mapping->host = dax_dev->inode;
fs/gfs2/glock.c:mapping->host = s->s_bdev->bd_inode;
fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;
fs/nilfs2/page.c:   mapping->host = inode;

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Matthew Wilcox

On Thu, Apr 06, 2017 at 10:02:48AM +1000, NeilBrown wrote:
> If you are concerned about space in 'struct address_space', just prune
> some wastage.

I'm trying to (via wlists).  still buggy though.

> The "host" field brings no value.  It is only ever assigned in
> inode_init_always():
> 
>   struct address_space *const mapping = >i_data;
> ..
>   mapping->host = inode;
> 
> So you could change all references to use
>container_of(mapping, struct inode, i_data)

Alas, no:

drivers/dax/dax.c:  inode->i_mapping->host = dax_dev->inode;
fs/gfs2/glock.c:mapping->host = s->s_bdev->bd_inode;
fs/gfs2/ops_fstype.c:   mapping->host = sb->s_bdev->bd_inode;
fs/nilfs2/page.c:   mapping->host = inode;

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Wed, Apr 05 2017, Jeff Layton wrote:

>> 
>> O_DIRECT write() can get an EIO from a previous write-back write to the
>> same file.  Maybe non-O_DIRECT writes should too?
>> 
>
> Some already do this for buffered writes.
>
> This is really a philosophical question, IMO...is it correct to return
> an error on a write call, due to writeback failing previously or during
> the write call, quite possibly to a range that the write call does not
> touch? I can see an argument either way for this.

I like the "we already do" argument.

>
> Also, if we do think that returning an error on the write is the right
> thing to do, should that error advance the sequence counter in the
> struct file, such that an fsync afterward gets back 0? My feeling here
> is that fsync should still report an error after a failed write, but
> maybe that's wrong?

My first thought was that one the error has been returned to any
syscall on a given fd, it has been returned.  Once is enough.
My second thought was that maybe your feeling is right.  Having a well
defined error-return point in fsync feels like a nice design.
My third thought was that this would mean either
 - write continues to fail until fsync is called (probably bad), or
 - we need two counters per "struct file", one for fsync, one for write.
   I don't like that much.

So I'm going back to my first thought.

Thanks,
NeilBrown


>
> This is certainly one area where switching to synchronous writes on
> error would make things a little simpler.
> -- 
> Jeff Layton 


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Wed, Apr 05 2017, Jeff Layton wrote:

>> 
>> O_DIRECT write() can get an EIO from a previous write-back write to the
>> same file.  Maybe non-O_DIRECT writes should too?
>> 
>
> Some already do this for buffered writes.
>
> This is really a philosophical question, IMO...is it correct to return
> an error on a write call, due to writeback failing previously or during
> the write call, quite possibly to a range that the write call does not
> touch? I can see an argument either way for this.

I like the "we already do" argument.

>
> Also, if we do think that returning an error on the write is the right
> thing to do, should that error advance the sequence counter in the
> struct file, such that an fsync afterward gets back 0? My feeling here
> is that fsync should still report an error after a failed write, but
> maybe that's wrong?

My first thought was that one the error has been returned to any
syscall on a given fd, it has been returned.  Once is enough.
My second thought was that maybe your feeling is right.  Having a well
defined error-return point in fsync feels like a nice design.
My third thought was that this would mean either
 - write continues to fail until fsync is called (probably bad), or
 - we need two counters per "struct file", one for fsync, one for write.
   I don't like that much.

So I'm going back to my first thought.

Thanks,
NeilBrown


>
> This is certainly one area where switching to synchronous writes on
> error would make things a little simpler.
> -- 
> Jeff Layton 


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Wed, Apr 05 2017, Matthew Wilcox wrote:

> On Wed, Apr 05, 2017 at 03:49:52PM -0400, Jeff Layton wrote:
>> > That only gives us 20 bits of counter, but I think that's enough.
>> 
>> 2^20 is 1048576, which seems a little small to me.
>> 
>> We may end up bumping the counter on every failed I/O. How fast can we
>> generate 1M failed I/Os? :)
>
> So there's a one-in-a-million chance of missing a failed I/O ... if
> we're generating lots of errors, the next time the app calls fsync(),
> it'll notice the other million times we've hit the problem :-)
>
>> Actually...we could put this field in the inode instead of the mapping.
>> I know we've traditionally tracked this in the mapping, but is that
>> required here?
>> 
>> If we put this field in the inode then perhaps we can union it with
>> something and mitigate the cost of a larger counter...maybe in the
>> i_pipe union? I don't think S_ISREG inodes use anything in there, do
>> they?
>
> But writeback isn't just done on ISREG inodes, but also on S_ISBLK inodes,
> which use i_bdev (right?)
>
> Another possibility is to move this out of the address_space and into
> either the super_block or the backing_device_info.  Errors don't tend
> to be constrained to a single file but affect the entire filesystem,
> or even multiple filesystems if you have a partitioned block device ...

EDQUOT.  Remember EDQUOT.  It certainly don't affect the whole
filesystem.
Even without that, filesystems can easily treat different files
differently.  We shouldn't assume one-failes-all-fail.

NeilBrown


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Wed, Apr 05 2017, Matthew Wilcox wrote:

> On Wed, Apr 05, 2017 at 03:49:52PM -0400, Jeff Layton wrote:
>> > That only gives us 20 bits of counter, but I think that's enough.
>> 
>> 2^20 is 1048576, which seems a little small to me.
>> 
>> We may end up bumping the counter on every failed I/O. How fast can we
>> generate 1M failed I/Os? :)
>
> So there's a one-in-a-million chance of missing a failed I/O ... if
> we're generating lots of errors, the next time the app calls fsync(),
> it'll notice the other million times we've hit the problem :-)
>
>> Actually...we could put this field in the inode instead of the mapping.
>> I know we've traditionally tracked this in the mapping, but is that
>> required here?
>> 
>> If we put this field in the inode then perhaps we can union it with
>> something and mitigate the cost of a larger counter...maybe in the
>> i_pipe union? I don't think S_ISREG inodes use anything in there, do
>> they?
>
> But writeback isn't just done on ISREG inodes, but also on S_ISBLK inodes,
> which use i_bdev (right?)
>
> Another possibility is to move this out of the address_space and into
> either the super_block or the backing_device_info.  Errors don't tend
> to be constrained to a single file but affect the entire filesystem,
> or even multiple filesystems if you have a partitioned block device ...

EDQUOT.  Remember EDQUOT.  It certainly don't affect the whole
filesystem.
Even without that, filesystems can easily treat different files
differently.  We shouldn't assume one-failes-all-fail.

NeilBrown


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Thu, Apr 06 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
>> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
>> > That said, I think giving more specific errors where we can is useful.
>> > When your program is erroring out and writing 'I/O error' to the logs,
>> > then how much time will your admins burn before they figure out that it
>> > really failed because the filesystem was full?
>> 
>> df is one of the first things I check ... a few years ago, I also learned
>> to check df -i ... ;-)
>> 
>> Anyway, given the decision to simply report the last error lets us do this
>> implementation:
>> 
>> void filemap_set_wb_error(struct address_space *mapping, int err)
>> {
>>  struct inode *inode = mapping->host;
>>  unsigned int wb_err;
>> 
>>  if (!err)
>>  return;
>>  /*
>>   * This should be called with the error code that we want to return
>>   * on fsync. Thus, it should always be <= 0.
>>   */
>>  WARN_ON(err > 0 || err < -MAX_ERRNO);
>> 
>>  spin_lock(>i_lock);
>>  wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>>  WRITE_ONCE(mapping->wb_err, wb_err);
>>  spin_unlock(>i_lock);
>> }
>> 
>
> I like this idea of being able to store arbitrary error codes there.
> That should be used judiciously of course, but we already allow
> returning arbitrary errors via the ->fsync op anyway.
>
> I'll plan to incorporate something like that into the next set (with
> judicious comments and constants).
>
> One question...is the i_lock the right way to protect this? I think we
> could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> worried about performance here -- it's just nice to be able to call
> simple stuff like this without worrying about locking.

I like the idea of using cmpxchg.


>
>> int filemap_report_wb_error(struct file *file)
>> {
>>  struct inode *inode = file_inode(file);
>>  unsigned int wb_err = READ_ONCE(mapping->wb_err);
>> 
>>  if (file->f_wb_err == wb_err)
>>  return 0;
>>  return -(wb_err & 4095);
>> }
>> 
>> That only gives us 20 bits of counter, but I think that's enough.
>
> 2^20 is 1048576, which seems a little small to me.
>
> We may end up bumping the counter on every failed I/O. How fast can we
> generate 1M failed I/Os? :)

Do we need to count all of those if no-one sees them?
i.e. use one bit to say "this error hasn't been seen".
If an error occurs with has the name error code as is currently stored,
and the bit is set, don't make a change.  Otherwise make the change,
inc the counter, set the bit.
When checking for an error, if the bit is set, clear it first.
Then you can count 500,000 errors-returned-to-some-thread, which is
probably enough.

>
> 2^52 however is 4503599627370496 (4Tios or so) ... that might take a
> little longer to overflow. Is it worth the cost here to ensure that
> this won't occur?
>
> Actually...we could put this field in the inode instead of the mapping.
> I know we've traditionally tracked this in the mapping, but is that
> required here?

What if the address_space is shared by two inodes?  That is the whole
point of the i_mapping pointer.  This would make it harder for the
"other" inode to get the error.
(Does anyone actually use fs/coda ??
 Actually, block devices use i_mapping too.
 If two block device inodes have the same major/minor number, they
 end up having i_mapping point to the same place)

If you are concerned about space in 'struct address_space', just prune
some wastage.
The "host" field brings no value.  It is only ever assigned in
inode_init_always():

struct address_space *const mapping = >i_data;
..
mapping->host = inode;

So you could change all references to use
   container_of(mapping, struct inode, i_data)

NeilBrown



>
> If we put this field in the inode then perhaps we can union it with
> something and mitigate the cost of a larger counter...maybe in the
> i_pipe union? I don't think S_ISREG inodes use anything in there, do
> they?
> -- 
> Jeff Layton 


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread NeilBrown

On Thu, Apr 06 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
>> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
>> > That said, I think giving more specific errors where we can is useful.
>> > When your program is erroring out and writing 'I/O error' to the logs,
>> > then how much time will your admins burn before they figure out that it
>> > really failed because the filesystem was full?
>> 
>> df is one of the first things I check ... a few years ago, I also learned
>> to check df -i ... ;-)
>> 
>> Anyway, given the decision to simply report the last error lets us do this
>> implementation:
>> 
>> void filemap_set_wb_error(struct address_space *mapping, int err)
>> {
>>  struct inode *inode = mapping->host;
>>  unsigned int wb_err;
>> 
>>  if (!err)
>>  return;
>>  /*
>>   * This should be called with the error code that we want to return
>>   * on fsync. Thus, it should always be <= 0.
>>   */
>>  WARN_ON(err > 0 || err < -MAX_ERRNO);
>> 
>>  spin_lock(>i_lock);
>>  wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>>  WRITE_ONCE(mapping->wb_err, wb_err);
>>  spin_unlock(>i_lock);
>> }
>> 
>
> I like this idea of being able to store arbitrary error codes there.
> That should be used judiciously of course, but we already allow
> returning arbitrary errors via the ->fsync op anyway.
>
> I'll plan to incorporate something like that into the next set (with
> judicious comments and constants).
>
> One question...is the i_lock the right way to protect this? I think we
> could do this locklessly too (cmpxchg in a loop, for instance). I'm not
> worried about performance here -- it's just nice to be able to call
> simple stuff like this without worrying about locking.

I like the idea of using cmpxchg.


>
>> int filemap_report_wb_error(struct file *file)
>> {
>>  struct inode *inode = file_inode(file);
>>  unsigned int wb_err = READ_ONCE(mapping->wb_err);
>> 
>>  if (file->f_wb_err == wb_err)
>>  return 0;
>>  return -(wb_err & 4095);
>> }
>> 
>> That only gives us 20 bits of counter, but I think that's enough.
>
> 2^20 is 1048576, which seems a little small to me.
>
> We may end up bumping the counter on every failed I/O. How fast can we
> generate 1M failed I/Os? :)

Do we need to count all of those if no-one sees them?
i.e. use one bit to say "this error hasn't been seen".
If an error occurs with has the name error code as is currently stored,
and the bit is set, don't make a change.  Otherwise make the change,
inc the counter, set the bit.
When checking for an error, if the bit is set, clear it first.
Then you can count 500,000 errors-returned-to-some-thread, which is
probably enough.

>
> 2^52 however is 4503599627370496 (4Tios or so) ... that might take a
> little longer to overflow. Is it worth the cost here to ensure that
> this won't occur?
>
> Actually...we could put this field in the inode instead of the mapping.
> I know we've traditionally tracked this in the mapping, but is that
> required here?

What if the address_space is shared by two inodes?  That is the whole
point of the i_mapping pointer.  This would make it harder for the
"other" inode to get the error.
(Does anyone actually use fs/coda ??
 Actually, block devices use i_mapping too.
 If two block device inodes have the same major/minor number, they
 end up having i_mapping point to the same place)

If you are concerned about space in 'struct address_space', just prune
some wastage.
The "host" field brings no value.  It is only ever assigned in
inode_init_always():

struct address_space *const mapping = >i_data;
..
mapping->host = inode;

So you could change all references to use
   container_of(mapping, struct inode, i_data)

NeilBrown



>
> If we put this field in the inode then perhaps we can union it with
> something and mitigate the cost of a larger counter...maybe in the
> i_pipe union? I don't think S_ISREG inodes use anything in there, do
> they?
> -- 
> Jeff Layton 


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Matthew Wilcox

On Wed, Apr 05, 2017 at 03:49:52PM -0400, Jeff Layton wrote:
> > That only gives us 20 bits of counter, but I think that's enough.
> 
> 2^20 is 1048576, which seems a little small to me.
> 
> We may end up bumping the counter on every failed I/O. How fast can we
> generate 1M failed I/Os? :)

So there's a one-in-a-million chance of missing a failed I/O ... if
we're generating lots of errors, the next time the app calls fsync(),
it'll notice the other million times we've hit the problem :-)

> Actually...we could put this field in the inode instead of the mapping.
> I know we've traditionally tracked this in the mapping, but is that
> required here?
> 
> If we put this field in the inode then perhaps we can union it with
> something and mitigate the cost of a larger counter...maybe in the
> i_pipe union? I don't think S_ISREG inodes use anything in there, do
> they?

But writeback isn't just done on ISREG inodes, but also on S_ISBLK inodes,
which use i_bdev (right?)

Another possibility is to move this out of the address_space and into
either the super_block or the backing_device_info.  Errors don't tend
to be constrained to a single file but affect the entire filesystem,
or even multiple filesystems if you have a partitioned block device ...

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Matthew Wilcox

On Wed, Apr 05, 2017 at 03:49:52PM -0400, Jeff Layton wrote:
> > That only gives us 20 bits of counter, but I think that's enough.
> 
> 2^20 is 1048576, which seems a little small to me.
> 
> We may end up bumping the counter on every failed I/O. How fast can we
> generate 1M failed I/Os? :)

So there's a one-in-a-million chance of missing a failed I/O ... if
we're generating lots of errors, the next time the app calls fsync(),
it'll notice the other million times we've hit the problem :-)

> Actually...we could put this field in the inode instead of the mapping.
> I know we've traditionally tracked this in the mapping, but is that
> required here?
> 
> If we put this field in the inode then perhaps we can union it with
> something and mitigate the cost of a larger counter...maybe in the
> i_pipe union? I don't think S_ISREG inodes use anything in there, do
> they?

But writeback isn't just done on ISREG inodes, but also on S_ISBLK inodes,
which use i_bdev (right?)

Another possibility is to move this out of the address_space and into
either the super_block or the backing_device_info.  Errors don't tend
to be constrained to a single file but affect the entire filesystem,
or even multiple filesystems if you have a partitioned block device ...

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Jeff Layton

On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
> 
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
>   unsigned int wb_err;
> 
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0 || err < -MAX_ERRNO);
> 
>   spin_lock(>i_lock);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>   WRITE_ONCE(mapping->wb_err, wb_err);
>   spin_unlock(>i_lock);
> }
> 

I like this idea of being able to store arbitrary error codes there.
That should be used judiciously of course, but we already allow
returning arbitrary errors via the ->fsync op anyway.

I'll plan to incorporate something like that into the next set (with
judicious comments and constants).

One question...is the i_lock the right way to protect this? I think we
could do this locklessly too (cmpxchg in a loop, for instance). I'm not
worried about performance here -- it's just nice to be able to call
simple stuff like this without worrying about locking.

> int filemap_report_wb_error(struct file *file)
> {
>   struct inode *inode = file_inode(file);
>   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> 
>   if (file->f_wb_err == wb_err)
>   return 0;
>   return -(wb_err & 4095);
> }
> 
> That only gives us 20 bits of counter, but I think that's enough.

2^20 is 1048576, which seems a little small to me.

We may end up bumping the counter on every failed I/O. How fast can we
generate 1M failed I/Os? :)

2^52 however is 4503599627370496 (4Tios or so) ... that might take a
little longer to overflow. Is it worth the cost here to ensure that
this won't occur?

Actually...we could put this field in the inode instead of the mapping.
I know we've traditionally tracked this in the mapping, but is that
required here?

If we put this field in the inode then perhaps we can union it with
something and mitigate the cost of a larger counter...maybe in the
i_pipe union? I don't think S_ISREG inodes use anything in there, do
they?
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Jeff Layton

On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
> 
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
>   unsigned int wb_err;
> 
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0 || err < -MAX_ERRNO);
> 
>   spin_lock(>i_lock);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>   WRITE_ONCE(mapping->wb_err, wb_err);
>   spin_unlock(>i_lock);
> }
> 

I like this idea of being able to store arbitrary error codes there.
That should be used judiciously of course, but we already allow
returning arbitrary errors via the ->fsync op anyway.

I'll plan to incorporate something like that into the next set (with
judicious comments and constants).

One question...is the i_lock the right way to protect this? I think we
could do this locklessly too (cmpxchg in a loop, for instance). I'm not
worried about performance here -- it's just nice to be able to call
simple stuff like this without worrying about locking.

> int filemap_report_wb_error(struct file *file)
> {
>   struct inode *inode = file_inode(file);
>   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> 
>   if (file->f_wb_err == wb_err)
>   return 0;
>   return -(wb_err & 4095);
> }
> 
> That only gives us 20 bits of counter, but I think that's enough.

2^20 is 1048576, which seems a little small to me.

We may end up bumping the counter on every failed I/O. How fast can we
generate 1M failed I/Os? :)

2^52 however is 4503599627370496 (4Tios or so) ... that might take a
little longer to overflow. Is it worth the cost here to ensure that
this won't occur?

Actually...we could put this field in the inode instead of the mapping.
I know we've traditionally tracked this in the mapping, but is that
required here?

If we put this field in the inode then perhaps we can union it with
something and mitigate the cost of a larger counter...maybe in the
i_pipe union? I don't think S_ISREG inodes use anything in there, do
they?
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Jeff Layton

On Wed, 2017-04-05 at 09:13 +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> > > > Agreed that we should focus on POSIX compliance. I'll also note that
> > > > POSIX states:
> > > > 
> > > > "If more than one error occurs in processing a function call, any one
> > > > of the possible errors may be returned, as the order of
> > > > detection is undefined."
> > > > 
> > > > 
> > > > http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> > > > 
> > > > So, I'd like to push back on this idea that we need to prefer reporting
> > > > -EIO over other errors. POSIX certainly doesn't mandate that. 
> > > 
> > > I honestly wonder if we need to support ENOSPC from writeback at all.
> > > Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
> > > in 2003:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
> > > 
> > > That seems to come from here:
> > > http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
> > > which is marked as a resend, but I can't find the original.
> > > 
> > > It's a little misleading because the immediately preceding patch
> > > introduced mapping->error, so there's no precedent here to speak of.
> > > It looks like we used to just silently lose writeback errors (*cough*).
> > > 
> > > I'd like to suggest that maybe we don't need to support multiple errors
> > > at all.  That all errors, including ENOSPC, get collapsed into EIO.
> > > POSIX already tells us to do that for close() and permits us to do that
> > > for fsync().
> > > 
> > 
> > That is certainly allowed under POSIX as I interpret the spec. At a
> > minimum we just need a single flag and can collapse all errors under
> > that.
> > 
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> What if you don't have an admin?  What if it was an over-quota error?
> I think precise error messages are valuable.
> I am leaning towards "last error wins" though.  The complexity of any
> scheme that reports "worst recent error" seems to out weigh the value.
> 
> I think we should present this as a service to filesystems. e.g. create
> a "recent_wb_error" structure which the filesystem can record errors in
> when they occur, and syscalls can read errors from.
> One of these would be provided in 'struct address_space', but
> filesystems can easily embed one in their own data structure
> (e.g. nfs_open_context) if they want to.
> 
> I don't think we should return a recent_wb_error on close by default,
> but individual filesystems can ("man 2 close" implies NFS does this for
> EDQUOT at it should continue to do so).
> 
> fsync() (and file_sync_range()) should return a recent_wb_error, but
> what about write()?  It would be a suitable way to stop an application
> early, but it isn't exactly the requested write that failed...
> Posix says of EIO from write:
> 
> A physical I/O error has occurred.
> 
> which is rather vague.  Where and when did this error in physics (:-)
> occur?
> 
> O_DIRECT write() can get an EIO from a previous write-back write to the
> same file.  Maybe non-O_DIRECT writes should too?
> 

Some already do this for buffered writes.

This is really a philosophical question, IMO...is it correct to return
an error on a write call, due to writeback failing previously or during
the write call, quite possibly to a range that the write call does not
touch? I can see an argument either way for this.

Also, if we do think that returning an error on the write is the right
thing to do, should that error advance the sequence counter in the
struct file, such that an fsync afterward gets back 0? My feeling here
is that fsync should still report an error after a failed write, but
maybe that's wrong?

This is certainly one area where switching to synchronous writes on
error would make things a little simpler.
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-05 Thread Jeff Layton

On Wed, 2017-04-05 at 09:13 +1000, NeilBrown wrote:
> On Tue, Apr 04 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> > > > Agreed that we should focus on POSIX compliance. I'll also note that
> > > > POSIX states:
> > > > 
> > > > "If more than one error occurs in processing a function call, any one
> > > > of the possible errors may be returned, as the order of
> > > > detection is undefined."
> > > > 
> > > > 
> > > > http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> > > > 
> > > > So, I'd like to push back on this idea that we need to prefer reporting
> > > > -EIO over other errors. POSIX certainly doesn't mandate that. 
> > > 
> > > I honestly wonder if we need to support ENOSPC from writeback at all.
> > > Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
> > > in 2003:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
> > > 
> > > That seems to come from here:
> > > http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
> > > which is marked as a resend, but I can't find the original.
> > > 
> > > It's a little misleading because the immediately preceding patch
> > > introduced mapping->error, so there's no precedent here to speak of.
> > > It looks like we used to just silently lose writeback errors (*cough*).
> > > 
> > > I'd like to suggest that maybe we don't need to support multiple errors
> > > at all.  That all errors, including ENOSPC, get collapsed into EIO.
> > > POSIX already tells us to do that for close() and permits us to do that
> > > for fsync().
> > > 
> > 
> > That is certainly allowed under POSIX as I interpret the spec. At a
> > minimum we just need a single flag and can collapse all errors under
> > that.
> > 
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> What if you don't have an admin?  What if it was an over-quota error?
> I think precise error messages are valuable.
> I am leaning towards "last error wins" though.  The complexity of any
> scheme that reports "worst recent error" seems to out weigh the value.
> 
> I think we should present this as a service to filesystems. e.g. create
> a "recent_wb_error" structure which the filesystem can record errors in
> when they occur, and syscalls can read errors from.
> One of these would be provided in 'struct address_space', but
> filesystems can easily embed one in their own data structure
> (e.g. nfs_open_context) if they want to.
> 
> I don't think we should return a recent_wb_error on close by default,
> but individual filesystems can ("man 2 close" implies NFS does this for
> EDQUOT at it should continue to do so).
> 
> fsync() (and file_sync_range()) should return a recent_wb_error, but
> what about write()?  It would be a suitable way to stop an application
> early, but it isn't exactly the requested write that failed...
> Posix says of EIO from write:
> 
> A physical I/O error has occurred.
> 
> which is rather vague.  Where and when did this error in physics (:-)
> occur?
> 
> O_DIRECT write() can get an EIO from a previous write-back write to the
> same file.  Maybe non-O_DIRECT writes should too?
> 

Some already do this for buffered writes.

This is really a philosophical question, IMO...is it correct to return
an error on a write call, due to writeback failing previously or during
the write call, quite possibly to a range that the write call does not
touch? I can see an argument either way for this.

Also, if we do think that returning an error on the write is the right
thing to do, should that error advance the sequence counter in the
struct file, such that an fsync afterward gets back 0? My feeling here
is that fsync should still report an error after a failed write, but
maybe that's wrong?

This is certainly one area where switching to synchronous writes on
error would make things a little simpler.
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
>> On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
>> > Agreed that we should focus on POSIX compliance. I'll also note that
>> > POSIX states:
>> > 
>> > "If more than one error occurs in processing a function call, any one
>> > of the possible errors may be returned, as the order of
>> > detection is undefined."
>> > 
>> > 
>> > http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
>> > 
>> > So, I'd like to push back on this idea that we need to prefer reporting
>> > -EIO over other errors. POSIX certainly doesn't mandate that. 
>> 
>> I honestly wonder if we need to support ENOSPC from writeback at all.
>> Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
>> in 2003:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
>> 
>> That seems to come from here:
>> http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
>> which is marked as a resend, but I can't find the original.
>> 
>> It's a little misleading because the immediately preceding patch
>> introduced mapping->error, so there's no precedent here to speak of.
>> It looks like we used to just silently lose writeback errors (*cough*).
>> 
>> I'd like to suggest that maybe we don't need to support multiple errors
>> at all.  That all errors, including ENOSPC, get collapsed into EIO.
>> POSIX already tells us to do that for close() and permits us to do that
>> for fsync().
>> 
>
> That is certainly allowed under POSIX as I interpret the spec. At a
> minimum we just need a single flag and can collapse all errors under
> that.
>
> That said, I think giving more specific errors where we can is useful.
> When your program is erroring out and writing 'I/O error' to the logs,
> then how much time will your admins burn before they figure out that it
> really failed because the filesystem was full?

What if you don't have an admin?  What if it was an over-quota error?
I think precise error messages are valuable.
I am leaning towards "last error wins" though.  The complexity of any
scheme that reports "worst recent error" seems to out weigh the value.

I think we should present this as a service to filesystems. e.g. create
a "recent_wb_error" structure which the filesystem can record errors in
when they occur, and syscalls can read errors from.
One of these would be provided in 'struct address_space', but
filesystems can easily embed one in their own data structure
(e.g. nfs_open_context) if they want to.

I don't think we should return a recent_wb_error on close by default,
but individual filesystems can ("man 2 close" implies NFS does this for
EDQUOT at it should continue to do so).

fsync() (and file_sync_range()) should return a recent_wb_error, but
what about write()?  It would be a suitable way to stop an application
early, but it isn't exactly the requested write that failed...
Posix says of EIO from write:

A physical I/O error has occurred.

which is rather vague.  Where and when did this error in physics (:-)
occur?

O_DIRECT write() can get an EIO from a previous write-back write to the
same file.  Maybe non-O_DIRECT writes should too?

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
>> On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
>> > Agreed that we should focus on POSIX compliance. I'll also note that
>> > POSIX states:
>> > 
>> > "If more than one error occurs in processing a function call, any one
>> > of the possible errors may be returned, as the order of
>> > detection is undefined."
>> > 
>> > 
>> > http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
>> > 
>> > So, I'd like to push back on this idea that we need to prefer reporting
>> > -EIO over other errors. POSIX certainly doesn't mandate that. 
>> 
>> I honestly wonder if we need to support ENOSPC from writeback at all.
>> Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
>> in 2003:
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
>> 
>> That seems to come from here:
>> http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
>> which is marked as a resend, but I can't find the original.
>> 
>> It's a little misleading because the immediately preceding patch
>> introduced mapping->error, so there's no precedent here to speak of.
>> It looks like we used to just silently lose writeback errors (*cough*).
>> 
>> I'd like to suggest that maybe we don't need to support multiple errors
>> at all.  That all errors, including ENOSPC, get collapsed into EIO.
>> POSIX already tells us to do that for close() and permits us to do that
>> for fsync().
>> 
>
> That is certainly allowed under POSIX as I interpret the spec. At a
> minimum we just need a single flag and can collapse all errors under
> that.
>
> That said, I think giving more specific errors where we can is useful.
> When your program is erroring out and writing 'I/O error' to the logs,
> then how much time will your admins burn before they figure out that it
> really failed because the filesystem was full?

What if you don't have an admin?  What if it was an over-quota error?
I think precise error messages are valuable.
I am leaning towards "last error wins" though.  The complexity of any
scheme that reports "worst recent error" seems to out weigh the value.

I think we should present this as a service to filesystems. e.g. create
a "recent_wb_error" structure which the filesystem can record errors in
when they occur, and syscalls can read errors from.
One of these would be provided in 'struct address_space', but
filesystems can easily embed one in their own data structure
(e.g. nfs_open_context) if they want to.

I don't think we should return a recent_wb_error on close by default,
but individual filesystems can ("man 2 close" implies NFS does this for
EDQUOT at it should continue to do so).

fsync() (and file_sync_range()) should return a recent_wb_error, but
what about write()?  It would be a suitable way to stop an application
early, but it isn't exactly the requested write that failed...
Posix says of EIO from write:

A physical I/O error has occurred.

which is rather vague.  Where and when did this error in physics (:-)
occur?

O_DIRECT write() can get an EIO from a previous write-back write to the
same file.  Maybe non-O_DIRECT writes should too?

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Matthew Wilcox wrote:

> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
>> That said, I think giving more specific errors where we can is useful.
>> When your program is erroring out and writing 'I/O error' to the logs,
>> then how much time will your admins burn before they figure out that it
>> really failed because the filesystem was full?
>
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
>
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
>
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
>   unsigned int wb_err;
>
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0 || err < -MAX_ERRNO);
>
>   spin_lock(>i_lock);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;

Seriously?  You are missing "MAX_ERRNO" (4095) together with "1 << 12"
(4096) in the one expression without a big comment saying why?

Surely:
>   BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO+1);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (MAX_ERRNO+1)) | -err;

NeilBrown


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Matthew Wilcox wrote:

> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
>> That said, I think giving more specific errors where we can is useful.
>> When your program is erroring out and writing 'I/O error' to the logs,
>> then how much time will your admins burn before they figure out that it
>> really failed because the filesystem was full?
>
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
>
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
>
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
>   unsigned int wb_err;
>
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0 || err < -MAX_ERRNO);
>
>   spin_lock(>i_lock);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;

Seriously?  You are missing "MAX_ERRNO" (4095) together with "1 << 12"
(4096) in the one expression without a big comment saying why?

Surely:
>   BUILD_BUG_ON_NOT_POWER_OF_2(MAX_ERRNO+1);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (MAX_ERRNO+1)) | -err;

NeilBrown


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 13:03 +1000, NeilBrown wrote:
>> On Mon, Apr 03 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> > > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
>> > > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
>> > > > > writeback problem.  If I understand correctly, at the time of 
>> > > > > write(),
>> > > > > filesystems check to see if they have enough blocks to satisfy the
>> > > > > request, so ENOSPC only comes up in the writeback context for thinly
>> > > > > provisioned devices.
>> > > > 
>> > > > No, ENOSPC on writeback can certainly happen with network filesystems.
>> > > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
>> > > > do an extra RPC on every buffered write. :)
>> > > 
>> > > Aaah, yes, network filesystems.  I would indeed not want to do an extra
>> > > RPC on every write to a hole (it's a hole vs non-hole question, rather
>> > > than a buffered/unbuffered question ... unless you're WAFLing and not
>> > > reclaiming quickly enough, I suppose).
>> > > 
>> > > So, OK, that makes sense, we should keep allowing filesystems to report
>> > > ENOSPC as a writeback error.  But I think much of the argument below
>> > > still holds, and we should continue to have a prior EIO to be reported
>> > > over a new ENOSPC (even if the program has already consumed the EIO).
>> > > 
>> > 
>> > I'm fine with that (though I'd like Neil's thoughts before we decide
>> > anything) there.
>> 
>> I'd like there be a well defined time when old errors were forgotten.
>> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
>> received, but not forever.
>> Clearing the remembered errors when put_write_access() causes
>> i_writecount to reach zero is one option (as suggested), but I'm not
>> sure I'm happy with it.
>> 
>> Local filesystems, or network filesystems which receive strong write
>> delegations, should only ever return EIO to fsync.  We should
>> concentrate on them first, I think.  As there is only one possible
>> error, the seq counter is sufficient to "clear" it once it has been
>> reported to fsync() (or write()?).
>> 
>> Other network filesystems could return a whole host of errors: ENOSPC
>> EDQUOT ESTALE EPERM EFBIG ...
>> Do we want to limit exactly which errors are allowed in generic code, or
>> do we just support EIO generically and expect the filesystem to sort out
>> the details for anything else?
>> 
>> One possible approach a filesystem could take is just to allow a single
>> async writeback error.  After that error, all subsequent write()
>> system calls become synchronous. As write() or fsync() is called on each
>> file descriptor (which could possibly have sent the write which caused
>> the error), an error is returned and that fact is counted.  Once we have
>> returned as many errors as there are open file descriptors
>> (i_writecount?), and have seen a successful write, the filesystem
>> forgets all recorded errors and switches back to async writes (for that
>> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
>> 
>> The "which could possibly have sent the write which caused the error" is
>> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
>> flags to return async errors.  It allocates an nfs_open_context for each
>> user who opens a given inode, and stores an error in there.  Each dirty
>> pages is associated with one of these, so errors a sure to go to the
>> correct user, though not necessarily the correct fd at present.
>> 
>> When we specify the new behaviour we should be careful to be as vague as
>> possible while still saying what we need.  This allows filesystems some
>> flexibility.
>> 
>>   If an error happens during writeback, the next write() or fsync() (or
>>   ) on the file descriptor to which data was written will return -1
>>   with errno set to EIO or some other relevant error.  Other file
>>   descriptors open on the same file may receive EIO or some other error
>>   on a subsequent appropriate system call.
>>   It should not be assumed that close() will return an error.  fsync()
>>   must be called before close() if writeback errors are important to the
>>   application.
>> 
>> 
>
> A lot in here... :)
>
> While I like the NFS method of switching to sync I/O on error (and
> indeed, I'm copying that in the Ceph ENOSPC patches I have), I'm not
> sure it would really help anything here. The main reason NFS does that
> is to prevent you from dirtying tons of pages that can't be cleaned. 

It would help because it means there are no longer any async errors, so
there is no need to try to keep track of them.

If we decided that "last error wins", then that becomes irrelevant.  But
if we do care about any precedence of errors, then going sync is an easy
way to make sure the right error gets to the right place.

>
> While that is a laudable goal,

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Jeff Layton wrote:

> On Tue, 2017-04-04 at 13:03 +1000, NeilBrown wrote:
>> On Mon, Apr 03 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> > > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
>> > > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
>> > > > > writeback problem.  If I understand correctly, at the time of 
>> > > > > write(),
>> > > > > filesystems check to see if they have enough blocks to satisfy the
>> > > > > request, so ENOSPC only comes up in the writeback context for thinly
>> > > > > provisioned devices.
>> > > > 
>> > > > No, ENOSPC on writeback can certainly happen with network filesystems.
>> > > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
>> > > > do an extra RPC on every buffered write. :)
>> > > 
>> > > Aaah, yes, network filesystems.  I would indeed not want to do an extra
>> > > RPC on every write to a hole (it's a hole vs non-hole question, rather
>> > > than a buffered/unbuffered question ... unless you're WAFLing and not
>> > > reclaiming quickly enough, I suppose).
>> > > 
>> > > So, OK, that makes sense, we should keep allowing filesystems to report
>> > > ENOSPC as a writeback error.  But I think much of the argument below
>> > > still holds, and we should continue to have a prior EIO to be reported
>> > > over a new ENOSPC (even if the program has already consumed the EIO).
>> > > 
>> > 
>> > I'm fine with that (though I'd like Neil's thoughts before we decide
>> > anything) there.
>> 
>> I'd like there be a well defined time when old errors were forgotten.
>> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
>> received, but not forever.
>> Clearing the remembered errors when put_write_access() causes
>> i_writecount to reach zero is one option (as suggested), but I'm not
>> sure I'm happy with it.
>> 
>> Local filesystems, or network filesystems which receive strong write
>> delegations, should only ever return EIO to fsync.  We should
>> concentrate on them first, I think.  As there is only one possible
>> error, the seq counter is sufficient to "clear" it once it has been
>> reported to fsync() (or write()?).
>> 
>> Other network filesystems could return a whole host of errors: ENOSPC
>> EDQUOT ESTALE EPERM EFBIG ...
>> Do we want to limit exactly which errors are allowed in generic code, or
>> do we just support EIO generically and expect the filesystem to sort out
>> the details for anything else?
>> 
>> One possible approach a filesystem could take is just to allow a single
>> async writeback error.  After that error, all subsequent write()
>> system calls become synchronous. As write() or fsync() is called on each
>> file descriptor (which could possibly have sent the write which caused
>> the error), an error is returned and that fact is counted.  Once we have
>> returned as many errors as there are open file descriptors
>> (i_writecount?), and have seen a successful write, the filesystem
>> forgets all recorded errors and switches back to async writes (for that
>> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
>> 
>> The "which could possibly have sent the write which caused the error" is
>> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
>> flags to return async errors.  It allocates an nfs_open_context for each
>> user who opens a given inode, and stores an error in there.  Each dirty
>> pages is associated with one of these, so errors a sure to go to the
>> correct user, though not necessarily the correct fd at present.
>> 
>> When we specify the new behaviour we should be careful to be as vague as
>> possible while still saying what we need.  This allows filesystems some
>> flexibility.
>> 
>>   If an error happens during writeback, the next write() or fsync() (or
>>   ) on the file descriptor to which data was written will return -1
>>   with errno set to EIO or some other relevant error.  Other file
>>   descriptors open on the same file may receive EIO or some other error
>>   on a subsequent appropriate system call.
>>   It should not be assumed that close() will return an error.  fsync()
>>   must be called before close() if writeback errors are important to the
>>   application.
>> 
>> 
>
> A lot in here... :)
>
> While I like the NFS method of switching to sync I/O on error (and
> indeed, I'm copying that in the Ceph ENOSPC patches I have), I'm not
> sure it would really help anything here. The main reason NFS does that
> is to prevent you from dirtying tons of pages that can't be cleaned. 

It would help because it means there are no longer any async errors, so
there is no need to try to keep track of them.

If we decided that "last error wins", then that becomes irrelevant.  But
if we do care about any precedence of errors, then going sync is an easy
way to make sure the right error gets to the right place.

>
> While that is a laudable goal,

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Matthew Wilcox wrote:

> On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
>> On Mon, Apr 03 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> >> So, OK, that makes sense, we should keep allowing filesystems to report
>> >> ENOSPC as a writeback error.  But I think much of the argument below
>> >> still holds, and we should continue to have a prior EIO to be reported
>> >> over a new ENOSPC (even if the program has already consumed the EIO).
>> >
>> > I'm fine with that (though I'd like Neil's thoughts before we decide
>> > anything) there.
>> 
>> I'd like there be a well defined time when old errors were forgotten.
>> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
>> received, but not forever.
>> Clearing the remembered errors when put_write_access() causes
>> i_writecount to reach zero is one option (as suggested), but I'm not
>> sure I'm happy with it.
>> 
>> Local filesystems, or network filesystems which receive strong write
>> delegations, should only ever return EIO to fsync.  We should
>> concentrate on them first, I think.  As there is only one possible
>> error, the seq counter is sufficient to "clear" it once it has been
>> reported to fsync() (or write()?).
>> 
>> Other network filesystems could return a whole host of errors: ENOSPC
>> EDQUOT ESTALE EPERM EFBIG ...
>> Do we want to limit exactly which errors are allowed in generic code, or
>> do we just support EIO generically and expect the filesystem to sort out
>> the details for anything else?
>
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
>
> For close(), we have to map every error to EIO.
> For fsync(), we can return any error that write() could have.  That limits
> us to:
>
> EFBIG ENOSPC EIO ENOBUFS ENXIO
>
> I think EFBIG really isn't a writeback error; are there any network
> filesystems that don't know the file size limit at the time they accept
> the original write?  ENOBUFS seems like a transient error (*this* call to
> fsync() failed, but the next one may succeed ... it's the equivalent of
> ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
> error.  So that leaves us with ENOSPC and EIO, as we have support today.

I guess Posix doesn't acknowledge the existence of disk quotas?
I think we need to add EDQUOT to your list.
Other hypothetical errors errors from the server such as EPERM or ESTALE
can reasonably be mapped to EIO.

>
>> One possible approach a filesystem could take is just to allow a single
>> async writeback error.  After that error, all subsequent write()
>> system calls become synchronous. As write() or fsync() is called on each
>> file descriptor (which could possibly have sent the write which caused
>> the error), an error is returned and that fact is counted.  Once we have
>> returned as many errors as there are open file descriptors
>> (i_writecount?), and have seen a successful write, the filesystem
>> forgets all recorded errors and switches back to async writes (for that
>> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
>> 
>> The "which could possibly have sent the write which caused the error" is
>> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
>> flags to return async errors.  It allocates an nfs_open_context for each
>> user who opens a given inode, and stores an error in there.  Each dirty
>> pages is associated with one of these, so errors a sure to go to the
>> correct user, though not necessarily the correct fd at present.
>
> ... and you need the nfs_open_context in order to use the correct
> credentials when writing a page to the server, correct?

Correct.

Thanks,
NeilBrown


>
>> When we specify the new behaviour we should be careful to be as vague as
>> possible while still saying what we need.  This allows filesystems some
>> flexibility.
>> 
>>   If an error happens during writeback, the next write() or fsync() (or
>>   ) on the file descriptor to which data was written will return -1
>>   with errno set to EIO or some other relevant error.  Other file
>>   descriptors open on the same file may receive EIO or some other error
>>   on a subsequent appropriate system call.
>>   It should not be assumed that close() will return an error.  fsync()
>>   must be called before close() if writeback errors are important to the
>>   application.
>
> Thanks for explaining what NFS does today.


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread NeilBrown

On Tue, Apr 04 2017, Matthew Wilcox wrote:

> On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
>> On Mon, Apr 03 2017, Jeff Layton wrote:
>> 
>> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> >> So, OK, that makes sense, we should keep allowing filesystems to report
>> >> ENOSPC as a writeback error.  But I think much of the argument below
>> >> still holds, and we should continue to have a prior EIO to be reported
>> >> over a new ENOSPC (even if the program has already consumed the EIO).
>> >
>> > I'm fine with that (though I'd like Neil's thoughts before we decide
>> > anything) there.
>> 
>> I'd like there be a well defined time when old errors were forgotten.
>> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
>> received, but not forever.
>> Clearing the remembered errors when put_write_access() causes
>> i_writecount to reach zero is one option (as suggested), but I'm not
>> sure I'm happy with it.
>> 
>> Local filesystems, or network filesystems which receive strong write
>> delegations, should only ever return EIO to fsync.  We should
>> concentrate on them first, I think.  As there is only one possible
>> error, the seq counter is sufficient to "clear" it once it has been
>> reported to fsync() (or write()?).
>> 
>> Other network filesystems could return a whole host of errors: ENOSPC
>> EDQUOT ESTALE EPERM EFBIG ...
>> Do we want to limit exactly which errors are allowed in generic code, or
>> do we just support EIO generically and expect the filesystem to sort out
>> the details for anything else?
>
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
>
> For close(), we have to map every error to EIO.
> For fsync(), we can return any error that write() could have.  That limits
> us to:
>
> EFBIG ENOSPC EIO ENOBUFS ENXIO
>
> I think EFBIG really isn't a writeback error; are there any network
> filesystems that don't know the file size limit at the time they accept
> the original write?  ENOBUFS seems like a transient error (*this* call to
> fsync() failed, but the next one may succeed ... it's the equivalent of
> ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
> error.  So that leaves us with ENOSPC and EIO, as we have support today.

I guess Posix doesn't acknowledge the existence of disk quotas?
I think we need to add EDQUOT to your list.
Other hypothetical errors errors from the server such as EPERM or ESTALE
can reasonably be mapped to EIO.

>
>> One possible approach a filesystem could take is just to allow a single
>> async writeback error.  After that error, all subsequent write()
>> system calls become synchronous. As write() or fsync() is called on each
>> file descriptor (which could possibly have sent the write which caused
>> the error), an error is returned and that fact is counted.  Once we have
>> returned as many errors as there are open file descriptors
>> (i_writecount?), and have seen a successful write, the filesystem
>> forgets all recorded errors and switches back to async writes (for that
>> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
>> 
>> The "which could possibly have sent the write which caused the error" is
>> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
>> flags to return async errors.  It allocates an nfs_open_context for each
>> user who opens a given inode, and stores an error in there.  Each dirty
>> pages is associated with one of these, so errors a sure to go to the
>> correct user, though not necessarily the correct fd at present.
>
> ... and you need the nfs_open_context in order to use the correct
> credentials when writing a page to the server, correct?

Correct.

Thanks,
NeilBrown


>
>> When we specify the new behaviour we should be careful to be as vague as
>> possible while still saying what we need.  This allows filesystems some
>> flexibility.
>> 
>>   If an error happens during writeback, the next write() or fsync() (or
>>   ) on the file descriptor to which data was written will return -1
>>   with errno set to EIO or some other relevant error.  Other file
>>   descriptors open on the same file may receive EIO or some other error
>>   on a subsequent appropriate system call.
>>   It should not be assumed that close() will return an error.  fsync()
>>   must be called before close() if writeback errors are important to the
>>   application.
>
> Thanks for explaining what NFS does today.


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
> 
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
>   unsigned int wb_err;
> 
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0 || err < -MAX_ERRNO);
> 
>   spin_lock(>i_lock);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>   WRITE_ONCE(mapping->wb_err, wb_err);

Do we need the WRITE_ONCE, given that you're under a spinlock there?

>   spin_unlock(>i_lock);
> }
> 
> int filemap_report_wb_error(struct file *file)
> {
>   struct inode *inode = file_inode(file);
>   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> 
>   if (file->f_wb_err == wb_err)
>   return 0;
>   return -(wb_err & 4095);
> }
> 
> That only gives us 20 bits of counter, but I think that's enough.

That'd be fine with me, but I'm all for allowing filesystems to return
arbitrary writeback errors on fsync.

Others may have different opinions there. We could add a wrapper
function that sanitizes the error codes if some filesystems wanted that
though.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 10:09 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> > That said, I think giving more specific errors where we can is useful.
> > When your program is erroring out and writing 'I/O error' to the logs,
> > then how much time will your admins burn before they figure out that it
> > really failed because the filesystem was full?
> 
> df is one of the first things I check ... a few years ago, I also learned
> to check df -i ... ;-)
> 
> Anyway, given the decision to simply report the last error lets us do this
> implementation:
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
>   unsigned int wb_err;
> 
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0 || err < -MAX_ERRNO);
> 
>   spin_lock(>i_lock);
>   wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
>   WRITE_ONCE(mapping->wb_err, wb_err);

Do we need the WRITE_ONCE, given that you're under a spinlock there?

>   spin_unlock(>i_lock);
> }
> 
> int filemap_report_wb_error(struct file *file)
> {
>   struct inode *inode = file_inode(file);
>   unsigned int wb_err = READ_ONCE(mapping->wb_err);
> 
>   if (file->f_wb_err == wb_err)
>   return 0;
>   return -(wb_err & 4095);
> }
> 
> That only gives us 20 bits of counter, but I think that's enough.

That'd be fine with me, but I'm all for allowing filesystems to return
arbitrary writeback errors on fsync.

Others may have different opinions there. We could add a wrapper
function that sanitizes the error codes if some filesystems wanted that
though.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Matthew Wilcox

On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> That said, I think giving more specific errors where we can is useful.
> When your program is erroring out and writing 'I/O error' to the logs,
> then how much time will your admins burn before they figure out that it
> really failed because the filesystem was full?

df is one of the first things I check ... a few years ago, I also learned
to check df -i ... ;-)

Anyway, given the decision to simply report the last error lets us do this
implementation:

void filemap_set_wb_error(struct address_space *mapping, int err)
{
struct inode *inode = mapping->host;
unsigned int wb_err;

if (!err)
return;
/*
 * This should be called with the error code that we want to return
 * on fsync. Thus, it should always be <= 0.
 */
WARN_ON(err > 0 || err < -MAX_ERRNO);

spin_lock(>i_lock);
wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
WRITE_ONCE(mapping->wb_err, wb_err);
spin_unlock(>i_lock);
}

int filemap_report_wb_error(struct file *file)
{
struct inode *inode = file_inode(file);
unsigned int wb_err = READ_ONCE(mapping->wb_err);

if (file->f_wb_err == wb_err)
return 0;
return -(wb_err & 4095);
}

That only gives us 20 bits of counter, but I think that's enough.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Matthew Wilcox

On Tue, Apr 04, 2017 at 12:25:46PM -0400, Jeff Layton wrote:
> That said, I think giving more specific errors where we can is useful.
> When your program is erroring out and writing 'I/O error' to the logs,
> then how much time will your admins burn before they figure out that it
> really failed because the filesystem was full?

df is one of the first things I check ... a few years ago, I also learned
to check df -i ... ;-)

Anyway, given the decision to simply report the last error lets us do this
implementation:

void filemap_set_wb_error(struct address_space *mapping, int err)
{
struct inode *inode = mapping->host;
unsigned int wb_err;

if (!err)
return;
/*
 * This should be called with the error code that we want to return
 * on fsync. Thus, it should always be <= 0.
 */
WARN_ON(err > 0 || err < -MAX_ERRNO);

spin_lock(>i_lock);
wb_err = ((mapping->wb_err & ~MAX_ERRNO) + (1 << 12)) | -err;
WRITE_ONCE(mapping->wb_err, wb_err);
spin_unlock(>i_lock);
}

int filemap_report_wb_error(struct file *file)
{
struct inode *inode = file_inode(file);
unsigned int wb_err = READ_ONCE(mapping->wb_err);

if (file->f_wb_err == wb_err)
return 0;
return -(wb_err & 4095);
}

That only gives us 20 bits of counter, but I think that's enough.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> > Agreed that we should focus on POSIX compliance. I'll also note that
> > POSIX states:
> > 
> > "If more than one error occurs in processing a function call, any one
> > of the possible errors may be returned, as the order of
> > detection is undefined."
> > 
> > 
> > http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> > 
> > So, I'd like to push back on this idea that we need to prefer reporting
> > -EIO over other errors. POSIX certainly doesn't mandate that. 
> 
> I honestly wonder if we need to support ENOSPC from writeback at all.
> Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
> in 2003:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
> 
> That seems to come from here:
> http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
> which is marked as a resend, but I can't find the original.
> 
> It's a little misleading because the immediately preceding patch
> introduced mapping->error, so there's no precedent here to speak of.
> It looks like we used to just silently lose writeback errors (*cough*).
> 
> I'd like to suggest that maybe we don't need to support multiple errors
> at all.  That all errors, including ENOSPC, get collapsed into EIO.
> POSIX already tells us to do that for close() and permits us to do that
> for fsync().
> 

That is certainly allowed under POSIX as I interpret the spec. At a
minimum we just need a single flag and can collapse all errors under
that.

That said, I think giving more specific errors where we can is useful.
When your program is erroring out and writing 'I/O error' to the logs,
then how much time will your admins burn before they figure out that it
really failed because the filesystem was full?
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 09:12 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> > Agreed that we should focus on POSIX compliance. I'll also note that
> > POSIX states:
> > 
> > "If more than one error occurs in processing a function call, any one
> > of the possible errors may be returned, as the order of
> > detection is undefined."
> > 
> > 
> > http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> > 
> > So, I'd like to push back on this idea that we need to prefer reporting
> > -EIO over other errors. POSIX certainly doesn't mandate that. 
> 
> I honestly wonder if we need to support ENOSPC from writeback at all.
> Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
> in 2003:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16
> 
> That seems to come from here:
> http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
> which is marked as a resend, but I can't find the original.
> 
> It's a little misleading because the immediately preceding patch
> introduced mapping->error, so there's no precedent here to speak of.
> It looks like we used to just silently lose writeback errors (*cough*).
> 
> I'd like to suggest that maybe we don't need to support multiple errors
> at all.  That all errors, including ENOSPC, get collapsed into EIO.
> POSIX already tells us to do that for close() and permits us to do that
> for fsync().
> 

That is certainly allowed under POSIX as I interpret the spec. At a
minimum we just need a single flag and can collapse all errors under
that.

That said, I think giving more specific errors where we can is useful.
When your program is erroring out and writing 'I/O error' to the logs,
then how much time will your admins burn before they figure out that it
really failed because the filesystem was full?
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Matthew Wilcox

On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> Agreed that we should focus on POSIX compliance. I'll also note that
> POSIX states:
> 
> "If more than one error occurs in processing a function call, any one
> of the possible errors may be returned, as the order of
> detection is undefined."
> 
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> 
> So, I'd like to push back on this idea that we need to prefer reporting
> -EIO over other errors. POSIX certainly doesn't mandate that. 

I honestly wonder if we need to support ENOSPC from writeback at all.
Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
in 2003:

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16

That seems to come from here:
http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
which is marked as a resend, but I can't find the original.

It's a little misleading because the immediately preceding patch
introduced mapping->error, so there's no precedent here to speak of.
It looks like we used to just silently lose writeback errors (*cough*).

I'd like to suggest that maybe we don't need to support multiple errors
at all.  That all errors, including ENOSPC, get collapsed into EIO.
POSIX already tells us to do that for close() and permits us to do that
for fsync().

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Matthew Wilcox

On Tue, Apr 04, 2017 at 08:17:48AM -0400, Jeff Layton wrote:
> Agreed that we should focus on POSIX compliance. I'll also note that
> POSIX states:
> 
> "If more than one error occurs in processing a function call, any one
> of the possible errors may be returned, as the order of
> detection is undefined."
> 
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03
> 
> So, I'd like to push back on this idea that we need to prefer reporting
> -EIO over other errors. POSIX certainly doesn't mandate that. 

I honestly wonder if we need to support ENOSPC from writeback at all.
Looking at our history, the AS_EIO / AS_ENOSPC came from this patch
in 2003:

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=fcad2b42fc2e15a94ba1a1ba8535681a735bfd16

That seems to come from here:
http://lkml.iu.edu/hypermail/linux/kernel/0308.0/0205.html
which is marked as a resend, but I can't find the original.

It's a little misleading because the immediately preceding patch
introduced mapping->error, so there's no precedent here to speak of.
It looks like we used to just silently lose writeback errors (*cough*).

I'd like to suggest that maybe we don't need to support multiple errors
at all.  That all errors, including ENOSPC, get collapsed into EIO.
POSIX already tells us to do that for close() and permits us to do that
for fsync().

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Theodore Ts'o

On Tue, Apr 04, 2017 at 04:53:58AM -0700, Matthew Wilcox wrote:
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
> 
> For close(), we have to map every error to EIO.

I don't believe that's true.  POSIX explicitly states that
implementations may return additional errors, and define additional
errors, in addition to the ones that are listeds in the specification.

The specification is pretty clear about saying when a particular
system call "shall" fail (meaning it must fail if it was so listed),
and when it "may" fail.  But no where does it say that these are the
only situations when a system call is allowed to fail.

Which is good, because ext4 and xfs will both return EUCLEAN if the
file system is corrupted.  (Mainly because it's too painful to define
a new errno, EFSCORRUPTED --- not because of trying to get it into
Posix, but because it's painful to get new errno's defined in glibc.)
POSIX says *nothing* about file systems being corrupted, and if your
interpretation were correct, we're already in violation of POSIX

  - Ted

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Theodore Ts'o

On Tue, Apr 04, 2017 at 04:53:58AM -0700, Matthew Wilcox wrote:
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
> 
> For close(), we have to map every error to EIO.

I don't believe that's true.  POSIX explicitly states that
implementations may return additional errors, and define additional
errors, in addition to the ones that are listeds in the specification.

The specification is pretty clear about saying when a particular
system call "shall" fail (meaning it must fail if it was so listed),
and when it "may" fail.  But no where does it say that these are the
only situations when a system call is allowed to fail.

Which is good, because ext4 and xfs will both return EUCLEAN if the
file system is corrupted.  (Mainly because it's too painful to define
a new errno, EFSCORRUPTED --- not because of trying to get it into
Posix, but because it's painful to get new errno's defined in glibc.)
POSIX says *nothing* about file systems being corrupted, and if your
interpretation were correct, we're already in violation of POSIX

  - Ted

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 04:53 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
> > On Mon, Apr 03 2017, Jeff Layton wrote:
> > 
> > > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > > > So, OK, that makes sense, we should keep allowing filesystems to report
> > > > ENOSPC as a writeback error.  But I think much of the argument below
> > > > still holds, and we should continue to have a prior EIO to be reported
> > > > over a new ENOSPC (even if the program has already consumed the EIO).
> > > 
> > > I'm fine with that (though I'd like Neil's thoughts before we decide
> > > anything) there.
> > 
> > I'd like there be a well defined time when old errors were forgotten.
> > It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> > received, but not forever.
> > Clearing the remembered errors when put_write_access() causes
> > i_writecount to reach zero is one option (as suggested), but I'm not
> > sure I'm happy with it.
> > 
> > Local filesystems, or network filesystems which receive strong write
> > delegations, should only ever return EIO to fsync.  We should
> > concentrate on them first, I think.  As there is only one possible
> > error, the seq counter is sufficient to "clear" it once it has been
> > reported to fsync() (or write()?).
> > 
> > Other network filesystems could return a whole host of errors: ENOSPC
> > EDQUOT ESTALE EPERM EFBIG ...
> > Do we want to limit exactly which errors are allowed in generic code, or
> > do we just support EIO generically and expect the filesystem to sort out
> > the details for anything else?
> 
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
> 
> For close(), we have to map every error to EIO.
> For fsync(), we can return any error that write() could have.  That limits
> us to:
> 
> EFBIG ENOSPC EIO ENOBUFS ENXIO
> 
> I think EFBIG really isn't a writeback error; are there any network
> filesystems that don't know the file size limit at the time they accept
> the original write?  ENOBUFS seems like a transient error (*this* call to
> fsync() failed, but the next one may succeed ... it's the equivalent of
> ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
> error.  So that leaves us with ENOSPC and EIO, as we have support today.
> 

Agreed that we should focus on POSIX compliance. I'll also note that
POSIX states:

"If more than one error occurs in processing a function call, any one
of the possible errors may be returned, as the order of
detection is undefined."


http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03

So, I'd like to push back on this idea that we need to prefer reporting
-EIO over other errors. POSIX certainly doesn't mandate that. 

If we agree that that is the case, then I think the simplest thing to
do here would be to clear the other error flag(s) when we get a new
error, such that we only preserve the latest one. With that, we also
wouldn't need to clear anything out when i_writecount goes to zero
either. It would "just work" without that.

> > One possible approach a filesystem could take is just to allow a single
> > async writeback error.  After that error, all subsequent write()
> > system calls become synchronous. As write() or fsync() is called on each
> > file descriptor (which could possibly have sent the write which caused
> > the error), an error is returned and that fact is counted.  Once we have
> > returned as many errors as there are open file descriptors
> > (i_writecount?), and have seen a successful write, the filesystem
> > forgets all recorded errors and switches back to async writes (for that
> > inode).   NFS does this switch-to-sync-on-error.  See 
> > nfs_need_check_write().
> > 
> > The "which could possibly have sent the write which caused the error" is
> > an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> > flags to return async errors.  It allocates an nfs_open_context for each
> > user who opens a given inode, and stores an error in there.  Each dirty
> > pages is associated with one of these, so errors a sure to go to the
> > correct user, though not necessarily the correct fd at present.
> 
> ... and you need the nfs_open_context in order to use the correct
> credentials when writing a page to the server, correct?
> 

Yes, and it is expensive. I don't think we want to do that at the
generic VFS layer if we can at all help it.

> > When we specify the new behaviour we should be careful to be as vague as
> > possible while still saying what we need.  This allows filesystems some
> > flexibility.
> > 
> >   If an error happens during writeback, the next write() or fsync() (or
> >   ) on the

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 04:53 -0700, Matthew Wilcox wrote:
> On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
> > On Mon, Apr 03 2017, Jeff Layton wrote:
> > 
> > > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > > > So, OK, that makes sense, we should keep allowing filesystems to report
> > > > ENOSPC as a writeback error.  But I think much of the argument below
> > > > still holds, and we should continue to have a prior EIO to be reported
> > > > over a new ENOSPC (even if the program has already consumed the EIO).
> > > 
> > > I'm fine with that (though I'd like Neil's thoughts before we decide
> > > anything) there.
> > 
> > I'd like there be a well defined time when old errors were forgotten.
> > It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> > received, but not forever.
> > Clearing the remembered errors when put_write_access() causes
> > i_writecount to reach zero is one option (as suggested), but I'm not
> > sure I'm happy with it.
> > 
> > Local filesystems, or network filesystems which receive strong write
> > delegations, should only ever return EIO to fsync.  We should
> > concentrate on them first, I think.  As there is only one possible
> > error, the seq counter is sufficient to "clear" it once it has been
> > reported to fsync() (or write()?).
> > 
> > Other network filesystems could return a whole host of errors: ENOSPC
> > EDQUOT ESTALE EPERM EFBIG ...
> > Do we want to limit exactly which errors are allowed in generic code, or
> > do we just support EIO generically and expect the filesystem to sort out
> > the details for anything else?
> 
> I'd like us to focus on our POSIX compliance here and not return
> arbitrary errors.  The relevant pages are here:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
> http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html
> 
> For close(), we have to map every error to EIO.
> For fsync(), we can return any error that write() could have.  That limits
> us to:
> 
> EFBIG ENOSPC EIO ENOBUFS ENXIO
> 
> I think EFBIG really isn't a writeback error; are there any network
> filesystems that don't know the file size limit at the time they accept
> the original write?  ENOBUFS seems like a transient error (*this* call to
> fsync() failed, but the next one may succeed ... it's the equivalent of
> ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
> error.  So that leaves us with ENOSPC and EIO, as we have support today.
> 

Agreed that we should focus on POSIX compliance. I'll also note that
POSIX states:

"If more than one error occurs in processing a function call, any one
of the possible errors may be returned, as the order of
detection is undefined."


http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_03

So, I'd like to push back on this idea that we need to prefer reporting
-EIO over other errors. POSIX certainly doesn't mandate that. 

If we agree that that is the case, then I think the simplest thing to
do here would be to clear the other error flag(s) when we get a new
error, such that we only preserve the latest one. With that, we also
wouldn't need to clear anything out when i_writecount goes to zero
either. It would "just work" without that.

> > One possible approach a filesystem could take is just to allow a single
> > async writeback error.  After that error, all subsequent write()
> > system calls become synchronous. As write() or fsync() is called on each
> > file descriptor (which could possibly have sent the write which caused
> > the error), an error is returned and that fact is counted.  Once we have
> > returned as many errors as there are open file descriptors
> > (i_writecount?), and have seen a successful write, the filesystem
> > forgets all recorded errors and switches back to async writes (for that
> > inode).   NFS does this switch-to-sync-on-error.  See 
> > nfs_need_check_write().
> > 
> > The "which could possibly have sent the write which caused the error" is
> > an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> > flags to return async errors.  It allocates an nfs_open_context for each
> > user who opens a given inode, and stores an error in there.  Each dirty
> > pages is associated with one of these, so errors a sure to go to the
> > correct user, though not necessarily the correct fd at present.
> 
> ... and you need the nfs_open_context in order to use the correct
> credentials when writing a page to the server, correct?
> 

Yes, and it is expensive. I don't think we want to do that at the
generic VFS layer if we can at all help it.

> > When we specify the new behaviour we should be careful to be as vague as
> > possible while still saying what we need.  This allows filesystems some
> > flexibility.
> > 
> >   If an error happens during writeback, the next write() or fsync() (or
> >   ) on the

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Matthew Wilcox

On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
> On Mon, Apr 03 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> >> So, OK, that makes sense, we should keep allowing filesystems to report
> >> ENOSPC as a writeback error.  But I think much of the argument below
> >> still holds, and we should continue to have a prior EIO to be reported
> >> over a new ENOSPC (even if the program has already consumed the EIO).
> >
> > I'm fine with that (though I'd like Neil's thoughts before we decide
> > anything) there.
> 
> I'd like there be a well defined time when old errors were forgotten.
> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> received, but not forever.
> Clearing the remembered errors when put_write_access() causes
> i_writecount to reach zero is one option (as suggested), but I'm not
> sure I'm happy with it.
> 
> Local filesystems, or network filesystems which receive strong write
> delegations, should only ever return EIO to fsync.  We should
> concentrate on them first, I think.  As there is only one possible
> error, the seq counter is sufficient to "clear" it once it has been
> reported to fsync() (or write()?).
> 
> Other network filesystems could return a whole host of errors: ENOSPC
> EDQUOT ESTALE EPERM EFBIG ...
> Do we want to limit exactly which errors are allowed in generic code, or
> do we just support EIO generically and expect the filesystem to sort out
> the details for anything else?

I'd like us to focus on our POSIX compliance here and not return
arbitrary errors.  The relevant pages are here:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html

For close(), we have to map every error to EIO.
For fsync(), we can return any error that write() could have.  That limits
us to:

EFBIG ENOSPC EIO ENOBUFS ENXIO

I think EFBIG really isn't a writeback error; are there any network
filesystems that don't know the file size limit at the time they accept
the original write?  ENOBUFS seems like a transient error (*this* call to
fsync() failed, but the next one may succeed ... it's the equivalent of
ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
error.  So that leaves us with ENOSPC and EIO, as we have support today.

> One possible approach a filesystem could take is just to allow a single
> async writeback error.  After that error, all subsequent write()
> system calls become synchronous. As write() or fsync() is called on each
> file descriptor (which could possibly have sent the write which caused
> the error), an error is returned and that fact is counted.  Once we have
> returned as many errors as there are open file descriptors
> (i_writecount?), and have seen a successful write, the filesystem
> forgets all recorded errors and switches back to async writes (for that
> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> 
> The "which could possibly have sent the write which caused the error" is
> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> flags to return async errors.  It allocates an nfs_open_context for each
> user who opens a given inode, and stores an error in there.  Each dirty
> pages is associated with one of these, so errors a sure to go to the
> correct user, though not necessarily the correct fd at present.

... and you need the nfs_open_context in order to use the correct
credentials when writing a page to the server, correct?

> When we specify the new behaviour we should be careful to be as vague as
> possible while still saying what we need.  This allows filesystems some
> flexibility.
> 
>   If an error happens during writeback, the next write() or fsync() (or
>   ) on the file descriptor to which data was written will return -1
>   with errno set to EIO or some other relevant error.  Other file
>   descriptors open on the same file may receive EIO or some other error
>   on a subsequent appropriate system call.
>   It should not be assumed that close() will return an error.  fsync()
>   must be called before close() if writeback errors are important to the
>   application.

Thanks for explaining what NFS does today.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Matthew Wilcox

On Tue, Apr 04, 2017 at 01:03:22PM +1000, NeilBrown wrote:
> On Mon, Apr 03 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> >> So, OK, that makes sense, we should keep allowing filesystems to report
> >> ENOSPC as a writeback error.  But I think much of the argument below
> >> still holds, and we should continue to have a prior EIO to be reported
> >> over a new ENOSPC (even if the program has already consumed the EIO).
> >
> > I'm fine with that (though I'd like Neil's thoughts before we decide
> > anything) there.
> 
> I'd like there be a well defined time when old errors were forgotten.
> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> received, but not forever.
> Clearing the remembered errors when put_write_access() causes
> i_writecount to reach zero is one option (as suggested), but I'm not
> sure I'm happy with it.
> 
> Local filesystems, or network filesystems which receive strong write
> delegations, should only ever return EIO to fsync.  We should
> concentrate on them first, I think.  As there is only one possible
> error, the seq counter is sufficient to "clear" it once it has been
> reported to fsync() (or write()?).
> 
> Other network filesystems could return a whole host of errors: ENOSPC
> EDQUOT ESTALE EPERM EFBIG ...
> Do we want to limit exactly which errors are allowed in generic code, or
> do we just support EIO generically and expect the filesystem to sort out
> the details for anything else?

I'd like us to focus on our POSIX compliance here and not return
arbitrary errors.  The relevant pages are here:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/fsync.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/write.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/close.html

For close(), we have to map every error to EIO.
For fsync(), we can return any error that write() could have.  That limits
us to:

EFBIG ENOSPC EIO ENOBUFS ENXIO

I think EFBIG really isn't a writeback error; are there any network
filesystems that don't know the file size limit at the time they accept
the original write?  ENOBUFS seems like a transient error (*this* call to
fsync() failed, but the next one may succeed ... it's the equivalent of
ENOMEM).  ENXIO seems to me like it's a submission error, not a writeback
error.  So that leaves us with ENOSPC and EIO, as we have support today.

> One possible approach a filesystem could take is just to allow a single
> async writeback error.  After that error, all subsequent write()
> system calls become synchronous. As write() or fsync() is called on each
> file descriptor (which could possibly have sent the write which caused
> the error), an error is returned and that fact is counted.  Once we have
> returned as many errors as there are open file descriptors
> (i_writecount?), and have seen a successful write, the filesystem
> forgets all recorded errors and switches back to async writes (for that
> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> 
> The "which could possibly have sent the write which caused the error" is
> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> flags to return async errors.  It allocates an nfs_open_context for each
> user who opens a given inode, and stores an error in there.  Each dirty
> pages is associated with one of these, so errors a sure to go to the
> correct user, though not necessarily the correct fd at present.

... and you need the nfs_open_context in order to use the correct
credentials when writing a page to the server, correct?

> When we specify the new behaviour we should be careful to be as vague as
> possible while still saying what we need.  This allows filesystems some
> flexibility.
> 
>   If an error happens during writeback, the next write() or fsync() (or
>   ) on the file descriptor to which data was written will return -1
>   with errno set to EIO or some other relevant error.  Other file
>   descriptors open on the same file may receive EIO or some other error
>   on a subsequent appropriate system call.
>   It should not be assumed that close() will return an error.  fsync()
>   must be called before close() if writeback errors are important to the
>   application.

Thanks for explaining what NFS does today.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 13:03 +1000, NeilBrown wrote:
> On Mon, Apr 03 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > > > writeback problem.  If I understand correctly, at the time of write(),
> > > > > filesystems check to see if they have enough blocks to satisfy the
> > > > > request, so ENOSPC only comes up in the writeback context for thinly
> > > > > provisioned devices.
> > > > 
> > > > No, ENOSPC on writeback can certainly happen with network filesystems.
> > > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > > > do an extra RPC on every buffered write. :)
> > > 
> > > Aaah, yes, network filesystems.  I would indeed not want to do an extra
> > > RPC on every write to a hole (it's a hole vs non-hole question, rather
> > > than a buffered/unbuffered question ... unless you're WAFLing and not
> > > reclaiming quickly enough, I suppose).
> > > 
> > > So, OK, that makes sense, we should keep allowing filesystems to report
> > > ENOSPC as a writeback error.  But I think much of the argument below
> > > still holds, and we should continue to have a prior EIO to be reported
> > > over a new ENOSPC (even if the program has already consumed the EIO).
> > > 
> > 
> > I'm fine with that (though I'd like Neil's thoughts before we decide
> > anything) there.
> 
> I'd like there be a well defined time when old errors were forgotten.
> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> received, but not forever.
> Clearing the remembered errors when put_write_access() causes
> i_writecount to reach zero is one option (as suggested), but I'm not
> sure I'm happy with it.
> 
> Local filesystems, or network filesystems which receive strong write
> delegations, should only ever return EIO to fsync.  We should
> concentrate on them first, I think.  As there is only one possible
> error, the seq counter is sufficient to "clear" it once it has been
> reported to fsync() (or write()?).
> 
> Other network filesystems could return a whole host of errors: ENOSPC
> EDQUOT ESTALE EPERM EFBIG ...
> Do we want to limit exactly which errors are allowed in generic code, or
> do we just support EIO generically and expect the filesystem to sort out
> the details for anything else?
> 
> One possible approach a filesystem could take is just to allow a single
> async writeback error.  After that error, all subsequent write()
> system calls become synchronous. As write() or fsync() is called on each
> file descriptor (which could possibly have sent the write which caused
> the error), an error is returned and that fact is counted.  Once we have
> returned as many errors as there are open file descriptors
> (i_writecount?), and have seen a successful write, the filesystem
> forgets all recorded errors and switches back to async writes (for that
> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> 
> The "which could possibly have sent the write which caused the error" is
> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> flags to return async errors.  It allocates an nfs_open_context for each
> user who opens a given inode, and stores an error in there.  Each dirty
> pages is associated with one of these, so errors a sure to go to the
> correct user, though not necessarily the correct fd at present.
> 
> When we specify the new behaviour we should be careful to be as vague as
> possible while still saying what we need.  This allows filesystems some
> flexibility.
> 
>   If an error happens during writeback, the next write() or fsync() (or
>   ) on the file descriptor to which data was written will return -1
>   with errno set to EIO or some other relevant error.  Other file
>   descriptors open on the same file may receive EIO or some other error
>   on a subsequent appropriate system call.
>   It should not be assumed that close() will return an error.  fsync()
>   must be called before close() if writeback errors are important to the
>   application.
> 
> 

A lot in here... :)

While I like the NFS method of switching to sync I/O on error (and
indeed, I'm copying that in the Ceph ENOSPC patches I have), I'm not
sure it would really help anything here. The main reason NFS does that
is to prevent you from dirtying tons of pages that can't be cleaned. 

While that is a laudable goal, it's not really the problem I'm
interested in solving here. My goal is simply to ensure that you see a
writeback error on fsync if one occurred since the last fsync.

I think it just comes down to the fact that I'm not convinced that it
really matters much _what_ error gets reported, as long as you get one.
As you've mentioned in earlier discussions, most programs just treat it
as a fatal error anyway. As long as that error is representative of
some error that occurred

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-04 Thread Jeff Layton

On Tue, 2017-04-04 at 13:03 +1000, NeilBrown wrote:
> On Mon, Apr 03 2017, Jeff Layton wrote:
> 
> > On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > > > writeback problem.  If I understand correctly, at the time of write(),
> > > > > filesystems check to see if they have enough blocks to satisfy the
> > > > > request, so ENOSPC only comes up in the writeback context for thinly
> > > > > provisioned devices.
> > > > 
> > > > No, ENOSPC on writeback can certainly happen with network filesystems.
> > > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > > > do an extra RPC on every buffered write. :)
> > > 
> > > Aaah, yes, network filesystems.  I would indeed not want to do an extra
> > > RPC on every write to a hole (it's a hole vs non-hole question, rather
> > > than a buffered/unbuffered question ... unless you're WAFLing and not
> > > reclaiming quickly enough, I suppose).
> > > 
> > > So, OK, that makes sense, we should keep allowing filesystems to report
> > > ENOSPC as a writeback error.  But I think much of the argument below
> > > still holds, and we should continue to have a prior EIO to be reported
> > > over a new ENOSPC (even if the program has already consumed the EIO).
> > > 
> > 
> > I'm fine with that (though I'd like Neil's thoughts before we decide
> > anything) there.
> 
> I'd like there be a well defined time when old errors were forgotten.
> It does make sense for EIO to persist even if ENOSPC or EDQUOT is
> received, but not forever.
> Clearing the remembered errors when put_write_access() causes
> i_writecount to reach zero is one option (as suggested), but I'm not
> sure I'm happy with it.
> 
> Local filesystems, or network filesystems which receive strong write
> delegations, should only ever return EIO to fsync.  We should
> concentrate on them first, I think.  As there is only one possible
> error, the seq counter is sufficient to "clear" it once it has been
> reported to fsync() (or write()?).
> 
> Other network filesystems could return a whole host of errors: ENOSPC
> EDQUOT ESTALE EPERM EFBIG ...
> Do we want to limit exactly which errors are allowed in generic code, or
> do we just support EIO generically and expect the filesystem to sort out
> the details for anything else?
> 
> One possible approach a filesystem could take is just to allow a single
> async writeback error.  After that error, all subsequent write()
> system calls become synchronous. As write() or fsync() is called on each
> file descriptor (which could possibly have sent the write which caused
> the error), an error is returned and that fact is counted.  Once we have
> returned as many errors as there are open file descriptors
> (i_writecount?), and have seen a successful write, the filesystem
> forgets all recorded errors and switches back to async writes (for that
> inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().
> 
> The "which could possibly have sent the write which caused the error" is
> an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
> flags to return async errors.  It allocates an nfs_open_context for each
> user who opens a given inode, and stores an error in there.  Each dirty
> pages is associated with one of these, so errors a sure to go to the
> correct user, though not necessarily the correct fd at present.
> 
> When we specify the new behaviour we should be careful to be as vague as
> possible while still saying what we need.  This allows filesystems some
> flexibility.
> 
>   If an error happens during writeback, the next write() or fsync() (or
>   ) on the file descriptor to which data was written will return -1
>   with errno set to EIO or some other relevant error.  Other file
>   descriptors open on the same file may receive EIO or some other error
>   on a subsequent appropriate system call.
>   It should not be assumed that close() will return an error.  fsync()
>   must be called before close() if writeback errors are important to the
>   application.
> 
> 

A lot in here... :)

While I like the NFS method of switching to sync I/O on error (and
indeed, I'm copying that in the Ceph ENOSPC patches I have), I'm not
sure it would really help anything here. The main reason NFS does that
is to prevent you from dirtying tons of pages that can't be cleaned. 

While that is a laudable goal, it's not really the problem I'm
interested in solving here. My goal is simply to ensure that you see a
writeback error on fsync if one occurred since the last fsync.

I think it just comes down to the fact that I'm not convinced that it
really matters much _what_ error gets reported, as long as you get one.
As you've mentioned in earlier discussions, most programs just treat it
as a fatal error anyway. As long as that error is representative of
some error that occurred

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread NeilBrown

On Mon, Apr 03 2017, Jeff Layton wrote:

> On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
>> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
>> > > writeback problem.  If I understand correctly, at the time of write(),
>> > > filesystems check to see if they have enough blocks to satisfy the
>> > > request, so ENOSPC only comes up in the writeback context for thinly
>> > > provisioned devices.
>> > 
>> > No, ENOSPC on writeback can certainly happen with network filesystems.
>> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
>> > do an extra RPC on every buffered write. :)
>> 
>> Aaah, yes, network filesystems.  I would indeed not want to do an extra
>> RPC on every write to a hole (it's a hole vs non-hole question, rather
>> than a buffered/unbuffered question ... unless you're WAFLing and not
>> reclaiming quickly enough, I suppose).
>> 
>> So, OK, that makes sense, we should keep allowing filesystems to report
>> ENOSPC as a writeback error.  But I think much of the argument below
>> still holds, and we should continue to have a prior EIO to be reported
>> over a new ENOSPC (even if the program has already consumed the EIO).
>> 
>
> I'm fine with that (though I'd like Neil's thoughts before we decide
> anything) there.

I'd like there be a well defined time when old errors were forgotten.
It does make sense for EIO to persist even if ENOSPC or EDQUOT is
received, but not forever.
Clearing the remembered errors when put_write_access() causes
i_writecount to reach zero is one option (as suggested), but I'm not
sure I'm happy with it.

Local filesystems, or network filesystems which receive strong write
delegations, should only ever return EIO to fsync.  We should
concentrate on them first, I think.  As there is only one possible
error, the seq counter is sufficient to "clear" it once it has been
reported to fsync() (or write()?).

Other network filesystems could return a whole host of errors: ENOSPC
EDQUOT ESTALE EPERM EFBIG ...
Do we want to limit exactly which errors are allowed in generic code, or
do we just support EIO generically and expect the filesystem to sort out
the details for anything else?

One possible approach a filesystem could take is just to allow a single
async writeback error.  After that error, all subsequent write()
system calls become synchronous. As write() or fsync() is called on each
file descriptor (which could possibly have sent the write which caused
the error), an error is returned and that fact is counted.  Once we have
returned as many errors as there are open file descriptors
(i_writecount?), and have seen a successful write, the filesystem
forgets all recorded errors and switches back to async writes (for that
inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().

The "which could possibly have sent the write which caused the error" is
an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
flags to return async errors.  It allocates an nfs_open_context for each
user who opens a given inode, and stores an error in there.  Each dirty
pages is associated with one of these, so errors a sure to go to the
correct user, though not necessarily the correct fd at present.

When we specify the new behaviour we should be careful to be as vague as
possible while still saying what we need.  This allows filesystems some
flexibility.

  If an error happens during writeback, the next write() or fsync() (or
  ) on the file descriptor to which data was written will return -1
  with errno set to EIO or some other relevant error.  Other file
  descriptors open on the same file may receive EIO or some other error
  on a subsequent appropriate system call.
  It should not be assumed that close() will return an error.  fsync()
  must be called before close() if writeback errors are important to the
  application.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread NeilBrown

On Mon, Apr 03 2017, Jeff Layton wrote:

> On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
>> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
>> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
>> > > writeback problem.  If I understand correctly, at the time of write(),
>> > > filesystems check to see if they have enough blocks to satisfy the
>> > > request, so ENOSPC only comes up in the writeback context for thinly
>> > > provisioned devices.
>> > 
>> > No, ENOSPC on writeback can certainly happen with network filesystems.
>> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
>> > do an extra RPC on every buffered write. :)
>> 
>> Aaah, yes, network filesystems.  I would indeed not want to do an extra
>> RPC on every write to a hole (it's a hole vs non-hole question, rather
>> than a buffered/unbuffered question ... unless you're WAFLing and not
>> reclaiming quickly enough, I suppose).
>> 
>> So, OK, that makes sense, we should keep allowing filesystems to report
>> ENOSPC as a writeback error.  But I think much of the argument below
>> still holds, and we should continue to have a prior EIO to be reported
>> over a new ENOSPC (even if the program has already consumed the EIO).
>> 
>
> I'm fine with that (though I'd like Neil's thoughts before we decide
> anything) there.

I'd like there be a well defined time when old errors were forgotten.
It does make sense for EIO to persist even if ENOSPC or EDQUOT is
received, but not forever.
Clearing the remembered errors when put_write_access() causes
i_writecount to reach zero is one option (as suggested), but I'm not
sure I'm happy with it.

Local filesystems, or network filesystems which receive strong write
delegations, should only ever return EIO to fsync.  We should
concentrate on them first, I think.  As there is only one possible
error, the seq counter is sufficient to "clear" it once it has been
reported to fsync() (or write()?).

Other network filesystems could return a whole host of errors: ENOSPC
EDQUOT ESTALE EPERM EFBIG ...
Do we want to limit exactly which errors are allowed in generic code, or
do we just support EIO generically and expect the filesystem to sort out
the details for anything else?

One possible approach a filesystem could take is just to allow a single
async writeback error.  After that error, all subsequent write()
system calls become synchronous. As write() or fsync() is called on each
file descriptor (which could possibly have sent the write which caused
the error), an error is returned and that fact is counted.  Once we have
returned as many errors as there are open file descriptors
(i_writecount?), and have seen a successful write, the filesystem
forgets all recorded errors and switches back to async writes (for that
inode).   NFS does this switch-to-sync-on-error.  See nfs_need_check_write().

The "which could possibly have sent the write which caused the error" is
an explicit reference to NFS.  NFS doesn't use the AS_EIO/AS_ENOSPC
flags to return async errors.  It allocates an nfs_open_context for each
user who opens a given inode, and stores an error in there.  Each dirty
pages is associated with one of these, so errors a sure to go to the
correct user, though not necessarily the correct fd at present.

When we specify the new behaviour we should be careful to be as vague as
possible while still saying what we need.  This allows filesystems some
flexibility.

  If an error happens during writeback, the next write() or fsync() (or
  ) on the file descriptor to which data was written will return -1
  with errno set to EIO or some other relevant error.  Other file
  descriptors open on the same file may receive EIO or some other error
  on a subsequent appropriate system call.
  It should not be assumed that close() will return an error.  fsync()
  must be called before close() if writeback errors are important to the
  application.

Thanks,
NeilBrown

signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 04:16:17PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > int filemap_report_wb_error(struct file *file)
> > {
> > struct inode *inode = file_inode(file);
> > struct address_space *mapping = file->f_mapping;
> > int err;
> > 
> > spin_lock(>i_lock);
> > if (file->f_wb_err == mapping->wb_err) {
> > err = 0;
> > } else if (mapping->wb_err & 1) {
> > filp->f_wb_err = mapping->wb_err & ~2;
> > err = -EIO;
> > } else {
> > filp->f_wb_err = mapping->wb_err;
> > err = -ENOSPC;
> > }
> > spin_unlock(>i_lock);
> > return err;
> > }
> > 
> > If I got that right, calling fsync() on an inode which has experienced
> > both errors would first get an EIO.  Calling fsync() on it again would
> > get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
> > either error occurs again, the thread will go back through the cycle
> > (EIO -> ENOSPC -> 0).
> > 
> 
> I don't think so? mapping->wb_err would still have 0x1 set after the
> first call so you'd always end up in the first else if branch.
> 
> It's getting toward beer 30 here though so I could be misreading it.

Well, yes, of course you misread it.  You read what I actually wrote
instead of what I intended to write.  Silly Jeff ...

int filemap_report_wb_error(struct file *file)
{
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
int err;

spin_lock(>i_lock);
if (file->f_wb_err == mapping->wb_err) {
err = 0;
} else if ((mapping->wb_err ^ file->f_wb_err) == 2) {
filp->f_wb_err = mapping->wb_err;
err = -ENOSPC;
} else {
filp->f_wb_err = mapping->wb_err & ~2;
err = -EIO;
}
spin_unlock(>i_lock);
return err;
}

The read side is easier in terms of atomic ...

int filemap_report_wb_error(struct file *file)
{
unsigned int wb_err = atomic_read(>f_mapping->wb_err)

if (file->f_wb_err == wb_err)
return 0;
if ((file->f_wb_err ^ wb_err) == 2) {
filp->f_wb_err = wb_err;
return -ENOSPC;
} else {
filp->f_wb_err = wb_err & ~2;
return -EIO;
}
}

but doing the write side with an atomic looks incredibly painful.  Since
we don't actually need to make the write side scalable, I'd rather see the
write side continue to use a spinlock and do the read side this way:

int filemap_report_wb_error(struct file *file)
{
unsigned int wb_err = READ_ONCE(file->f_mapping->wb_err)

if (file->f_wb_err == wb_err)
return 0;
if ((file->f_wb_err ^ wb_err) == 2) {
filp->f_wb_err = wb_err;
return -ENOSPC;
} else {
filp->f_wb_err = wb_err & ~2;
return -EIO;
}
}

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 04:16:17PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> > int filemap_report_wb_error(struct file *file)
> > {
> > struct inode *inode = file_inode(file);
> > struct address_space *mapping = file->f_mapping;
> > int err;
> > 
> > spin_lock(>i_lock);
> > if (file->f_wb_err == mapping->wb_err) {
> > err = 0;
> > } else if (mapping->wb_err & 1) {
> > filp->f_wb_err = mapping->wb_err & ~2;
> > err = -EIO;
> > } else {
> > filp->f_wb_err = mapping->wb_err;
> > err = -ENOSPC;
> > }
> > spin_unlock(>i_lock);
> > return err;
> > }
> > 
> > If I got that right, calling fsync() on an inode which has experienced
> > both errors would first get an EIO.  Calling fsync() on it again would
> > get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
> > either error occurs again, the thread will go back through the cycle
> > (EIO -> ENOSPC -> 0).
> > 
> 
> I don't think so? mapping->wb_err would still have 0x1 set after the
> first call so you'd always end up in the first else if branch.
> 
> It's getting toward beer 30 here though so I could be misreading it.

Well, yes, of course you misread it.  You read what I actually wrote
instead of what I intended to write.  Silly Jeff ...

int filemap_report_wb_error(struct file *file)
{
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
int err;

spin_lock(>i_lock);
if (file->f_wb_err == mapping->wb_err) {
err = 0;
} else if ((mapping->wb_err ^ file->f_wb_err) == 2) {
filp->f_wb_err = mapping->wb_err;
err = -ENOSPC;
} else {
filp->f_wb_err = mapping->wb_err & ~2;
err = -EIO;
}
spin_unlock(>i_lock);
return err;
}

The read side is easier in terms of atomic ...

int filemap_report_wb_error(struct file *file)
{
unsigned int wb_err = atomic_read(>f_mapping->wb_err)

if (file->f_wb_err == wb_err)
return 0;
if ((file->f_wb_err ^ wb_err) == 2) {
filp->f_wb_err = wb_err;
return -ENOSPC;
} else {
filp->f_wb_err = wb_err & ~2;
return -EIO;
}
}

but doing the write side with an atomic looks incredibly painful.  Since
we don't actually need to make the write side scalable, I'd rather see the
write side continue to use a spinlock and do the read side this way:

int filemap_report_wb_error(struct file *file)
{
unsigned int wb_err = READ_ONCE(file->f_mapping->wb_err)

if (file->f_wb_err == wb_err)
return 0;
if ((file->f_wb_err ^ wb_err) == 2) {
filp->f_wb_err = wb_err;
return -ENOSPC;
} else {
filp->f_wb_err = wb_err & ~2;
return -EIO;
}
}

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > writeback problem.  If I understand correctly, at the time of write(),
> > > filesystems check to see if they have enough blocks to satisfy the
> > > request, so ENOSPC only comes up in the writeback context for thinly
> > > provisioned devices.
> > 
> > No, ENOSPC on writeback can certainly happen with network filesystems.
> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > do an extra RPC on every buffered write. :)
> 
> Aaah, yes, network filesystems.  I would indeed not want to do an extra
> RPC on every write to a hole (it's a hole vs non-hole question, rather
> than a buffered/unbuffered question ... unless you're WAFLing and not
> reclaiming quickly enough, I suppose).
> 
> So, OK, that makes sense, we should keep allowing filesystems to report
> ENOSPC as a writeback error.  But I think much of the argument below
> still holds, and we should continue to have a prior EIO to be reported
> over a new ENOSPC (even if the program has already consumed the EIO).
> 

I'm fine with that (though I'd like Neil's thoughts before we decide
anything) there.

> If you find that unconvincing, we could do something like this ...
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
> 
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0);
> 
>   spin_lock(>i_lock);
>   if (err == -EIO)
>   mapping->wb_err |= 1;
>   else if (err == -ENOSPC)
>   mapping->wb_err |= 2;
>   mapping->wb_err += 4;
>   spin_unlock(>i_lock);
> }
> 
> int filemap_report_wb_error(struct file *file)
> {
>   struct inode *inode = file_inode(file);
>   struct address_space *mapping = file->f_mapping;
>   int err;
> 
>   spin_lock(>i_lock);
>   if (file->f_wb_err == mapping->wb_err) {
>   err = 0;
>   } else if (mapping->wb_err & 1) {
>   filp->f_wb_err = mapping->wb_err & ~2;
>   err = -EIO;
>   } else {
>   filp->f_wb_err = mapping->wb_err;
>   err = -ENOSPC;
>   }
>   spin_unlock(>i_lock);
>   return err;
> }
> 
> If I got that right, calling fsync() on an inode which has experienced
> both errors would first get an EIO.  Calling fsync() on it again would
> get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
> either error occurs again, the thread will go back through the cycle
> (EIO -> ENOSPC -> 0).
> 

I don't think so? mapping->wb_err would still have 0x1 set after the
first call so you'd always end up in the first else if branch.

It's getting toward beer 30 here though so I could be misreading it.

In any case, I'd rather not do this any more cleverly than we have to.
Simpler is better here, and letting EIO override ENOSPC would be
preferable to me.

Also, since we're going to do this with 32 bit values, it might be nice
to use atomics and not need the spinlock there, especially if we want
to be able to clear that value out when the i_writecount goes to 0.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 12:16 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > writeback problem.  If I understand correctly, at the time of write(),
> > > filesystems check to see if they have enough blocks to satisfy the
> > > request, so ENOSPC only comes up in the writeback context for thinly
> > > provisioned devices.
> > 
> > No, ENOSPC on writeback can certainly happen with network filesystems.
> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > do an extra RPC on every buffered write. :)
> 
> Aaah, yes, network filesystems.  I would indeed not want to do an extra
> RPC on every write to a hole (it's a hole vs non-hole question, rather
> than a buffered/unbuffered question ... unless you're WAFLing and not
> reclaiming quickly enough, I suppose).
> 
> So, OK, that makes sense, we should keep allowing filesystems to report
> ENOSPC as a writeback error.  But I think much of the argument below
> still holds, and we should continue to have a prior EIO to be reported
> over a new ENOSPC (even if the program has already consumed the EIO).
> 

I'm fine with that (though I'd like Neil's thoughts before we decide
anything) there.

> If you find that unconvincing, we could do something like this ...
> 
> void filemap_set_wb_error(struct address_space *mapping, int err)
> {
>   struct inode *inode = mapping->host;
> 
>   if (!err)
>   return;
>   /*
>* This should be called with the error code that we want to return
>* on fsync. Thus, it should always be <= 0.
>*/
>   WARN_ON(err > 0);
> 
>   spin_lock(>i_lock);
>   if (err == -EIO)
>   mapping->wb_err |= 1;
>   else if (err == -ENOSPC)
>   mapping->wb_err |= 2;
>   mapping->wb_err += 4;
>   spin_unlock(>i_lock);
> }
> 
> int filemap_report_wb_error(struct file *file)
> {
>   struct inode *inode = file_inode(file);
>   struct address_space *mapping = file->f_mapping;
>   int err;
> 
>   spin_lock(>i_lock);
>   if (file->f_wb_err == mapping->wb_err) {
>   err = 0;
>   } else if (mapping->wb_err & 1) {
>   filp->f_wb_err = mapping->wb_err & ~2;
>   err = -EIO;
>   } else {
>   filp->f_wb_err = mapping->wb_err;
>   err = -ENOSPC;
>   }
>   spin_unlock(>i_lock);
>   return err;
> }
> 
> If I got that right, calling fsync() on an inode which has experienced
> both errors would first get an EIO.  Calling fsync() on it again would
> get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
> either error occurs again, the thread will go back through the cycle
> (EIO -> ENOSPC -> 0).
> 

I don't think so? mapping->wb_err would still have 0x1 set after the
first call so you'd always end up in the first else if branch.

It's getting toward beer 30 here though so I could be misreading it.

In any case, I'd rather not do this any more cleverly than we have to.
Simpler is better here, and letting EIO override ENOSPC would be
preferable to me.

Also, since we're going to do this with 32 bit values, it might be nice
to use atomics and not need the spinlock there, especially if we want
to be able to clear that value out when the i_writecount goes to 0.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > writeback problem.  If I understand correctly, at the time of write(),
> > filesystems check to see if they have enough blocks to satisfy the
> > request, so ENOSPC only comes up in the writeback context for thinly
> > provisioned devices.
> 
> No, ENOSPC on writeback can certainly happen with network filesystems.
> NFS and CIFS have no way to reserve space. You wouldn't want to have to
> do an extra RPC on every buffered write. :)

Aaah, yes, network filesystems.  I would indeed not want to do an extra
RPC on every write to a hole (it's a hole vs non-hole question, rather
than a buffered/unbuffered question ... unless you're WAFLing and not
reclaiming quickly enough, I suppose).

So, OK, that makes sense, we should keep allowing filesystems to report
ENOSPC as a writeback error.  But I think much of the argument below
still holds, and we should continue to have a prior EIO to be reported
over a new ENOSPC (even if the program has already consumed the EIO).

If you find that unconvincing, we could do something like this ...

void filemap_set_wb_error(struct address_space *mapping, int err)
{
struct inode *inode = mapping->host;

if (!err)
return;
/*
 * This should be called with the error code that we want to return
 * on fsync. Thus, it should always be <= 0.
 */
WARN_ON(err > 0);

spin_lock(>i_lock);
if (err == -EIO)
mapping->wb_err |= 1;
else if (err == -ENOSPC)
mapping->wb_err |= 2;
mapping->wb_err += 4;
spin_unlock(>i_lock);
}

int filemap_report_wb_error(struct file *file)
{
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
int err;

spin_lock(>i_lock);
if (file->f_wb_err == mapping->wb_err) {
err = 0;
} else if (mapping->wb_err & 1) {
filp->f_wb_err = mapping->wb_err & ~2;
err = -EIO;
} else {
filp->f_wb_err = mapping->wb_err;
err = -ENOSPC;
}
spin_unlock(>i_lock);
return err;
}

If I got that right, calling fsync() on an inode which has experienced
both errors would first get an EIO.  Calling fsync() on it again would
get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
either error occurs again, the thread will go back through the cycle
(EIO -> ENOSPC -> 0).

> > Programs have basically no use for the distinction.  In either case,
> > the situation is the same.  The written data is safely in RAM and cannot
> > be written to the storage.  If one were to make superhuman efforts,
> > one could mmap the file and write() it to a different device, but that
> > is incredibly rare.  For most programs, the response is to just die and
> > let the human deal with the corrupted file.
> > 
> > From a sysadmin point of view, of course the situation is different,
> > and the remedy is different, but they should be getting that information
> > through a different mechanism than monitoring the errno from every
> > system call.
> > 
> > If we do want to continue to support both EIO and ENOSPC from writeback,
> > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > in after an EIO is set, it only bumps the counter and applications will
> > see EIO, not ENOSPC on fresh calls to fsync().

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > writeback problem.  If I understand correctly, at the time of write(),
> > filesystems check to see if they have enough blocks to satisfy the
> > request, so ENOSPC only comes up in the writeback context for thinly
> > provisioned devices.
> 
> No, ENOSPC on writeback can certainly happen with network filesystems.
> NFS and CIFS have no way to reserve space. You wouldn't want to have to
> do an extra RPC on every buffered write. :)

Aaah, yes, network filesystems.  I would indeed not want to do an extra
RPC on every write to a hole (it's a hole vs non-hole question, rather
than a buffered/unbuffered question ... unless you're WAFLing and not
reclaiming quickly enough, I suppose).

So, OK, that makes sense, we should keep allowing filesystems to report
ENOSPC as a writeback error.  But I think much of the argument below
still holds, and we should continue to have a prior EIO to be reported
over a new ENOSPC (even if the program has already consumed the EIO).

If you find that unconvincing, we could do something like this ...

void filemap_set_wb_error(struct address_space *mapping, int err)
{
struct inode *inode = mapping->host;

if (!err)
return;
/*
 * This should be called with the error code that we want to return
 * on fsync. Thus, it should always be <= 0.
 */
WARN_ON(err > 0);

spin_lock(>i_lock);
if (err == -EIO)
mapping->wb_err |= 1;
else if (err == -ENOSPC)
mapping->wb_err |= 2;
mapping->wb_err += 4;
spin_unlock(>i_lock);
}

int filemap_report_wb_error(struct file *file)
{
struct inode *inode = file_inode(file);
struct address_space *mapping = file->f_mapping;
int err;

spin_lock(>i_lock);
if (file->f_wb_err == mapping->wb_err) {
err = 0;
} else if (mapping->wb_err & 1) {
filp->f_wb_err = mapping->wb_err & ~2;
err = -EIO;
} else {
filp->f_wb_err = mapping->wb_err;
err = -ENOSPC;
}
spin_unlock(>i_lock);
return err;
}

If I got that right, calling fsync() on an inode which has experienced
both errors would first get an EIO.  Calling fsync() on it again would
get an ENOSPC.  Calling fsync() on it a third time would get 0.  When
either error occurs again, the thread will go back through the cycle
(EIO -> ENOSPC -> 0).

> > Programs have basically no use for the distinction.  In either case,
> > the situation is the same.  The written data is safely in RAM and cannot
> > be written to the storage.  If one were to make superhuman efforts,
> > one could mmap the file and write() it to a different device, but that
> > is incredibly rare.  For most programs, the response is to just die and
> > let the human deal with the corrupted file.
> > 
> > From a sysadmin point of view, of course the situation is different,
> > and the remedy is different, but they should be getting that information
> > through a different mechanism than monitoring the errno from every
> > system call.
> > 
> > If we do want to continue to support both EIO and ENOSPC from writeback,
> > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > in after an EIO is set, it only bumps the counter and applications will
> > see EIO, not ENOSPC on fresh calls to fsync().

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 11:40 -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 11:36:48AM -0700, Jeremy Allison wrote:
> > On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > > > 
> > > > CIFS has a way to reserve space. Look into "allocation size" on create.
> > > 
> > > That won't help here as it's done on open().
> > > 
> > > The problem here is that we might create a file (and not preallocate
> > > anything), then write a bunch of stuff to the cache under an oplock.
> > > Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> > > 
> > > What local filesystems do (AIUI) is preallocate so that you can catch
> > > an ENOSPC condition earlier, when you're dirtying new pages in the
> > > cache. That's pretty much impossible to do on a network filesystem
> > > though.
> > 
> > There's also SMB_SET_FILE_ALLOCATION_INFO which can be
> > done over SMB1/2/3 on an open file handle.
> 
> There's *always* a way to do something in SMB1/2/3. :-).

Yes, indeed...Still, I think we'll need to deal with this during
writeback as well. Earlier versions of NFS certainly don't have
anything along those lines, though you could probably do some sort of
speculative preallocation with v4.2.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 11:40 -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 11:36:48AM -0700, Jeremy Allison wrote:
> > On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > > > 
> > > > CIFS has a way to reserve space. Look into "allocation size" on create.
> > > 
> > > That won't help here as it's done on open().
> > > 
> > > The problem here is that we might create a file (and not preallocate
> > > anything), then write a bunch of stuff to the cache under an oplock.
> > > Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> > > 
> > > What local filesystems do (AIUI) is preallocate so that you can catch
> > > an ENOSPC condition earlier, when you're dirtying new pages in the
> > > cache. That's pretty much impossible to do on a network filesystem
> > > though.
> > 
> > There's also SMB_SET_FILE_ALLOCATION_INFO which can be
> > done over SMB1/2/3 on an open file handle.
> 
> There's *always* a way to do something in SMB1/2/3. :-).

Yes, indeed...Still, I think we'll need to deal with this during
writeback as well. Earlier versions of NFS certainly don't have
anything along those lines, though you could probably do some sort of
speculative preallocation with v4.2.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeremy Allison

On Mon, Apr 03, 2017 at 11:36:48AM -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > > 
> > > CIFS has a way to reserve space. Look into "allocation size" on create.
> > 
> > That won't help here as it's done on open().
> > 
> > The problem here is that we might create a file (and not preallocate
> > anything), then write a bunch of stuff to the cache under an oplock.
> > Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> > 
> > What local filesystems do (AIUI) is preallocate so that you can catch
> > an ENOSPC condition earlier, when you're dirtying new pages in the
> > cache. That's pretty much impossible to do on a network filesystem
> > though.
> 
> There's also SMB_SET_FILE_ALLOCATION_INFO which can be
> done over SMB1/2/3 on an open file handle.

There's *always* a way to do something in SMB1/2/3. :-).

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeremy Allison

On Mon, Apr 03, 2017 at 11:36:48AM -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > > 
> > > CIFS has a way to reserve space. Look into "allocation size" on create.
> > 
> > That won't help here as it's done on open().
> > 
> > The problem here is that we might create a file (and not preallocate
> > anything), then write a bunch of stuff to the cache under an oplock.
> > Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> > 
> > What local filesystems do (AIUI) is preallocate so that you can catch
> > an ENOSPC condition earlier, when you're dirtying new pages in the
> > cache. That's pretty much impossible to do on a network filesystem
> > though.
> 
> There's also SMB_SET_FILE_ALLOCATION_INFO which can be
> done over SMB1/2/3 on an open file handle.

There's *always* a way to do something in SMB1/2/3. :-).

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeremy Allison

On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > > > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > > > responses are different.  That probably means you need a separate 
> > > > > > seq
> > > > > > number for each, which isn't ideal.
> > > > > > 
> > > > > 
> > > > > I'm not quite convinced that it's really useful to do anything but
> > > > > report the latest error.
> > > > > 
> > > > > But...if we did need to prefer one over another, could we get away 
> > > > > with
> > > > > always reporting -EIO once that error occurs? If so, then we'd still
> > > > > just need a single sequence counter.
> > > > 
> > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > > writeback problem.  If I understand correctly, at the time of write(),
> > > > filesystems check to see if they have enough blocks to satisfy the
> > > > request, so ENOSPC only comes up in the writeback context for thinly
> > > > provisioned devices.
> > > > 
> > > > Programs have basically no use for the distinction.  In either case,
> > > > the situation is the same.  The written data is safely in RAM and cannot
> > > > be written to the storage.  If one were to make superhuman efforts,
> > > > one could mmap the file and write() it to a different device, but that
> > > > is incredibly rare.  For most programs, the response is to just die and
> > > > let the human deal with the corrupted file.
> > > > 
> > > > From a sysadmin point of view, of course the situation is different,
> > > > and the remedy is different, but they should be getting that information
> > > > through a different mechanism than monitoring the errno from every
> > > > system call.
> > > > 
> > > > If we do want to continue to support both EIO and ENOSPC from writeback,
> > > > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > > > in after an EIO is set, it only bumps the counter and applications will
> > > > see EIO, not ENOSPC on fresh calls to fsync().
> > > 
> > > 
> > > No, ENOSPC on writeback can certainly happen with network filesystems.
> > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > > do an extra RPC on every buffered write. :)
> > 
> > CIFS has a way to reserve space. Look into "allocation size" on create.
> 
> That won't help here as it's done on open().
> 
> The problem here is that we might create a file (and not preallocate
> anything), then write a bunch of stuff to the cache under an oplock.
> Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> 
> What local filesystems do (AIUI) is preallocate so that you can catch
> an ENOSPC condition earlier, when you're dirtying new pages in the
> cache. That's pretty much impossible to do on a network filesystem
> though.

There's also SMB_SET_FILE_ALLOCATION_INFO which can be
done over SMB1/2/3 on an open file handle.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeremy Allison

On Mon, Apr 03, 2017 at 02:18:44PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> > On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > > > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > > > responses are different.  That probably means you need a separate 
> > > > > > seq
> > > > > > number for each, which isn't ideal.
> > > > > > 
> > > > > 
> > > > > I'm not quite convinced that it's really useful to do anything but
> > > > > report the latest error.
> > > > > 
> > > > > But...if we did need to prefer one over another, could we get away 
> > > > > with
> > > > > always reporting -EIO once that error occurs? If so, then we'd still
> > > > > just need a single sequence counter.
> > > > 
> > > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > > writeback problem.  If I understand correctly, at the time of write(),
> > > > filesystems check to see if they have enough blocks to satisfy the
> > > > request, so ENOSPC only comes up in the writeback context for thinly
> > > > provisioned devices.
> > > > 
> > > > Programs have basically no use for the distinction.  In either case,
> > > > the situation is the same.  The written data is safely in RAM and cannot
> > > > be written to the storage.  If one were to make superhuman efforts,
> > > > one could mmap the file and write() it to a different device, but that
> > > > is incredibly rare.  For most programs, the response is to just die and
> > > > let the human deal with the corrupted file.
> > > > 
> > > > From a sysadmin point of view, of course the situation is different,
> > > > and the remedy is different, but they should be getting that information
> > > > through a different mechanism than monitoring the errno from every
> > > > system call.
> > > > 
> > > > If we do want to continue to support both EIO and ENOSPC from writeback,
> > > > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > > > in after an EIO is set, it only bumps the counter and applications will
> > > > see EIO, not ENOSPC on fresh calls to fsync().
> > > 
> > > 
> > > No, ENOSPC on writeback can certainly happen with network filesystems.
> > > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > > do an extra RPC on every buffered write. :)
> > 
> > CIFS has a way to reserve space. Look into "allocation size" on create.
> 
> That won't help here as it's done on open().
> 
> The problem here is that we might create a file (and not preallocate
> anything), then write a bunch of stuff to the cache under an oplock.
> Then when we go to write back, we get the CIFS equivalent of -ENOSPC.
> 
> What local filesystems do (AIUI) is preallocate so that you can catch
> an ENOSPC condition earlier, when you're dirtying new pages in the
> cache. That's pretty much impossible to do on a network filesystem
> though.

There's also SMB_SET_FILE_ALLOCATION_INFO which can be
done over SMB1/2/3 on an open file handle.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > > responses are different.  That probably means you need a separate seq
> > > > > number for each, which isn't ideal.
> > > > > 
> > > > 
> > > > I'm not quite convinced that it's really useful to do anything but
> > > > report the latest error.
> > > > 
> > > > But...if we did need to prefer one over another, could we get away with
> > > > always reporting -EIO once that error occurs? If so, then we'd still
> > > > just need a single sequence counter.
> > > 
> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > writeback problem.  If I understand correctly, at the time of write(),
> > > filesystems check to see if they have enough blocks to satisfy the
> > > request, so ENOSPC only comes up in the writeback context for thinly
> > > provisioned devices.
> > > 
> > > Programs have basically no use for the distinction.  In either case,
> > > the situation is the same.  The written data is safely in RAM and cannot
> > > be written to the storage.  If one were to make superhuman efforts,
> > > one could mmap the file and write() it to a different device, but that
> > > is incredibly rare.  For most programs, the response is to just die and
> > > let the human deal with the corrupted file.
> > > 
> > > From a sysadmin point of view, of course the situation is different,
> > > and the remedy is different, but they should be getting that information
> > > through a different mechanism than monitoring the errno from every
> > > system call.
> > > 
> > > If we do want to continue to support both EIO and ENOSPC from writeback,
> > > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > > in after an EIO is set, it only bumps the counter and applications will
> > > see EIO, not ENOSPC on fresh calls to fsync().
> > 
> > 
> > No, ENOSPC on writeback can certainly happen with network filesystems.
> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > do an extra RPC on every buffered write. :)
> 
> CIFS has a way to reserve space. Look into "allocation size" on create.

That won't help here as it's done on open().

The problem here is that we might create a file (and not preallocate
anything), then write a bunch of stuff to the cache under an oplock.
Then when we go to write back, we get the CIFS equivalent of -ENOSPC.

What local filesystems do (AIUI) is preallocate so that you can catch
an ENOSPC condition earlier, when you're dirtying new pages in the
cache. That's pretty much impossible to do on a network filesystem
though.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 11:09 -0700, Jeremy Allison wrote:
> On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > > responses are different.  That probably means you need a separate seq
> > > > > number for each, which isn't ideal.
> > > > > 
> > > > 
> > > > I'm not quite convinced that it's really useful to do anything but
> > > > report the latest error.
> > > > 
> > > > But...if we did need to prefer one over another, could we get away with
> > > > always reporting -EIO once that error occurs? If so, then we'd still
> > > > just need a single sequence counter.
> > > 
> > > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > > writeback problem.  If I understand correctly, at the time of write(),
> > > filesystems check to see if they have enough blocks to satisfy the
> > > request, so ENOSPC only comes up in the writeback context for thinly
> > > provisioned devices.
> > > 
> > > Programs have basically no use for the distinction.  In either case,
> > > the situation is the same.  The written data is safely in RAM and cannot
> > > be written to the storage.  If one were to make superhuman efforts,
> > > one could mmap the file and write() it to a different device, but that
> > > is incredibly rare.  For most programs, the response is to just die and
> > > let the human deal with the corrupted file.
> > > 
> > > From a sysadmin point of view, of course the situation is different,
> > > and the remedy is different, but they should be getting that information
> > > through a different mechanism than monitoring the errno from every
> > > system call.
> > > 
> > > If we do want to continue to support both EIO and ENOSPC from writeback,
> > > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > > in after an EIO is set, it only bumps the counter and applications will
> > > see EIO, not ENOSPC on fresh calls to fsync().
> > 
> > 
> > No, ENOSPC on writeback can certainly happen with network filesystems.
> > NFS and CIFS have no way to reserve space. You wouldn't want to have to
> > do an extra RPC on every buffered write. :)
> 
> CIFS has a way to reserve space. Look into "allocation size" on create.

That won't help here as it's done on open().

The problem here is that we might create a file (and not preallocate
anything), then write a bunch of stuff to the cache under an oplock.
Then when we go to write back, we get the CIFS equivalent of -ENOSPC.

What local filesystems do (AIUI) is preallocate so that you can catch
an ENOSPC condition earlier, when you're dirtying new pages in the
cache. That's pretty much impossible to do on a network filesystem
though.

-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeremy Allison

On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > responses are different.  That probably means you need a separate seq
> > > > number for each, which isn't ideal.
> > > > 
> > > 
> > > I'm not quite convinced that it's really useful to do anything but
> > > report the latest error.
> > > 
> > > But...if we did need to prefer one over another, could we get away with
> > > always reporting -EIO once that error occurs? If so, then we'd still
> > > just need a single sequence counter.
> > 
> > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > writeback problem.  If I understand correctly, at the time of write(),
> > filesystems check to see if they have enough blocks to satisfy the
> > request, so ENOSPC only comes up in the writeback context for thinly
> > provisioned devices.
> > 
> > Programs have basically no use for the distinction.  In either case,
> > the situation is the same.  The written data is safely in RAM and cannot
> > be written to the storage.  If one were to make superhuman efforts,
> > one could mmap the file and write() it to a different device, but that
> > is incredibly rare.  For most programs, the response is to just die and
> > let the human deal with the corrupted file.
> > 
> > From a sysadmin point of view, of course the situation is different,
> > and the remedy is different, but they should be getting that information
> > through a different mechanism than monitoring the errno from every
> > system call.
> > 
> > If we do want to continue to support both EIO and ENOSPC from writeback,
> > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > in after an EIO is set, it only bumps the counter and applications will
> > see EIO, not ENOSPC on fresh calls to fsync().
> 
> 
> No, ENOSPC on writeback can certainly happen with network filesystems.
> NFS and CIFS have no way to reserve space. You wouldn't want to have to
> do an extra RPC on every buffered write. :)

CIFS has a way to reserve space. Look into "allocation size" on create.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeremy Allison

On Mon, Apr 03, 2017 at 01:47:37PM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> > On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > > responses are different.  That probably means you need a separate seq
> > > > number for each, which isn't ideal.
> > > > 
> > > 
> > > I'm not quite convinced that it's really useful to do anything but
> > > report the latest error.
> > > 
> > > But...if we did need to prefer one over another, could we get away with
> > > always reporting -EIO once that error occurs? If so, then we'd still
> > > just need a single sequence counter.
> > 
> > I wonder whether it's even worth supporting both EIO and ENOSPC for a
> > writeback problem.  If I understand correctly, at the time of write(),
> > filesystems check to see if they have enough blocks to satisfy the
> > request, so ENOSPC only comes up in the writeback context for thinly
> > provisioned devices.
> > 
> > Programs have basically no use for the distinction.  In either case,
> > the situation is the same.  The written data is safely in RAM and cannot
> > be written to the storage.  If one were to make superhuman efforts,
> > one could mmap the file and write() it to a different device, but that
> > is incredibly rare.  For most programs, the response is to just die and
> > let the human deal with the corrupted file.
> > 
> > From a sysadmin point of view, of course the situation is different,
> > and the remedy is different, but they should be getting that information
> > through a different mechanism than monitoring the errno from every
> > system call.
> > 
> > If we do want to continue to support both EIO and ENOSPC from writeback,
> > then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> > in after an EIO is set, it only bumps the counter and applications will
> > see EIO, not ENOSPC on fresh calls to fsync().
> 
> 
> No, ENOSPC on writeback can certainly happen with network filesystems.
> NFS and CIFS have no way to reserve space. You wouldn't want to have to
> do an extra RPC on every buffered write. :)

CIFS has a way to reserve space. Look into "allocation size" on create.

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > responses are different.  That probably means you need a separate seq
> > > number for each, which isn't ideal.
> > > 
> > 
> > I'm not quite convinced that it's really useful to do anything but
> > report the latest error.
> > 
> > But...if we did need to prefer one over another, could we get away with
> > always reporting -EIO once that error occurs? If so, then we'd still
> > just need a single sequence counter.
> 
> I wonder whether it's even worth supporting both EIO and ENOSPC for a
> writeback problem.  If I understand correctly, at the time of write(),
> filesystems check to see if they have enough blocks to satisfy the
> request, so ENOSPC only comes up in the writeback context for thinly
> provisioned devices.
> 
> Programs have basically no use for the distinction.  In either case,
> the situation is the same.  The written data is safely in RAM and cannot
> be written to the storage.  If one were to make superhuman efforts,
> one could mmap the file and write() it to a different device, but that
> is incredibly rare.  For most programs, the response is to just die and
> let the human deal with the corrupted file.
> 
> From a sysadmin point of view, of course the situation is different,
> and the remedy is different, but they should be getting that information
> through a different mechanism than monitoring the errno from every
> system call.
> 
> If we do want to continue to support both EIO and ENOSPC from writeback,
> then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> in after an EIO is set, it only bumps the counter and applications will
> see EIO, not ENOSPC on fresh calls to fsync().


No, ENOSPC on writeback can certainly happen with network filesystems.
NFS and CIFS have no way to reserve space. You wouldn't want to have to
do an extra RPC on every buffered write. :)
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 07:32 -0700, Matthew Wilcox wrote:
> On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> > On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > > Also I think that EIO should always over-ride ENOSPC as the possible
> > > responses are different.  That probably means you need a separate seq
> > > number for each, which isn't ideal.
> > > 
> > 
> > I'm not quite convinced that it's really useful to do anything but
> > report the latest error.
> > 
> > But...if we did need to prefer one over another, could we get away with
> > always reporting -EIO once that error occurs? If so, then we'd still
> > just need a single sequence counter.
> 
> I wonder whether it's even worth supporting both EIO and ENOSPC for a
> writeback problem.  If I understand correctly, at the time of write(),
> filesystems check to see if they have enough blocks to satisfy the
> request, so ENOSPC only comes up in the writeback context for thinly
> provisioned devices.
> 
> Programs have basically no use for the distinction.  In either case,
> the situation is the same.  The written data is safely in RAM and cannot
> be written to the storage.  If one were to make superhuman efforts,
> one could mmap the file and write() it to a different device, but that
> is incredibly rare.  For most programs, the response is to just die and
> let the human deal with the corrupted file.
> 
> From a sysadmin point of view, of course the situation is different,
> and the remedy is different, but they should be getting that information
> through a different mechanism than monitoring the errno from every
> system call.
> 
> If we do want to continue to support both EIO and ENOSPC from writeback,
> then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
> in after an EIO is set, it only bumps the counter and applications will
> see EIO, not ENOSPC on fresh calls to fsync().


No, ENOSPC on writeback can certainly happen with network filesystems.
NFS and CIFS have no way to reserve space. You wouldn't want to have to
do an extra RPC on every buffered write. :)
-- 
Jeff Layton

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 02:25:11PM +1000, NeilBrown wrote:
> I don't like that you need to add a 'flush' handler to every filesystem,
> most of which just call
>  +return filemap_report_wb_error(file);
> 
> Could we just have
>   if (filp->f_op->flush)
>   retval = filp->f_op->flush(filp, id);
> + else
> + retval = filemap_report_wb_error(filp);
> in flip_close() ??

Maybe this is badly named as ext4_flush_file().  Maybe this should be
generic_flush_file(), and then there's no per-filesystem overhead to this?

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 02:25:11PM +1000, NeilBrown wrote:
> I don't like that you need to add a 'flush' handler to every filesystem,
> most of which just call
>  +return filemap_report_wb_error(file);
> 
> Could we just have
>   if (filp->f_op->flush)
>   retval = filp->f_op->flush(filp, id);
> + else
> + retval = filemap_report_wb_error(filp);
> in flip_close() ??

Maybe this is badly named as ext4_flush_file().  Maybe this should be
generic_flush_file(), and then there's no per-filesystem overhead to this?

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > Also I think that EIO should always over-ride ENOSPC as the possible
> > responses are different.  That probably means you need a separate seq
> > number for each, which isn't ideal.
> > 
> 
> I'm not quite convinced that it's really useful to do anything but
> report the latest error.
> 
> But...if we did need to prefer one over another, could we get away with
> always reporting -EIO once that error occurs? If so, then we'd still
> just need a single sequence counter.

I wonder whether it's even worth supporting both EIO and ENOSPC for a
writeback problem.  If I understand correctly, at the time of write(),
filesystems check to see if they have enough blocks to satisfy the
request, so ENOSPC only comes up in the writeback context for thinly
provisioned devices.

Programs have basically no use for the distinction.  In either case,
the situation is the same.  The written data is safely in RAM and cannot
be written to the storage.  If one were to make superhuman efforts,
one could mmap the file and write() it to a different device, but that
is incredibly rare.  For most programs, the response is to just die and
let the human deal with the corrupted file.

>From a sysadmin point of view, of course the situation is different,
and the remedy is different, but they should be getting that information
through a different mechanism than monitoring the errno from every
system call.

If we do want to continue to support both EIO and ENOSPC from writeback,
then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
in after an EIO is set, it only bumps the counter and applications will
see EIO, not ENOSPC on fresh calls to fsync().

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Matthew Wilcox

On Mon, Apr 03, 2017 at 06:28:38AM -0400, Jeff Layton wrote:
> On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> > Also I think that EIO should always over-ride ENOSPC as the possible
> > responses are different.  That probably means you need a separate seq
> > number for each, which isn't ideal.
> > 
> 
> I'm not quite convinced that it's really useful to do anything but
> report the latest error.
> 
> But...if we did need to prefer one over another, could we get away with
> always reporting -EIO once that error occurs? If so, then we'd still
> just need a single sequence counter.

I wonder whether it's even worth supporting both EIO and ENOSPC for a
writeback problem.  If I understand correctly, at the time of write(),
filesystems check to see if they have enough blocks to satisfy the
request, so ENOSPC only comes up in the writeback context for thinly
provisioned devices.

Programs have basically no use for the distinction.  In either case,
the situation is the same.  The written data is safely in RAM and cannot
be written to the storage.  If one were to make superhuman efforts,
one could mmap the file and write() it to a different device, but that
is incredibly rare.  For most programs, the response is to just die and
let the human deal with the corrupted file.

>From a sysadmin point of view, of course the situation is different,
and the remedy is different, but they should be getting that information
through a different mechanism than monitoring the errno from every
system call.

If we do want to continue to support both EIO and ENOSPC from writeback,
then let's have EIO override ENOSPC as an error.  ie if an ENOSPC comes
in after an EIO is set, it only bumps the counter and applications will
see EIO, not ENOSPC on fresh calls to fsync().

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> On Fri, Mar 31 2017, Jeff Layton wrote:
> 
> > During LSF/MM this year, we had a discussion about the current sorry
> > state of writeback error reporting, and what could be done to improve
> > the situation. This patchset represents a first pass at the proposal
> > I made there.
> > 
> > It first adds a new set of writeback error tracking infrastructure to
> > ensure that errors are properly stored and reported at fsync time. It
> > also makes a small but significant change to ensure that writeback
> > errors are reported on all file descriptors, not just on the first one
> > where fsync is called.
> > 
> > Note that this is a _very_ rough draft at this point. I did some by-hand
> > testing with dm-error to ensure that it does the right thing there.
> > Mostly I'm interested in early feedback at this point -- does this basic
> > approach make sense?
> 
> I think that having ->wb_err_seq and returning errors to all file
> descriptors is a good idea.
> I don't like ->wb_err, particularly that you allow it to be set
> to zero:
>  +/*
>  + * This should be called with the error code that we want to return
>  + * on fsync. Thus, it should always be <= 0.
>  + */
>  +WARN_ON(err > 0);
> 
> Why is that ??
> 

It's because I wasn't thinking about all of the places that currently
call mapping_set_error with an error of 0. This worked for ext4 since we
only call this when there is an actual error. You're correct here -- we
should only set the error when it's non-zero. I'll fix that.

> Also I think that EIO should always over-ride ENOSPC as the possible
> responses are different.  That probably means you need a separate seq
> number for each, which isn't ideal.
> 

I'm not quite convinced that it's really useful to do anything but
report the latest error.

But...if we did need to prefer one over another, could we get away with
always reporting -EIO once that error occurs? If so, then we'd still
just need a single sequence counter.

> I don't like that you need to add a 'flush' handler to every filesystem,
> most of which just call
>  +return filemap_report_wb_error(file);
> 
> Could we just have
>   if (filp->f_op->flush)
>   retval = filp->f_op->flush(filp, id);
> + else
> + retval = filemap_report_wb_error(filp);
> in flip_close() ??
> 

Sure, that's possible.

I'm leery of making too much in the way of changes to the generic VFS
layer code just yet. After making several abortive attempts to try to
fix some of this with large, sweeping changes to the code, I think the
approach of doing this on a per-filesystem basis will be saner.

My concern there for now is that some code (e.g. fs/buffer.c) is shared
between filesystems and will need to call both routines in the interim.
Suppose we have a filesystem (ext2?) that is using the older routines
for now. Making the change to filp_close above might subtly change its
behavior, and I don't think we want to do that.

Once we have everything converted to use the newer API, we should be
able to collapse a lot of the flush routines into the above though.

> ... or maybe it is wrong to return this error on close().
> After all, the file actually does get closed, so no error occurred.
> If an application cares about EIO, it should always call fsync() before
> close().
> 

Applications should, but the close(2) manpage does say:

   Not  checking  the return value of close() is a common
   but nevertheless serious  programming  error.   It  is
   quite  possible  that  errors  on  a previous write(2)
   operation are first reported  at  the  final  close().
   Not  checking  the  return value when closing the file
   may lead to silent loss of data.

POSIX seems to say that that behavior is optional, but I think reporting
errors at close is a good idea. There are programs that do check for
that, but whether they do anything useful with the error is a little
less clear.

> > 
> > Jeff Layton (4):
> >   fs: new infrastructure for writeback error handling and reporting
> >   dax: set errors in mapping when writeback fails
> >   buffer: set wb errors using both new and old infrastructure for now
> >   ext4: wire it up to the new writeback error reporting infrastructure
> > 
> >  Documentation/filesystems/vfs.txt | 14 +++--
> >  fs/buffer.c   |  6 +++-
> >  fs/dax.c  |  4 ++-
> >  fs/ext4/dir.c |  1 +
> >  fs/ext4/ext4.h|  1 +
> >  fs/ext4/file.c|  1 +
> >  fs/ext4/fsync.c   | 15 +++---
> >  fs/ext4/inode.c   |  2 +-
> >  fs/ext4/page-io.c |  4 +--
> >  fs/open.c |  3 ++
> >  include/linux/fs.h|  5 
> >  mm/filemap.c  | 61 
> > +++
> >  12 files changed, 106 insertions(+), 11 deletions(-)
> > 
>

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-03 Thread Jeff Layton

On Mon, 2017-04-03 at 14:25 +1000, NeilBrown wrote:
> On Fri, Mar 31 2017, Jeff Layton wrote:
> 
> > During LSF/MM this year, we had a discussion about the current sorry
> > state of writeback error reporting, and what could be done to improve
> > the situation. This patchset represents a first pass at the proposal
> > I made there.
> > 
> > It first adds a new set of writeback error tracking infrastructure to
> > ensure that errors are properly stored and reported at fsync time. It
> > also makes a small but significant change to ensure that writeback
> > errors are reported on all file descriptors, not just on the first one
> > where fsync is called.
> > 
> > Note that this is a _very_ rough draft at this point. I did some by-hand
> > testing with dm-error to ensure that it does the right thing there.
> > Mostly I'm interested in early feedback at this point -- does this basic
> > approach make sense?
> 
> I think that having ->wb_err_seq and returning errors to all file
> descriptors is a good idea.
> I don't like ->wb_err, particularly that you allow it to be set
> to zero:
>  +/*
>  + * This should be called with the error code that we want to return
>  + * on fsync. Thus, it should always be <= 0.
>  + */
>  +WARN_ON(err > 0);
> 
> Why is that ??
> 

It's because I wasn't thinking about all of the places that currently
call mapping_set_error with an error of 0. This worked for ext4 since we
only call this when there is an actual error. You're correct here -- we
should only set the error when it's non-zero. I'll fix that.

> Also I think that EIO should always over-ride ENOSPC as the possible
> responses are different.  That probably means you need a separate seq
> number for each, which isn't ideal.
> 

I'm not quite convinced that it's really useful to do anything but
report the latest error.

But...if we did need to prefer one over another, could we get away with
always reporting -EIO once that error occurs? If so, then we'd still
just need a single sequence counter.

> I don't like that you need to add a 'flush' handler to every filesystem,
> most of which just call
>  +return filemap_report_wb_error(file);
> 
> Could we just have
>   if (filp->f_op->flush)
>   retval = filp->f_op->flush(filp, id);
> + else
> + retval = filemap_report_wb_error(filp);
> in flip_close() ??
> 

Sure, that's possible.

I'm leery of making too much in the way of changes to the generic VFS
layer code just yet. After making several abortive attempts to try to
fix some of this with large, sweeping changes to the code, I think the
approach of doing this on a per-filesystem basis will be saner.

My concern there for now is that some code (e.g. fs/buffer.c) is shared
between filesystems and will need to call both routines in the interim.
Suppose we have a filesystem (ext2?) that is using the older routines
for now. Making the change to filp_close above might subtly change its
behavior, and I don't think we want to do that.

Once we have everything converted to use the newer API, we should be
able to collapse a lot of the flush routines into the above though.

> ... or maybe it is wrong to return this error on close().
> After all, the file actually does get closed, so no error occurred.
> If an application cares about EIO, it should always call fsync() before
> close().
> 

Applications should, but the close(2) manpage does say:

   Not  checking  the return value of close() is a common
   but nevertheless serious  programming  error.   It  is
   quite  possible  that  errors  on  a previous write(2)
   operation are first reported  at  the  final  close().
   Not  checking  the  return value when closing the file
   may lead to silent loss of data.

POSIX seems to say that that behavior is optional, but I think reporting
errors at close is a good idea. There are programs that do check for
that, but whether they do anything useful with the error is a little
less clear.

> > 
> > Jeff Layton (4):
> >   fs: new infrastructure for writeback error handling and reporting
> >   dax: set errors in mapping when writeback fails
> >   buffer: set wb errors using both new and old infrastructure for now
> >   ext4: wire it up to the new writeback error reporting infrastructure
> > 
> >  Documentation/filesystems/vfs.txt | 14 +++--
> >  fs/buffer.c   |  6 +++-
> >  fs/dax.c  |  4 ++-
> >  fs/ext4/dir.c |  1 +
> >  fs/ext4/ext4.h|  1 +
> >  fs/ext4/file.c|  1 +
> >  fs/ext4/fsync.c   | 15 +++---
> >  fs/ext4/inode.c   |  2 +-
> >  fs/ext4/page-io.c |  4 +--
> >  fs/open.c |  3 ++
> >  include/linux/fs.h|  5 
> >  mm/filemap.c  | 61 
> > +++
> >  12 files changed, 106 insertions(+), 11 deletions(-)
> > 
>

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-02 Thread NeilBrown

On Fri, Mar 31 2017, Jeff Layton wrote:

> During LSF/MM this year, we had a discussion about the current sorry
> state of writeback error reporting, and what could be done to improve
> the situation. This patchset represents a first pass at the proposal
> I made there.
>
> It first adds a new set of writeback error tracking infrastructure to
> ensure that errors are properly stored and reported at fsync time. It
> also makes a small but significant change to ensure that writeback
> errors are reported on all file descriptors, not just on the first one
> where fsync is called.
>
> Note that this is a _very_ rough draft at this point. I did some by-hand
> testing with dm-error to ensure that it does the right thing there.
> Mostly I'm interested in early feedback at this point -- does this basic
> approach make sense?

I think that having ->wb_err_seq and returning errors to all file
descriptors is a good idea.
I don't like ->wb_err, particularly that you allow it to be set
to zero:
 +  /*
 +   * This should be called with the error code that we want to return
 +   * on fsync. Thus, it should always be <= 0.
 +   */
 +  WARN_ON(err > 0);

Why is that ??

Also I think that EIO should always over-ride ENOSPC as the possible
responses are different.  That probably means you need a separate seq
number for each, which isn't ideal.

I don't like that you need to add a 'flush' handler to every filesystem,
most of which just call
 +  return filemap_report_wb_error(file);

Could we just have
if (filp->f_op->flush)
retval = filp->f_op->flush(filp, id);
+   else
+   retval = filemap_report_wb_error(filp);
in flip_close() ??

... or maybe it is wrong to return this error on close().
After all, the file actually does get closed, so no error occurred.
If an application cares about EIO, it should always call fsync() before
close().

Thanks,
NeilBrown


>
> Jeff Layton (4):
>   fs: new infrastructure for writeback error handling and reporting
>   dax: set errors in mapping when writeback fails
>   buffer: set wb errors using both new and old infrastructure for now
>   ext4: wire it up to the new writeback error reporting infrastructure
>
>  Documentation/filesystems/vfs.txt | 14 +++--
>  fs/buffer.c   |  6 +++-
>  fs/dax.c  |  4 ++-
>  fs/ext4/dir.c |  1 +
>  fs/ext4/ext4.h|  1 +
>  fs/ext4/file.c|  1 +
>  fs/ext4/fsync.c   | 15 +++---
>  fs/ext4/inode.c   |  2 +-
>  fs/ext4/page-io.c |  4 +--
>  fs/open.c |  3 ++
>  include/linux/fs.h|  5 
>  mm/filemap.c  | 61 
> +++
>  12 files changed, 106 insertions(+), 11 deletions(-)
>
> -- 
> 2.9.3


signature.asc
Description: PGP signature

Re: [RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-04-02 Thread NeilBrown

On Fri, Mar 31 2017, Jeff Layton wrote:

> During LSF/MM this year, we had a discussion about the current sorry
> state of writeback error reporting, and what could be done to improve
> the situation. This patchset represents a first pass at the proposal
> I made there.
>
> It first adds a new set of writeback error tracking infrastructure to
> ensure that errors are properly stored and reported at fsync time. It
> also makes a small but significant change to ensure that writeback
> errors are reported on all file descriptors, not just on the first one
> where fsync is called.
>
> Note that this is a _very_ rough draft at this point. I did some by-hand
> testing with dm-error to ensure that it does the right thing there.
> Mostly I'm interested in early feedback at this point -- does this basic
> approach make sense?

I think that having ->wb_err_seq and returning errors to all file
descriptors is a good idea.
I don't like ->wb_err, particularly that you allow it to be set
to zero:
 +  /*
 +   * This should be called with the error code that we want to return
 +   * on fsync. Thus, it should always be <= 0.
 +   */
 +  WARN_ON(err > 0);

Why is that ??

Also I think that EIO should always over-ride ENOSPC as the possible
responses are different.  That probably means you need a separate seq
number for each, which isn't ideal.

I don't like that you need to add a 'flush' handler to every filesystem,
most of which just call
 +  return filemap_report_wb_error(file);

Could we just have
if (filp->f_op->flush)
retval = filp->f_op->flush(filp, id);
+   else
+   retval = filemap_report_wb_error(filp);
in flip_close() ??

... or maybe it is wrong to return this error on close().
After all, the file actually does get closed, so no error occurred.
If an application cares about EIO, it should always call fsync() before
close().

Thanks,
NeilBrown


>
> Jeff Layton (4):
>   fs: new infrastructure for writeback error handling and reporting
>   dax: set errors in mapping when writeback fails
>   buffer: set wb errors using both new and old infrastructure for now
>   ext4: wire it up to the new writeback error reporting infrastructure
>
>  Documentation/filesystems/vfs.txt | 14 +++--
>  fs/buffer.c   |  6 +++-
>  fs/dax.c  |  4 ++-
>  fs/ext4/dir.c |  1 +
>  fs/ext4/ext4.h|  1 +
>  fs/ext4/file.c|  1 +
>  fs/ext4/fsync.c   | 15 +++---
>  fs/ext4/inode.c   |  2 +-
>  fs/ext4/page-io.c |  4 +--
>  fs/open.c |  3 ++
>  include/linux/fs.h|  5 
>  mm/filemap.c  | 61 
> +++
>  12 files changed, 106 insertions(+), 11 deletions(-)
>
> -- 
> 2.9.3


signature.asc
Description: PGP signature

[RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-03-31 Thread Jeff Layton

During LSF/MM this year, we had a discussion about the current sorry
state of writeback error reporting, and what could be done to improve
the situation. This patchset represents a first pass at the proposal
I made there.

It first adds a new set of writeback error tracking infrastructure to
ensure that errors are properly stored and reported at fsync time. It
also makes a small but significant change to ensure that writeback
errors are reported on all file descriptors, not just on the first one
where fsync is called.

Note that this is a _very_ rough draft at this point. I did some by-hand
testing with dm-error to ensure that it does the right thing there.
Mostly I'm interested in early feedback at this point -- does this basic
approach make sense?

Jeff Layton (4):
  fs: new infrastructure for writeback error handling and reporting
  dax: set errors in mapping when writeback fails
  buffer: set wb errors using both new and old infrastructure for now
  ext4: wire it up to the new writeback error reporting infrastructure

 Documentation/filesystems/vfs.txt | 14 +++--
 fs/buffer.c   |  6 +++-
 fs/dax.c  |  4 ++-
 fs/ext4/dir.c |  1 +
 fs/ext4/ext4.h|  1 +
 fs/ext4/file.c|  1 +
 fs/ext4/fsync.c   | 15 +++---
 fs/ext4/inode.c   |  2 +-
 fs/ext4/page-io.c |  4 +--
 fs/open.c |  3 ++
 include/linux/fs.h|  5 
 mm/filemap.c  | 61 +++
 12 files changed, 106 insertions(+), 11 deletions(-)

-- 
2.9.3

[RFC PATCH 0/4] fs: introduce new writeback error tracking infrastructure and convert ext4 to use it

2017-03-31 Thread Jeff Layton

During LSF/MM this year, we had a discussion about the current sorry
state of writeback error reporting, and what could be done to improve
the situation. This patchset represents a first pass at the proposal
I made there.

It first adds a new set of writeback error tracking infrastructure to
ensure that errors are properly stored and reported at fsync time. It
also makes a small but significant change to ensure that writeback
errors are reported on all file descriptors, not just on the first one
where fsync is called.

Note that this is a _very_ rough draft at this point. I did some by-hand
testing with dm-error to ensure that it does the right thing there.
Mostly I'm interested in early feedback at this point -- does this basic
approach make sense?

Jeff Layton (4):
  fs: new infrastructure for writeback error handling and reporting
  dax: set errors in mapping when writeback fails
  buffer: set wb errors using both new and old infrastructure for now
  ext4: wire it up to the new writeback error reporting infrastructure

 Documentation/filesystems/vfs.txt | 14 +++--
 fs/buffer.c   |  6 +++-
 fs/dax.c  |  4 ++-
 fs/ext4/dir.c |  1 +
 fs/ext4/ext4.h|  1 +
 fs/ext4/file.c|  1 +
 fs/ext4/fsync.c   | 15 +++---
 fs/ext4/inode.c   |  2 +-
 fs/ext4/page-io.c |  4 +--
 fs/open.c |  3 ++
 include/linux/fs.h|  5 
 mm/filemap.c  | 61 +++
 12 files changed, 106 insertions(+), 11 deletions(-)

-- 
2.9.3

88 matches

Mail list logo