Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dan Williams
On Tue, May 3, 2016 at 8:18 PM, Dave Chinner  wrote:
> On Tue, May 03, 2016 at 10:28:15AM -0700, Dan Williams wrote:
>> On Mon, May 2, 2016 at 6:51 PM, Dave Chinner  wrote:
>> > On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
>> [..]
>> > Yes, I know, and it doesn't answer any of the questions I just
>> > asked. What you just told me is that there is something that is kept
>> > three levels of abstraction away from a filesystem. So:
>>
>> Ok, let's answer them.
>>
>> A lot of your questions seem to assume the filesystem has a leading
>> role to play with error recovery, that isn't the case with traditional
>> disk errors and we're not looking to change that situation.
>
> *cough* BTRFS
>
> New filesystems are mostly being designed with redundancy and
> recovery mechanisms built into them. Hence the high level
> /assumption/ that filesystems aren't going to play a significant
> role in error recovery for pmem storage is, well, somewhat
> disturbing

It is unfortunate is that you cite the lack of pmem enabling in btrfs
as a reason to block patches that hookup the kernel's existing error
mechanisms for the DAX case.  I expect btrfs multi-device
redundancy-management for pmem to be a more a coherent solution than
what we can achieve with single-device-filesystems + RAID.  I'm trying
not to boil the ocean in this discussion, but Iet's go ahead and rope
in btrfs-devel into this thread so we can make progress on hooking up
SIGBUS notifications for DAX errors and bypassing dax_do_io() to clear
errors.

>> The
>> filesystem can help with forensics after an error escapes the kernel
>> and is communicated to userspace, but the ability to reverse map a
>> sector to a file is just a convenience to identify potential data
>> loss.
>
> So working out what file got corrupted in your terabytes of pmem
> storage is "just a convenience"? I suspect that a rather large
> percentage of admins will disagree with you on this.

Yes, I will point them to their file system maintainer to ask about
reverse mapping support.

>> For redundancy in the DAX case I can envision DAX-aware RAID that
>> makes the potential exposure to bad blocks smaller, but it will always
>> be the case that the array can be out-of-sync / degraded and has no
>> choice but to communicate the error to userspace.  So, the answers
>> below address what we do when we are in that state, and include some
>> thoughts about follow-on enabling we can do at the DM/MD layer.
>>
>> > - What mechanism is to be used for the underlying block
>> >   device to inform the filesytem that a new bad block was
>> >   added to this list?
>>
>> The filesystem doesn't need this notification and doesn't get it today
>> from RAID.
>
> Why doesn't the filesystem need this notification? Just because we
> don't get it today from a RAID device does not mean we can't use it.

If xfs and ext4 had a use for error notification today we would hook into it.

> Indeed, think about the btrfs scrub operation - it validates
> everything on it's individual block devices, and when it finds a
> problem (e.g. a data CRC error) it notifies a different layer in the
> btrfs code that goes and works out if it can repair the problem from
> redundant copies/parity/mirrors/etc.

Yes, just like RAID, sounds like we should definitely keep that in
mind for the patch set that adds pmem support to btrfs, this isn't
that patch set.

>> It's handy for the bad block list to be available to
>> fs/dax.c and the block layer, but I don't see ext4/xfs having a role
>> to play with the list and certainly not care about "new error detected
>> events".
>
> That's very short-sighted. Just because ext4/xfs don't *currently*
> do this, it doesn't mean other filesystems (existing or new) can't
> make use of notifications, nor that ext4/XFS can't ever make use of
> it, either.

Did I say "can't ever make use of it", no, if you have a need for a
notification for xfs let's work on a notification mechanism.

>
>> For a DM/MD driver it also does not need to know about new
>> errors because it will follow the traditional disk model where errors
>> are handled on access, or discovered and scrubbed during a periodic
>> array scan.
>>
>> That said, new errors may get added to the list by having the pmem
>> driver trigger a rescan of the device whenever a latent error is
>> discovered (i.e. memcpy_from_pmem() returns -EIO).  The update of the
>> bad block list is asynchronous.  We also have a task on the todo list
>> to allow the pmem rescan action to be triggered via sysfs.
>
> IOWs, the pmem driver won't report errors to anyone who can correct
> them until an access to that bad block is made? Even if it means the
> error might go unreported and hence uncorrected for weeks or months
> because no access is made to that bad data?

RAID periodically polls for and fixes bad blocks.  The currently
implementation only polls for errors at driver load.  When we

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dan Williams
On Tue, May 3, 2016 at 8:18 PM, Dave Chinner  wrote:
> On Tue, May 03, 2016 at 10:28:15AM -0700, Dan Williams wrote:
>> On Mon, May 2, 2016 at 6:51 PM, Dave Chinner  wrote:
>> > On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
>> [..]
>> > Yes, I know, and it doesn't answer any of the questions I just
>> > asked. What you just told me is that there is something that is kept
>> > three levels of abstraction away from a filesystem. So:
>>
>> Ok, let's answer them.
>>
>> A lot of your questions seem to assume the filesystem has a leading
>> role to play with error recovery, that isn't the case with traditional
>> disk errors and we're not looking to change that situation.
>
> *cough* BTRFS
>
> New filesystems are mostly being designed with redundancy and
> recovery mechanisms built into them. Hence the high level
> /assumption/ that filesystems aren't going to play a significant
> role in error recovery for pmem storage is, well, somewhat
> disturbing

It is unfortunate is that you cite the lack of pmem enabling in btrfs
as a reason to block patches that hookup the kernel's existing error
mechanisms for the DAX case.  I expect btrfs multi-device
redundancy-management for pmem to be a more a coherent solution than
what we can achieve with single-device-filesystems + RAID.  I'm trying
not to boil the ocean in this discussion, but Iet's go ahead and rope
in btrfs-devel into this thread so we can make progress on hooking up
SIGBUS notifications for DAX errors and bypassing dax_do_io() to clear
errors.

>> The
>> filesystem can help with forensics after an error escapes the kernel
>> and is communicated to userspace, but the ability to reverse map a
>> sector to a file is just a convenience to identify potential data
>> loss.
>
> So working out what file got corrupted in your terabytes of pmem
> storage is "just a convenience"? I suspect that a rather large
> percentage of admins will disagree with you on this.

Yes, I will point them to their file system maintainer to ask about
reverse mapping support.

>> For redundancy in the DAX case I can envision DAX-aware RAID that
>> makes the potential exposure to bad blocks smaller, but it will always
>> be the case that the array can be out-of-sync / degraded and has no
>> choice but to communicate the error to userspace.  So, the answers
>> below address what we do when we are in that state, and include some
>> thoughts about follow-on enabling we can do at the DM/MD layer.
>>
>> > - What mechanism is to be used for the underlying block
>> >   device to inform the filesytem that a new bad block was
>> >   added to this list?
>>
>> The filesystem doesn't need this notification and doesn't get it today
>> from RAID.
>
> Why doesn't the filesystem need this notification? Just because we
> don't get it today from a RAID device does not mean we can't use it.

If xfs and ext4 had a use for error notification today we would hook into it.

> Indeed, think about the btrfs scrub operation - it validates
> everything on it's individual block devices, and when it finds a
> problem (e.g. a data CRC error) it notifies a different layer in the
> btrfs code that goes and works out if it can repair the problem from
> redundant copies/parity/mirrors/etc.

Yes, just like RAID, sounds like we should definitely keep that in
mind for the patch set that adds pmem support to btrfs, this isn't
that patch set.

>> It's handy for the bad block list to be available to
>> fs/dax.c and the block layer, but I don't see ext4/xfs having a role
>> to play with the list and certainly not care about "new error detected
>> events".
>
> That's very short-sighted. Just because ext4/xfs don't *currently*
> do this, it doesn't mean other filesystems (existing or new) can't
> make use of notifications, nor that ext4/XFS can't ever make use of
> it, either.

Did I say "can't ever make use of it", no, if you have a need for a
notification for xfs let's work on a notification mechanism.

>
>> For a DM/MD driver it also does not need to know about new
>> errors because it will follow the traditional disk model where errors
>> are handled on access, or discovered and scrubbed during a periodic
>> array scan.
>>
>> That said, new errors may get added to the list by having the pmem
>> driver trigger a rescan of the device whenever a latent error is
>> discovered (i.e. memcpy_from_pmem() returns -EIO).  The update of the
>> bad block list is asynchronous.  We also have a task on the todo list
>> to allow the pmem rescan action to be triggered via sysfs.
>
> IOWs, the pmem driver won't report errors to anyone who can correct
> them until an access to that bad block is made? Even if it means the
> error might go unreported and hence uncorrected for weeks or months
> because no access is made to that bad data?

RAID periodically polls for and fixes bad blocks.  The currently
implementation only polls for errors at driver load.  When we
implement userspace triggered bad blocks 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dave Chinner
On Tue, May 03, 2016 at 10:28:15AM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 6:51 PM, Dave Chinner  wrote:
> > On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
> [..]
> > Yes, I know, and it doesn't answer any of the questions I just
> > asked. What you just told me is that there is something that is kept
> > three levels of abstraction away from a filesystem. So:
> 
> Ok, let's answer them.
> 
> A lot of your questions seem to assume the filesystem has a leading
> role to play with error recovery, that isn't the case with traditional
> disk errors and we're not looking to change that situation. 

*cough* BTRFS

New filesystems are mostly being designed with redundancy and
recovery mechanisms built into them. Hence the high level
/assumption/ that filesystems aren't going to play a significant
role in error recovery for pmem storage is, well, somewhat
disturbing

> The
> filesystem can help with forensics after an error escapes the kernel
> and is communicated to userspace, but the ability to reverse map a
> sector to a file is just a convenience to identify potential data
> loss.

So working out what file got corrupted in your terabytes of pmem
storage is "just a convenience"? I suspect that a rather large
percentage of admins will disagree with you on this.

> For redundancy in the DAX case I can envision DAX-aware RAID that
> makes the potential exposure to bad blocks smaller, but it will always
> be the case that the array can be out-of-sync / degraded and has no
> choice but to communicate the error to userspace.  So, the answers
> below address what we do when we are in that state, and include some
> thoughts about follow-on enabling we can do at the DM/MD layer.
> 
> > - What mechanism is to be used for the underlying block
> >   device to inform the filesytem that a new bad block was
> >   added to this list?
> 
> The filesystem doesn't need this notification and doesn't get it today
> from RAID.

Why doesn't the filesystem need this notification? Just because we
don't get it today from a RAID device does not mean we can't use it.

Indeed, think about the btrfs scrub operation - it validates
everything on it's individual block devices, and when it finds a
problem (e.g. a data CRC error) it notifies a different layer in the
btrfs code that goes and works out if it can repair the problem from
redundant copies/parity/mirrors/etc.

> It's handy for the bad block list to be available to
> fs/dax.c and the block layer, but I don't see ext4/xfs having a role
> to play with the list and certainly not care about "new error detected
> events". 

That's very short-sighted. Just because ext4/xfs don't *currently*
do this, it doesn't mean other filesystems (existing or new) can't
make use of notifications, nor that ext4/XFS can't ever make use of
it, either.

> For a DM/MD driver it also does not need to know about new
> errors because it will follow the traditional disk model where errors
> are handled on access, or discovered and scrubbed during a periodic
> array scan.
>
> That said, new errors may get added to the list by having the pmem
> driver trigger a rescan of the device whenever a latent error is
> discovered (i.e. memcpy_from_pmem() returns -EIO).  The update of the
> bad block list is asynchronous.  We also have a task on the todo list
> to allow the pmem rescan action to be triggered via sysfs.

IOWs, the pmem driver won't report errors to anyone who can correct
them until an access to that bad block is made? Even if it means the
error might go unreported and hence uncorrected for weeks or months
because no access is made to that bad data?

> >   What context comes along with that
> >   notification?
> 
> The only notification the file system gets is -EIO on access.
> However, assuming we had a DAX-aware RAID driver what infrastructure
> would we need to prevent SIGBUS from reaching the application if we
> happened to have a redundant copy of the data?

We'd need the same infrastructure at the filesystem layer would
require if it has a redundant copy of the data. I don't know what
that is, however, because I know very little about about MCEs and
signal delivery (which is why I asked this question).

[]

> The in-kernel recovery path, assuming RAID is present, needs more
> thought especially considering the limited NMI context of a machine
> check notification and the need to trap back into driver code.

This is precisely the problem I am asking about - I know there is a
limited context, but how exactly is it limited and what can we
actually do from this context? e.g. Can we schedule recovery work on
other CPU cores and wait for it to complete in a MCE notification
handler?

> I see
> the code in fs/dax.c getting involved to translate a
> process-physical-address back to a sector, but otherwise the rest of
> the filesystem need not be involved.

More /assumptions/ about filesystems not containing or being 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dave Chinner
On Tue, May 03, 2016 at 10:28:15AM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 6:51 PM, Dave Chinner  wrote:
> > On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
> [..]
> > Yes, I know, and it doesn't answer any of the questions I just
> > asked. What you just told me is that there is something that is kept
> > three levels of abstraction away from a filesystem. So:
> 
> Ok, let's answer them.
> 
> A lot of your questions seem to assume the filesystem has a leading
> role to play with error recovery, that isn't the case with traditional
> disk errors and we're not looking to change that situation. 

*cough* BTRFS

New filesystems are mostly being designed with redundancy and
recovery mechanisms built into them. Hence the high level
/assumption/ that filesystems aren't going to play a significant
role in error recovery for pmem storage is, well, somewhat
disturbing

> The
> filesystem can help with forensics after an error escapes the kernel
> and is communicated to userspace, but the ability to reverse map a
> sector to a file is just a convenience to identify potential data
> loss.

So working out what file got corrupted in your terabytes of pmem
storage is "just a convenience"? I suspect that a rather large
percentage of admins will disagree with you on this.

> For redundancy in the DAX case I can envision DAX-aware RAID that
> makes the potential exposure to bad blocks smaller, but it will always
> be the case that the array can be out-of-sync / degraded and has no
> choice but to communicate the error to userspace.  So, the answers
> below address what we do when we are in that state, and include some
> thoughts about follow-on enabling we can do at the DM/MD layer.
> 
> > - What mechanism is to be used for the underlying block
> >   device to inform the filesytem that a new bad block was
> >   added to this list?
> 
> The filesystem doesn't need this notification and doesn't get it today
> from RAID.

Why doesn't the filesystem need this notification? Just because we
don't get it today from a RAID device does not mean we can't use it.

Indeed, think about the btrfs scrub operation - it validates
everything on it's individual block devices, and when it finds a
problem (e.g. a data CRC error) it notifies a different layer in the
btrfs code that goes and works out if it can repair the problem from
redundant copies/parity/mirrors/etc.

> It's handy for the bad block list to be available to
> fs/dax.c and the block layer, but I don't see ext4/xfs having a role
> to play with the list and certainly not care about "new error detected
> events". 

That's very short-sighted. Just because ext4/xfs don't *currently*
do this, it doesn't mean other filesystems (existing or new) can't
make use of notifications, nor that ext4/XFS can't ever make use of
it, either.

> For a DM/MD driver it also does not need to know about new
> errors because it will follow the traditional disk model where errors
> are handled on access, or discovered and scrubbed during a periodic
> array scan.
>
> That said, new errors may get added to the list by having the pmem
> driver trigger a rescan of the device whenever a latent error is
> discovered (i.e. memcpy_from_pmem() returns -EIO).  The update of the
> bad block list is asynchronous.  We also have a task on the todo list
> to allow the pmem rescan action to be triggered via sysfs.

IOWs, the pmem driver won't report errors to anyone who can correct
them until an access to that bad block is made? Even if it means the
error might go unreported and hence uncorrected for weeks or months
because no access is made to that bad data?

> >   What context comes along with that
> >   notification?
> 
> The only notification the file system gets is -EIO on access.
> However, assuming we had a DAX-aware RAID driver what infrastructure
> would we need to prevent SIGBUS from reaching the application if we
> happened to have a redundant copy of the data?

We'd need the same infrastructure at the filesystem layer would
require if it has a redundant copy of the data. I don't know what
that is, however, because I know very little about about MCEs and
signal delivery (which is why I asked this question).

[]

> The in-kernel recovery path, assuming RAID is present, needs more
> thought especially considering the limited NMI context of a machine
> check notification and the need to trap back into driver code.

This is precisely the problem I am asking about - I know there is a
limited context, but how exactly is it limited and what can we
actually do from this context? e.g. Can we schedule recovery work on
other CPU cores and wait for it to complete in a MCE notification
handler?

> I see
> the code in fs/dax.c getting involved to translate a
> process-physical-address back to a sector, but otherwise the rest of
> the filesystem need not be involved.

More /assumptions/ about filesystems not containing or being able to
recover from 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dave Chinner
On Tue, May 03, 2016 at 06:30:04PM +, Rudoff, Andy wrote:
> >
> >And when the filesystem says no because the fs devs don't want to
> >have to deal with broken apps because app devs learn that "this is a
> >go fast knob" and data integrity be damned? It's "fsync is slow so I
> >won't use it" all over again...
> ...
> >
> >And, please keep in mind: many application developers will not
> >design for pmem because they also have to support traditional
> >storage backed by page cache. If they use msync(), the app will work
> >on any storage stack, but just be much, much faster on pmem+DAX. So,
> >really, we have to make the msync()-only model work efficiently, so
> >we may as well design for that in the first place
> 
> Both of these snippets seem to be arguing that we should make msync/fsync
> more efficient.  But I don't think anyone is arguing the opposite.  Is
> someone saying we shouldn't make the msync()-only model work efficiently?

Not directly. The argument presented is "we need a flag to avoid
msync, because msync is inefficient", which is followed by "look,
here's numbers that show msync() being slow, so just give us the
flag already". Experience tells me that the moment a workaround is
in place, nobody will go back and try to fix the problem that the
workaround is mitigating.

Now we know that it's the page granularity cache flushing overhead
that causes the performance differential rather than it being caused
by using msync(), we should be looking at ways to reduce the cache
flushing overhead, not completely bypassing it.

> Said another way: the common case for DAX will be applications simply
> following the POSIX model.  open, mmap, msync...  That will work fine
> and of course we should optimize that path as much as possible.  Less
> common are latency-sensitive applications built to leverage to byte-
> addressable nature of pmem.  File systems supporting this model will
> indicate it using a new ioctl that says doing CPU cache flushes is
> sufficient to flush stores to persistence.

You keep saying this whilst ignoring the repeated comments about how
this can not be guaranteed by all filesystems, and hence apps will
not be able to depend on having such behaviour present. The only
guarantee for persistence that an app will be able to rely on is
msync().

> But I don't see how that
> direction is getting turned into an argument against msync() efficiency.

Promoting a model that works around inefficiency rather than solving
it is no different to saying you don't care about fixing the
inefficiency

I've said my piece, I'm not going to waste any more time going
around this circle again.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dave Chinner
On Tue, May 03, 2016 at 06:30:04PM +, Rudoff, Andy wrote:
> >
> >And when the filesystem says no because the fs devs don't want to
> >have to deal with broken apps because app devs learn that "this is a
> >go fast knob" and data integrity be damned? It's "fsync is slow so I
> >won't use it" all over again...
> ...
> >
> >And, please keep in mind: many application developers will not
> >design for pmem because they also have to support traditional
> >storage backed by page cache. If they use msync(), the app will work
> >on any storage stack, but just be much, much faster on pmem+DAX. So,
> >really, we have to make the msync()-only model work efficiently, so
> >we may as well design for that in the first place
> 
> Both of these snippets seem to be arguing that we should make msync/fsync
> more efficient.  But I don't think anyone is arguing the opposite.  Is
> someone saying we shouldn't make the msync()-only model work efficiently?

Not directly. The argument presented is "we need a flag to avoid
msync, because msync is inefficient", which is followed by "look,
here's numbers that show msync() being slow, so just give us the
flag already". Experience tells me that the moment a workaround is
in place, nobody will go back and try to fix the problem that the
workaround is mitigating.

Now we know that it's the page granularity cache flushing overhead
that causes the performance differential rather than it being caused
by using msync(), we should be looking at ways to reduce the cache
flushing overhead, not completely bypassing it.

> Said another way: the common case for DAX will be applications simply
> following the POSIX model.  open, mmap, msync...  That will work fine
> and of course we should optimize that path as much as possible.  Less
> common are latency-sensitive applications built to leverage to byte-
> addressable nature of pmem.  File systems supporting this model will
> indicate it using a new ioctl that says doing CPU cache flushes is
> sufficient to flush stores to persistence.

You keep saying this whilst ignoring the repeated comments about how
this can not be guaranteed by all filesystems, and hence apps will
not be able to depend on having such behaviour present. The only
guarantee for persistence that an app will be able to rely on is
msync().

> But I don't see how that
> direction is getting turned into an argument against msync() efficiency.

Promoting a model that works around inefficiency rather than solving
it is no different to saying you don't care about fixing the
inefficiency

I've said my piece, I'm not going to waste any more time going
around this circle again.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Rudoff, Andy
>
>And when the filesystem says no because the fs devs don't want to
>have to deal with broken apps because app devs learn that "this is a
>go fast knob" and data integrity be damned? It's "fsync is slow so I
>won't use it" all over again...
...
>
>And, please keep in mind: many application developers will not
>design for pmem because they also have to support traditional
>storage backed by page cache. If they use msync(), the app will work
>on any storage stack, but just be much, much faster on pmem+DAX. So,
>really, we have to make the msync()-only model work efficiently, so
>we may as well design for that in the first place

Both of these snippets seem to be arguing that we should make msync/fsync
more efficient.  But I don't think anyone is arguing the opposite.  Is
someone saying we shouldn't make the msync()-only model work efficiently?

Said another way: the common case for DAX will be applications simply
following the POSIX model.  open, mmap, msync...  That will work fine
and of course we should optimize that path as much as possible.  Less
common are latency-sensitive applications built to leverage to byte-
addressable nature of pmem.  File systems supporting this model will
indicate it using a new ioctl that says doing CPU cache flushes is
sufficient to flush stores to persistence.  But I don't see how that
direction is getting turned into an argument against msync() efficiency.

>Which brings up another point: advanced new functionality
>is going to require native pmem filesystems.

I agree there's opportunity for new filesystems (and old) to leverage
what pmem provides.  But the word "require" implies that's the only
way to go and we know that's not the case.  Using ext4+dax to map
pmem into an application allows that application to use the pmem
directly and a good number of software projects are doing exactly that.

-andy



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Rudoff, Andy
>
>And when the filesystem says no because the fs devs don't want to
>have to deal with broken apps because app devs learn that "this is a
>go fast knob" and data integrity be damned? It's "fsync is slow so I
>won't use it" all over again...
...
>
>And, please keep in mind: many application developers will not
>design for pmem because they also have to support traditional
>storage backed by page cache. If they use msync(), the app will work
>on any storage stack, but just be much, much faster on pmem+DAX. So,
>really, we have to make the msync()-only model work efficiently, so
>we may as well design for that in the first place

Both of these snippets seem to be arguing that we should make msync/fsync
more efficient.  But I don't think anyone is arguing the opposite.  Is
someone saying we shouldn't make the msync()-only model work efficiently?

Said another way: the common case for DAX will be applications simply
following the POSIX model.  open, mmap, msync...  That will work fine
and of course we should optimize that path as much as possible.  Less
common are latency-sensitive applications built to leverage to byte-
addressable nature of pmem.  File systems supporting this model will
indicate it using a new ioctl that says doing CPU cache flushes is
sufficient to flush stores to persistence.  But I don't see how that
direction is getting turned into an argument against msync() efficiency.

>Which brings up another point: advanced new functionality
>is going to require native pmem filesystems.

I agree there's opportunity for new filesystems (and old) to leverage
what pmem provides.  But the word "require" implies that's the only
way to go and we know that's not the case.  Using ext4+dax to map
pmem into an application allows that application to use the pmem
directly and a good number of software projects are doing exactly that.

-andy



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dan Williams
On Mon, May 2, 2016 at 6:51 PM, Dave Chinner  wrote:
> On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
[..]
> Yes, I know, and it doesn't answer any of the questions I just
> asked. What you just told me is that there is something that is kept
> three levels of abstraction away from a filesystem. So:

Ok, let's answer them.

A lot of your questions seem to assume the filesystem has a leading
role to play with error recovery, that isn't the case with traditional
disk errors and we're not looking to change that situation.  The
filesystem can help with forensics after an error escapes the kernel
and is communicated to userspace, but the ability to reverse map a
sector to a file is just a convenience to identify potential data
loss.

For redundancy in the DAX case I can envision DAX-aware RAID that
makes the potential exposure to bad blocks smaller, but it will always
be the case that the array can be out-of-sync / degraded and has no
choice but to communicate the error to userspace.  So, the answers
below address what we do when we are in that state, and include some
thoughts about follow-on enabling we can do at the DM/MD layer.

> - What mechanism is to be used for the underlying block
>   device to inform the filesytem that a new bad block was
>   added to this list?

The filesystem doesn't need this notification and doesn't get it today
from RAID.  It's handy for the bad block list to be available to
fs/dax.c and the block layer, but I don't see ext4/xfs having a role
to play with the list and certainly not care about "new error detected
events".  For a DM/MD driver it also does not need to know about new
errors because it will follow the traditional disk model where errors
are handled on access, or discovered and scrubbed during a periodic
array scan.

That said, new errors may get added to the list by having the pmem
driver trigger a rescan of the device whenever a latent error is
discovered (i.e. memcpy_from_pmem() returns -EIO).  The update of the
bad block list is asynchronous.  We also have a task on the todo list
to allow the pmem rescan action to be triggered via sysfs.

>   What context comes along with that
>   notification?

The only notification the file system gets is -EIO on access.
However, assuming we had a DAX-aware RAID driver what infrastructure
would we need to prevent SIGBUS from reaching the application if we
happened to have a redundant copy of the data?

One feature we've talked about for years at LSF/MM but never made any
progress on is a way for a file system to discover and query if the
storage layer can reconstruct data from redundant information.
Assuming we had such an interface there's still the matter of plumbing
a machine check fault through a physical-address-to-sector conversion
and request the block device driver to attempt to provide a redundant
copy.

The in-kernel recovery path, assuming RAID is present, needs more
thought especially considering the limited NMI context of a machine
check notification and the need to trap back into driver code.  I see
the code in fs/dax.c getting involved to translate a
process-physical-address back to a sector, but otherwise the rest of
the filesystem need not be involved.

> - how does the filesystem query the bad block list without
>   adding layering violations?

Why does the file system need to read the list?

Apologies for answering this question with a question, but these
patches don't assume the filesystem will do anything with a bad block
list.

> - when does the filesystem need to query the bad block list?
> - how will the bad block list propagate through DM/MD
>   layers?
> - how does the filesytem the map the bad block to a
>   path/file/offset without reverse mapping - does this error
>   handling interface really imply the filesystem needs to
>   implement brute force scans at notification time?

No, there is no implication that reverse mapping is a requirement.

> - Is the filesystem expectd to find the active application or
>   address_space access that triggered the bad block
>   notification to handle them correctly? (e.g. prevent a
>   page fault from failing because we can recover from the
>   error immediately)

With these patches no, but it would be nice to incrementally add that
ability.  I.e. trap machine check faults on non-anonymous memory and
send a request down the stack to recover the sector if the storage
layer has a redundant copy.  Again, fs/dax.c would need extensions to
do this coordination, but I don't foresee the filesystem getting
involved beyond that point.

> - what exactly is the filesystem supposed to do with the bad
>   block? e.g:
> - is the block persistently bad until the filesystem
>   rewrites it? Over power cycles? Will we get
>   multiple 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-03 Thread Dan Williams
On Mon, May 2, 2016 at 6:51 PM, Dave Chinner  wrote:
> On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
[..]
> Yes, I know, and it doesn't answer any of the questions I just
> asked. What you just told me is that there is something that is kept
> three levels of abstraction away from a filesystem. So:

Ok, let's answer them.

A lot of your questions seem to assume the filesystem has a leading
role to play with error recovery, that isn't the case with traditional
disk errors and we're not looking to change that situation.  The
filesystem can help with forensics after an error escapes the kernel
and is communicated to userspace, but the ability to reverse map a
sector to a file is just a convenience to identify potential data
loss.

For redundancy in the DAX case I can envision DAX-aware RAID that
makes the potential exposure to bad blocks smaller, but it will always
be the case that the array can be out-of-sync / degraded and has no
choice but to communicate the error to userspace.  So, the answers
below address what we do when we are in that state, and include some
thoughts about follow-on enabling we can do at the DM/MD layer.

> - What mechanism is to be used for the underlying block
>   device to inform the filesytem that a new bad block was
>   added to this list?

The filesystem doesn't need this notification and doesn't get it today
from RAID.  It's handy for the bad block list to be available to
fs/dax.c and the block layer, but I don't see ext4/xfs having a role
to play with the list and certainly not care about "new error detected
events".  For a DM/MD driver it also does not need to know about new
errors because it will follow the traditional disk model where errors
are handled on access, or discovered and scrubbed during a periodic
array scan.

That said, new errors may get added to the list by having the pmem
driver trigger a rescan of the device whenever a latent error is
discovered (i.e. memcpy_from_pmem() returns -EIO).  The update of the
bad block list is asynchronous.  We also have a task on the todo list
to allow the pmem rescan action to be triggered via sysfs.

>   What context comes along with that
>   notification?

The only notification the file system gets is -EIO on access.
However, assuming we had a DAX-aware RAID driver what infrastructure
would we need to prevent SIGBUS from reaching the application if we
happened to have a redundant copy of the data?

One feature we've talked about for years at LSF/MM but never made any
progress on is a way for a file system to discover and query if the
storage layer can reconstruct data from redundant information.
Assuming we had such an interface there's still the matter of plumbing
a machine check fault through a physical-address-to-sector conversion
and request the block device driver to attempt to provide a redundant
copy.

The in-kernel recovery path, assuming RAID is present, needs more
thought especially considering the limited NMI context of a machine
check notification and the need to trap back into driver code.  I see
the code in fs/dax.c getting involved to translate a
process-physical-address back to a sector, but otherwise the rest of
the filesystem need not be involved.

> - how does the filesystem query the bad block list without
>   adding layering violations?

Why does the file system need to read the list?

Apologies for answering this question with a question, but these
patches don't assume the filesystem will do anything with a bad block
list.

> - when does the filesystem need to query the bad block list?
> - how will the bad block list propagate through DM/MD
>   layers?
> - how does the filesytem the map the bad block to a
>   path/file/offset without reverse mapping - does this error
>   handling interface really imply the filesystem needs to
>   implement brute force scans at notification time?

No, there is no implication that reverse mapping is a requirement.

> - Is the filesystem expectd to find the active application or
>   address_space access that triggered the bad block
>   notification to handle them correctly? (e.g. prevent a
>   page fault from failing because we can recover from the
>   error immediately)

With these patches no, but it would be nice to incrementally add that
ability.  I.e. trap machine check faults on non-anonymous memory and
send a request down the stack to recover the sector if the storage
layer has a redundant copy.  Again, fs/dax.c would need extensions to
do this coordination, but I don't foresee the filesystem getting
involved beyond that point.

> - what exactly is the filesystem supposed to do with the bad
>   block? e.g:
> - is the block persistently bad until the filesystem
>   rewrites it? Over power cycles? Will we get
>   multiple notifications (e.g. 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Tue, May 03, 2016 at 01:26:46AM +, Rudoff, Andy wrote:
> 
> >> The takeaway is that msync() is 9-10x slower than userspace cache 
> >> management.
> >
> >An alternative viewpoint: that flushing clean cachelines is
> >extremely expensive on Intel CPUs. ;)
> >
> >i.e. Same numbers, different analysis from a different PoV, and
> >that gives a *completely different conclusion*.
> >
> >Think about it for the moment. The hardware inefficiency being
> >demonstrated could be fixed/optimised in the next hardware product
> >cycle(s) and so will eventually go away. OTOH, we'll be stuck with
> >whatever programming model we come up with for the next 30-40 years,
> >and we'll never be able to fix flaws in it because applications will
> >be depending on them. Do we really want to be stuck with a pmem
> >model that is designed around the flaws and deficiencies of ~1st
> >generation hardware?
> 
> Hi Dave,
> 
> Not sure I agree with your completely different conclusion.  (Not sure
> I completely disagree either, but please let me raise some practical
> points.)
> 
> First of all, let's say you're completely right and flushing clean
> cache lines is extremely expensive.  So your solution is to wait for
> the chip to be fixed? 

No, I'm not saying that's the solution - I'm pointing out that if
clean cache line flushing overhead is less of a problem in future,
the optimisations made now will not be necessary. However, we'll be
still stuck with the API/model that has encoded those optimisations
as a necessary thing for applications to know about and do the
correct thing with. I.e. we end up with a library of applications
that are optimised for a problem that no longer exists...

> Remember the model we're putting forward (which
> we're working on documenting, because I fully agree with the lack of
> documentation point you keep raising) requires the application to ASK
> for the file system's permission before assuming flushing from user space
> to persistence is allowed.

And when the filesystem says no because the fs devs don't want to
have to deal with broken apps because app devs learn that "this is a
go fast knob" and data integrity be damned? It's "fsync is slow so I
won't use it" all over again...

> Anyway, I doubt that flushing a clean cache line is extremely expensive.
> Remember the code is building transactions to maintain a consistent
> in-memory data structure in the face of sudden failure like powerloss.
> So it is using the flushes to create store barriers, but not the block-
> based store barriers we're used to in the storage world, but cache-line-
> sized store barriers (usually multiples of cache lines, but most commonly
> smaller than 4k of them).  So I think when you turn a cache line flush
> into an msync(), you're seeing some dirty stuff get flushed before it
> is time to flush it.  I'm not sure though, but certainly we could spend
> more time testing & measuring.

Sure, but is that what Dan was testing? I don't know - he just
presented a bunch of numbers without a description of the workload,
posting the benchmark code, etc. hence I can only *make assumptions*
about what the numbers mean.

I'm somewhat tired of having to make assumptions because nobody is
describing what they are doing sufficiently and then getting called
out for it, or having to ask lots of questions because other people
have made assumptions about how they think something is going to
work without explaining how the dots connect together. It's a waste
of everyone's time to be playing this ass-u-me game...

The fact that nobody has been able to explain the how the overall
model is supposed to work from physical error all the way out to
userspace makes me think that this is all being made up on the spot.
There are big pieces of the picture missing, and nobody seems to be
able to communicate a clear vision of the architecture we are
supposed to be discussing, let alone implementing...

> More importantly, I think it is interesting to decide what we want the
> pmem programming model to be long-term.  I think we want applications to
> just map pmem, do normal stores to it, and assume they are persistent.
> This is quite different from the 30-year-old POSIX Model where msync()
> is required.

Yes, it's different, but we still have to co-ordinate multiple
layers of persistence (i.e. metadata that references the data).

> But I think it is cleaner, easier to understand, and less
> error-prone.  So why doesn't it work that way right now?  Because we're
> finding it impractical.  Using write-through caching for pmem simply
> doesn't perform well, and depending on the platform to flush the CPU
> caches on shutdown/powerfail is not practical yet.  But I think the day
> will come when it is practical.

Right - it's also simply not practical to intercept every userspace
store to ensure the referencing metadata is also persistent, so we
still need synchronisation mechanisms to ensure that such state is
acheived.  Either that, or the entire 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Tue, May 03, 2016 at 01:26:46AM +, Rudoff, Andy wrote:
> 
> >> The takeaway is that msync() is 9-10x slower than userspace cache 
> >> management.
> >
> >An alternative viewpoint: that flushing clean cachelines is
> >extremely expensive on Intel CPUs. ;)
> >
> >i.e. Same numbers, different analysis from a different PoV, and
> >that gives a *completely different conclusion*.
> >
> >Think about it for the moment. The hardware inefficiency being
> >demonstrated could be fixed/optimised in the next hardware product
> >cycle(s) and so will eventually go away. OTOH, we'll be stuck with
> >whatever programming model we come up with for the next 30-40 years,
> >and we'll never be able to fix flaws in it because applications will
> >be depending on them. Do we really want to be stuck with a pmem
> >model that is designed around the flaws and deficiencies of ~1st
> >generation hardware?
> 
> Hi Dave,
> 
> Not sure I agree with your completely different conclusion.  (Not sure
> I completely disagree either, but please let me raise some practical
> points.)
> 
> First of all, let's say you're completely right and flushing clean
> cache lines is extremely expensive.  So your solution is to wait for
> the chip to be fixed? 

No, I'm not saying that's the solution - I'm pointing out that if
clean cache line flushing overhead is less of a problem in future,
the optimisations made now will not be necessary. However, we'll be
still stuck with the API/model that has encoded those optimisations
as a necessary thing for applications to know about and do the
correct thing with. I.e. we end up with a library of applications
that are optimised for a problem that no longer exists...

> Remember the model we're putting forward (which
> we're working on documenting, because I fully agree with the lack of
> documentation point you keep raising) requires the application to ASK
> for the file system's permission before assuming flushing from user space
> to persistence is allowed.

And when the filesystem says no because the fs devs don't want to
have to deal with broken apps because app devs learn that "this is a
go fast knob" and data integrity be damned? It's "fsync is slow so I
won't use it" all over again...

> Anyway, I doubt that flushing a clean cache line is extremely expensive.
> Remember the code is building transactions to maintain a consistent
> in-memory data structure in the face of sudden failure like powerloss.
> So it is using the flushes to create store barriers, but not the block-
> based store barriers we're used to in the storage world, but cache-line-
> sized store barriers (usually multiples of cache lines, but most commonly
> smaller than 4k of them).  So I think when you turn a cache line flush
> into an msync(), you're seeing some dirty stuff get flushed before it
> is time to flush it.  I'm not sure though, but certainly we could spend
> more time testing & measuring.

Sure, but is that what Dan was testing? I don't know - he just
presented a bunch of numbers without a description of the workload,
posting the benchmark code, etc. hence I can only *make assumptions*
about what the numbers mean.

I'm somewhat tired of having to make assumptions because nobody is
describing what they are doing sufficiently and then getting called
out for it, or having to ask lots of questions because other people
have made assumptions about how they think something is going to
work without explaining how the dots connect together. It's a waste
of everyone's time to be playing this ass-u-me game...

The fact that nobody has been able to explain the how the overall
model is supposed to work from physical error all the way out to
userspace makes me think that this is all being made up on the spot.
There are big pieces of the picture missing, and nobody seems to be
able to communicate a clear vision of the architecture we are
supposed to be discussing, let alone implementing...

> More importantly, I think it is interesting to decide what we want the
> pmem programming model to be long-term.  I think we want applications to
> just map pmem, do normal stores to it, and assume they are persistent.
> This is quite different from the 30-year-old POSIX Model where msync()
> is required.

Yes, it's different, but we still have to co-ordinate multiple
layers of persistence (i.e. metadata that references the data).

> But I think it is cleaner, easier to understand, and less
> error-prone.  So why doesn't it work that way right now?  Because we're
> finding it impractical.  Using write-through caching for pmem simply
> doesn't perform well, and depending on the platform to flush the CPU
> caches on shutdown/powerfail is not practical yet.  But I think the day
> will come when it is practical.

Right - it's also simply not practical to intercept every userspace
store to ensure the referencing metadata is also persistent, so we
still need synchronisation mechanisms to ensure that such state is
acheived.  Either that, or the entire 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 4:04 PM, Dave Chinner  wrote:
> > On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
> >> Dave Chinner  writes:
> >>
> >> > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> >> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> >> > You're assuming that only the DAX aware application accesses it's
> >> > files.  users, backup programs, data replicators, fileystem
> >> > re-organisers (e.g.  defragmenters) etc all may access the files and
> >> > they may throw errors. What then?
> >>
> >> I'm not sure how this is any different from regular storage.  If an
> >> application gets EIO, it's up to the app to decide what to do with that.
> >
> > Sure - they'll fail. But the question I'm asking is that if the
> > application that owns the data is supposed to do error recovery,
> > what happens when a 3rd party application hits an error? If that
> > consumes the error, the the app that owns the data won't ever get a
> > chance to correct the error.
> >
> > This is a minefield - a 3rd party app that swallows and clears DAX
> > based IO errors is a data corruption vector. can yo imagine if
> > *grep* did this? The model that is being promoted here effectively
> > allows this sort of behaviour - I don't really think we
> > should be architecting an error recovery strategy that has the
> > capability to go this wrong
> 
> Since when does grep write to a file on error?

That's precisely my point - it doesn't right now because there is no
onus on userspace applications to correct data errors when they are
found.

However, if the accepted model becomes "userspace needs to try to correct
errors in data automatically", the above scenario is a distinct
possiblity. I'm not saying grep will do this - I'm taking
the logical argument being presented to the extreme - but I'm sure
that there will be developers that have enough knowledge to know
they are supposed to do something with errors on pmem devices, but
not have enough knowledge to know the correct things to do.

And then the app mishandles a EINVAL error (or something like that)
and so we end up with buggy userspace apps trying to correct
errors in good data and causing data loss that way.

Do we really want to introduce a data integrity and error recovery
model where this sort of "bug" is a distinct possibly?

> >> >> > There's an implicit assumption that applications will keep redundant
> >> >> > copies of their data at the /application layer/ and be able to
> >> >> > automatically repair it?
> >>
> >> That's one way to do things.  It really depends on the application what
> >> it will do for recovery.
> >>
> >> >> > And then there's the implicit assumption that it will unlink and
> >> >> > free the entire file before writing a new copy
> >>
> >> I think Vishal was referring to restoring from backup.  cp itself will
> >> truncate the file before overwriting, iirc.
> >
> > Which version of cp? what happens if they use --sparse and the error
> > is in a zeroed region? There's so many assumptions about undefined userspace
> > environment, application and user behaviour being made here, and
> > it's all being handwaved away.
> >
> > I'm asking for this to be defined, demonstrated and documented as a
> > working model that cannot be abused and doesn't have holes the size
> > of trucks in it, not handwaving...
> 
> You lost me...  how are these patches abusing the existing semantics
> of -EIO and write to clear?

I haven't said that. I said there are assumptions about how
userspace will handle the error, but they aren't documented
anywhere. "copy a file using cp" is not a robust recovery solution -
it provides no guarantees about how the bad file and regions will be
recycled and the errors cleared. This effectively of puts it all on
the filesystems to deal with, even though you're trying to design an
error handling model that bypasses the filesystems and goes straight
to userspace.

If I can't understand how this is all supposed to work
because none of it is documented, then we have no chance that the
average admin is going to be able to understand it.

> 
> >> >> To summarize, the two cases we want to handle are:
> >> >> 1. Application has inbuilt recovery:
> >> >>   - hits badblock
> >> >>   - figures out it is able to recover the data
> >> >>   - handles SIGBUS or EIO
> >> >>   - does a (sector aligned) write() to restore the data
> >> >
> >> > The "figures out" step here is where >95% of the work we'd have to
> >> > do is. And that's in filesystem and block layer code, not
> >> > userspace, and userspace can't do that work in a signal handler.
> >> > And it  can still fall down to the second case when the application
> >> > doesn't have another copy of the data somewhere.
> >>
> >> I read that "figures out" step as the application determining whether or
> >> not it had a redundant copy.
> >
> > Another 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Mon, May 02, 2016 at 04:25:51PM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 4:04 PM, Dave Chinner  wrote:
> > On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
> >> Dave Chinner  writes:
> >>
> >> > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> >> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> >> > You're assuming that only the DAX aware application accesses it's
> >> > files.  users, backup programs, data replicators, fileystem
> >> > re-organisers (e.g.  defragmenters) etc all may access the files and
> >> > they may throw errors. What then?
> >>
> >> I'm not sure how this is any different from regular storage.  If an
> >> application gets EIO, it's up to the app to decide what to do with that.
> >
> > Sure - they'll fail. But the question I'm asking is that if the
> > application that owns the data is supposed to do error recovery,
> > what happens when a 3rd party application hits an error? If that
> > consumes the error, the the app that owns the data won't ever get a
> > chance to correct the error.
> >
> > This is a minefield - a 3rd party app that swallows and clears DAX
> > based IO errors is a data corruption vector. can yo imagine if
> > *grep* did this? The model that is being promoted here effectively
> > allows this sort of behaviour - I don't really think we
> > should be architecting an error recovery strategy that has the
> > capability to go this wrong
> 
> Since when does grep write to a file on error?

That's precisely my point - it doesn't right now because there is no
onus on userspace applications to correct data errors when they are
found.

However, if the accepted model becomes "userspace needs to try to correct
errors in data automatically", the above scenario is a distinct
possiblity. I'm not saying grep will do this - I'm taking
the logical argument being presented to the extreme - but I'm sure
that there will be developers that have enough knowledge to know
they are supposed to do something with errors on pmem devices, but
not have enough knowledge to know the correct things to do.

And then the app mishandles a EINVAL error (or something like that)
and so we end up with buggy userspace apps trying to correct
errors in good data and causing data loss that way.

Do we really want to introduce a data integrity and error recovery
model where this sort of "bug" is a distinct possibly?

> >> >> > There's an implicit assumption that applications will keep redundant
> >> >> > copies of their data at the /application layer/ and be able to
> >> >> > automatically repair it?
> >>
> >> That's one way to do things.  It really depends on the application what
> >> it will do for recovery.
> >>
> >> >> > And then there's the implicit assumption that it will unlink and
> >> >> > free the entire file before writing a new copy
> >>
> >> I think Vishal was referring to restoring from backup.  cp itself will
> >> truncate the file before overwriting, iirc.
> >
> > Which version of cp? what happens if they use --sparse and the error
> > is in a zeroed region? There's so many assumptions about undefined userspace
> > environment, application and user behaviour being made here, and
> > it's all being handwaved away.
> >
> > I'm asking for this to be defined, demonstrated and documented as a
> > working model that cannot be abused and doesn't have holes the size
> > of trucks in it, not handwaving...
> 
> You lost me...  how are these patches abusing the existing semantics
> of -EIO and write to clear?

I haven't said that. I said there are assumptions about how
userspace will handle the error, but they aren't documented
anywhere. "copy a file using cp" is not a robust recovery solution -
it provides no guarantees about how the bad file and regions will be
recycled and the errors cleared. This effectively of puts it all on
the filesystems to deal with, even though you're trying to design an
error handling model that bypasses the filesystems and goes straight
to userspace.

If I can't understand how this is all supposed to work
because none of it is documented, then we have no chance that the
average admin is going to be able to understand it.

> 
> >> >> To summarize, the two cases we want to handle are:
> >> >> 1. Application has inbuilt recovery:
> >> >>   - hits badblock
> >> >>   - figures out it is able to recover the data
> >> >>   - handles SIGBUS or EIO
> >> >>   - does a (sector aligned) write() to restore the data
> >> >
> >> > The "figures out" step here is where >95% of the work we'd have to
> >> > do is. And that's in filesystem and block layer code, not
> >> > userspace, and userspace can't do that work in a signal handler.
> >> > And it  can still fall down to the second case when the application
> >> > doesn't have another copy of the data somewhere.
> >>
> >> I read that "figures out" step as the application determining whether or
> >> not it had a redundant copy.
> >
> > Another undocumented assumption, that doesn't simplify 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Rudoff, Andy

>> The takeaway is that msync() is 9-10x slower than userspace cache management.
>
>An alternative viewpoint: that flushing clean cachelines is
>extremely expensive on Intel CPUs. ;)
>
>i.e. Same numbers, different analysis from a different PoV, and
>that gives a *completely different conclusion*.
>
>Think about it for the moment. The hardware inefficiency being
>demonstrated could be fixed/optimised in the next hardware product
>cycle(s) and so will eventually go away. OTOH, we'll be stuck with
>whatever programming model we come up with for the next 30-40 years,
>and we'll never be able to fix flaws in it because applications will
>be depending on them. Do we really want to be stuck with a pmem
>model that is designed around the flaws and deficiencies of ~1st
>generation hardware?

Hi Dave,

Not sure I agree with your completely different conclusion.  (Not sure
I completely disagree either, but please let me raise some practical
points.)

First of all, let's say you're completely right and flushing clean
cache lines is extremely expensive.  So your solution is to wait for
the chip to be fixed?  Remember the model we're putting forward (which
we're working on documenting, because I fully agree with the lack of
documentation point you keep raising) requires the application to ASK
for the file system's permission before assuming flushing from user space
to persistence is allowed.  So that doesn't stick us with 30-40 years of
a flawed model.  I don't think the model is wrong, having spent lots of
research time on it, but if I'm full of crap, all we have to do is stop
telling the app that flushing from user space is allowed and it must go
back to using msync().  This is my understanding of what Dan suggested
at LSF and this is what I'm currently writing up.  By the way, the NVM
Libraries already contain the logic to ask if flushing from user space
is allowed, falling back to msync() if not.  Currently those libraries
check for DAX mappings.  But the points you raised about metadata changes
happening during page faults made us realize we have to ask the file
system to opt-in to allowing user space flushing, so that's what we're
changing the library to do.  See, we are listening :-)

Anyway, I doubt that flushing a clean cache line is extremely expensive.
Remember the code is building transactions to maintain a consistent
in-memory data structure in the face of sudden failure like powerloss.
So it is using the flushes to create store barriers, but not the block-
based store barriers we're used to in the storage world, but cache-line-
sized store barriers (usually multiples of cache lines, but most commonly
smaller than 4k of them).  So I think when you turn a cache line flush
into an msync(), you're seeing some dirty stuff get flushed before it
is time to flush it.  I'm not sure though, but certainly we could spend
more time testing & measuring.

More importantly, I think it is interesting to decide what we want the
pmem programming model to be long-term.  I think we want applications to
just map pmem, do normal stores to it, and assume they are persistent.
This is quite different from the 30-year-old POSIX Model where msync()
is required.  But I think it is cleaner, easier to understand, and less
error-prone.  So why doesn't it work that way right now?  Because we're
finding it impractical.  Using write-through caching for pmem simply
doesn't perform well, and depending on the platform to flush the CPU
caches on shutdown/powerfail is not practical yet.  But I think the day
will come when it is practical.

So given that long-term target, the idea is for an application to ask if
the msync() calls are required, or if just flushing the CPU caches is
sufficient for persistence.  Then, we're also adding an ACPI property
that allows SW to discover if the caches are flushed automatically
on shutdown/powerloss.  Initially that will only be true for custom
platforms, but hopefully it can be available more broadly in the future.
The result will be that the programming model gets simpler as more and
more hardware requires less explicit flushing.

Now I'll go back to writing up the big picture for this programming
model so I can ask you for comments on that as well...


-andy


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Rudoff, Andy

>> The takeaway is that msync() is 9-10x slower than userspace cache management.
>
>An alternative viewpoint: that flushing clean cachelines is
>extremely expensive on Intel CPUs. ;)
>
>i.e. Same numbers, different analysis from a different PoV, and
>that gives a *completely different conclusion*.
>
>Think about it for the moment. The hardware inefficiency being
>demonstrated could be fixed/optimised in the next hardware product
>cycle(s) and so will eventually go away. OTOH, we'll be stuck with
>whatever programming model we come up with for the next 30-40 years,
>and we'll never be able to fix flaws in it because applications will
>be depending on them. Do we really want to be stuck with a pmem
>model that is designed around the flaws and deficiencies of ~1st
>generation hardware?

Hi Dave,

Not sure I agree with your completely different conclusion.  (Not sure
I completely disagree either, but please let me raise some practical
points.)

First of all, let's say you're completely right and flushing clean
cache lines is extremely expensive.  So your solution is to wait for
the chip to be fixed?  Remember the model we're putting forward (which
we're working on documenting, because I fully agree with the lack of
documentation point you keep raising) requires the application to ASK
for the file system's permission before assuming flushing from user space
to persistence is allowed.  So that doesn't stick us with 30-40 years of
a flawed model.  I don't think the model is wrong, having spent lots of
research time on it, but if I'm full of crap, all we have to do is stop
telling the app that flushing from user space is allowed and it must go
back to using msync().  This is my understanding of what Dan suggested
at LSF and this is what I'm currently writing up.  By the way, the NVM
Libraries already contain the logic to ask if flushing from user space
is allowed, falling back to msync() if not.  Currently those libraries
check for DAX mappings.  But the points you raised about metadata changes
happening during page faults made us realize we have to ask the file
system to opt-in to allowing user space flushing, so that's what we're
changing the library to do.  See, we are listening :-)

Anyway, I doubt that flushing a clean cache line is extremely expensive.
Remember the code is building transactions to maintain a consistent
in-memory data structure in the face of sudden failure like powerloss.
So it is using the flushes to create store barriers, but not the block-
based store barriers we're used to in the storage world, but cache-line-
sized store barriers (usually multiples of cache lines, but most commonly
smaller than 4k of them).  So I think when you turn a cache line flush
into an msync(), you're seeing some dirty stuff get flushed before it
is time to flush it.  I'm not sure though, but certainly we could spend
more time testing & measuring.

More importantly, I think it is interesting to decide what we want the
pmem programming model to be long-term.  I think we want applications to
just map pmem, do normal stores to it, and assume they are persistent.
This is quite different from the 30-year-old POSIX Model where msync()
is required.  But I think it is cleaner, easier to understand, and less
error-prone.  So why doesn't it work that way right now?  Because we're
finding it impractical.  Using write-through caching for pmem simply
doesn't perform well, and depending on the platform to flush the CPU
caches on shutdown/powerfail is not practical yet.  But I think the day
will come when it is practical.

So given that long-term target, the idea is for an application to ask if
the msync() calls are required, or if just flushing the CPU caches is
sufficient for persistence.  Then, we're also adding an ACPI property
that allows SW to discover if the caches are flushed automatically
on shutdown/powerloss.  Initially that will only be true for custom
platforms, but hopefully it can be available more broadly in the future.
The result will be that the programming model gets simpler as more and
more hardware requires less explicit flushing.

Now I'll go back to writing up the big picture for this programming
model so I can ask you for comments on that as well...


-andy


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Mon, May 02, 2016 at 10:53:25AM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 8:18 AM, Jeff Moyer  wrote:
> > Dave Chinner  writes:
> [..]
> >> We need some form of redundancy and correction in the PMEM stack to
> >> prevent single sector errors from taking down services until an
> >> administrator can correct the problem. I'm trying to understand
> >> where this is supposed to fit into the picture - at this point I
> >> really don't think userspace applications are going to be able to do
> >> this reliably
> >
> > Not all storage is configured into a RAID volume, and in some instances,
> > the application is better positioned to recover the data (gluster/ceph,
> > for example).  It really comes down to whether applications or libraries
> > will want to implement redundancy themselves in order to get a bump in
> > performance by not going through the kernel.  And I think I know what
> > your opinion is on that front.  :-)
> >
> > Speaking of which, did you see the numbers Dan shared at LSF on how much
> > overhead there is in calling into the kernel for syncing?  Dan, can/did
> > you publish that spreadsheet somewhere?
> 
> Here it is:
> 
> https://docs.google.com/spreadsheets/d/1pwr9psy6vtB9DOsc2bUdXevJRz5Guf6laZ4DaZlkhoo/edit?usp=sharing
> 
> On the "Filtered" tab I have some of the comparisons where:

Those numbers are really wacky - the inconsistent decimal place
representation makes it really, really hard to read the differences
in orders of magnitude, too. Let's take the first numbers - noop, 64
byte ops are:

threads ops/s
190M
2   310M
465M
8   175M
16  426M

Why aren't these linear? And if the test is not running in an
environment where these are controlled and linear, how valid are the
rest of the tests and hence the comparison.

> noop => don't call msync and don't flush caches in userspace
> 
> persist => cache flushing only in userspace and only on individual cache lines

So these look a lot more linear than the no-op behaviour, so I'll
just ignore the no-op results for now.

> persist_4k => cache flushing only in userspace, but flushing is
> performed in 4K aligned units

Urg, your "vs persist" percentages are all wrong. You can't have a
"-1000%" difference, you have "persist 4k" running at 10% of the
speed of "persist".

So, with that in mind, the "persist_4k" speed is:

 ops/s  single thread
Sizevs "persist"4k flush rate
  64 10% 834k
 128 13% 849k
 256 15% 410k(one off variation?)
 512 20% 860k
1024 25% 850k
2048 50% 840k
4096none 836k
8192none 410k

What we see here is that the CPU(s) can flush the 4k pages at a rate
of roughly 850,000 flushes/s, whilst the 64 byte flush rate is
around 8.8M flushes/s.  This is clearly demonstrated in the numbers
- as the dirty object size approaches the cache flush granularity,
the speed approaches single cacheline flush granularity speed.

Comparing 4k vs 64b flushes, we have 63 clean cache line flushes
taking roughly the same time as 9 dirty cache line flushes. Nice
numbers - that means a clean cache line flush has ~14% of the
overhead of dirty cache line flush. Seems rather high - it's tens of
CPU cycles to determine that the flush is a no-op for that
cacheline.

Fixing this seems like a hardware optimisation issue to me, but I
still have to question how many applications are going to have such
fine-grained random synchronous memory writes that this actually
matters in practice? If we are doing such small writes across
multiple different 4k pages, then TLB overhead for all the page
faults is going to be as much of an issue as 4k cache flushes...

> msync => same granularity flushing as the 'persist' case, but the
> kernel internally promotes this to a 4K sized / aligned flush

So you're calling msync for every modification that is made? What
application needs to do that? Anyway, page flush rates paint an
interesting picture:

single threadversus
Size4k flush rate   persist_4k
  64 655k78%
 128 655k81%
 256 670k   163%  (* persist 4k number low) 
 512 681k79%
1024 666k78%
2048 650k77%
4096 652k78%
8192 390k95%

msync adds relatively little overhead (~20% extra overhead) compared
to the performance loss from the 4k flush granularity change. And
given this appears to be a worst case test scenario (and I'm sure
msync could be improved), I don't think this demonstrates a problem
with using msync.

IMO, these numbers don't support the argument that the *msync
model* for data integrity 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Mon, May 02, 2016 at 10:53:25AM -0700, Dan Williams wrote:
> On Mon, May 2, 2016 at 8:18 AM, Jeff Moyer  wrote:
> > Dave Chinner  writes:
> [..]
> >> We need some form of redundancy and correction in the PMEM stack to
> >> prevent single sector errors from taking down services until an
> >> administrator can correct the problem. I'm trying to understand
> >> where this is supposed to fit into the picture - at this point I
> >> really don't think userspace applications are going to be able to do
> >> this reliably
> >
> > Not all storage is configured into a RAID volume, and in some instances,
> > the application is better positioned to recover the data (gluster/ceph,
> > for example).  It really comes down to whether applications or libraries
> > will want to implement redundancy themselves in order to get a bump in
> > performance by not going through the kernel.  And I think I know what
> > your opinion is on that front.  :-)
> >
> > Speaking of which, did you see the numbers Dan shared at LSF on how much
> > overhead there is in calling into the kernel for syncing?  Dan, can/did
> > you publish that spreadsheet somewhere?
> 
> Here it is:
> 
> https://docs.google.com/spreadsheets/d/1pwr9psy6vtB9DOsc2bUdXevJRz5Guf6laZ4DaZlkhoo/edit?usp=sharing
> 
> On the "Filtered" tab I have some of the comparisons where:

Those numbers are really wacky - the inconsistent decimal place
representation makes it really, really hard to read the differences
in orders of magnitude, too. Let's take the first numbers - noop, 64
byte ops are:

threads ops/s
190M
2   310M
465M
8   175M
16  426M

Why aren't these linear? And if the test is not running in an
environment where these are controlled and linear, how valid are the
rest of the tests and hence the comparison.

> noop => don't call msync and don't flush caches in userspace
> 
> persist => cache flushing only in userspace and only on individual cache lines

So these look a lot more linear than the no-op behaviour, so I'll
just ignore the no-op results for now.

> persist_4k => cache flushing only in userspace, but flushing is
> performed in 4K aligned units

Urg, your "vs persist" percentages are all wrong. You can't have a
"-1000%" difference, you have "persist 4k" running at 10% of the
speed of "persist".

So, with that in mind, the "persist_4k" speed is:

 ops/s  single thread
Sizevs "persist"4k flush rate
  64 10% 834k
 128 13% 849k
 256 15% 410k(one off variation?)
 512 20% 860k
1024 25% 850k
2048 50% 840k
4096none 836k
8192none 410k

What we see here is that the CPU(s) can flush the 4k pages at a rate
of roughly 850,000 flushes/s, whilst the 64 byte flush rate is
around 8.8M flushes/s.  This is clearly demonstrated in the numbers
- as the dirty object size approaches the cache flush granularity,
the speed approaches single cacheline flush granularity speed.

Comparing 4k vs 64b flushes, we have 63 clean cache line flushes
taking roughly the same time as 9 dirty cache line flushes. Nice
numbers - that means a clean cache line flush has ~14% of the
overhead of dirty cache line flush. Seems rather high - it's tens of
CPU cycles to determine that the flush is a no-op for that
cacheline.

Fixing this seems like a hardware optimisation issue to me, but I
still have to question how many applications are going to have such
fine-grained random synchronous memory writes that this actually
matters in practice? If we are doing such small writes across
multiple different 4k pages, then TLB overhead for all the page
faults is going to be as much of an issue as 4k cache flushes...

> msync => same granularity flushing as the 'persist' case, but the
> kernel internally promotes this to a 4K sized / aligned flush

So you're calling msync for every modification that is made? What
application needs to do that? Anyway, page flush rates paint an
interesting picture:

single threadversus
Size4k flush rate   persist_4k
  64 655k78%
 128 655k81%
 256 670k   163%  (* persist 4k number low) 
 512 681k79%
1024 666k78%
2048 650k77%
4096 652k78%
8192 390k95%

msync adds relatively little overhead (~20% extra overhead) compared
to the performance loss from the 4k flush granularity change. And
given this appears to be a worst case test scenario (and I'm sure
msync could be improved), I don't think this demonstrates a problem
with using msync.

IMO, these numbers don't support the argument that the *msync
model* for data integrity for DAX is flawed, unworkable, or too

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dan Williams
On Mon, May 2, 2016 at 4:04 PM, Dave Chinner  wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
>> Dave Chinner  writes:
>>
>> > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
>> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
>> > You're assuming that only the DAX aware application accesses it's
>> > files.  users, backup programs, data replicators, fileystem
>> > re-organisers (e.g.  defragmenters) etc all may access the files and
>> > they may throw errors. What then?
>>
>> I'm not sure how this is any different from regular storage.  If an
>> application gets EIO, it's up to the app to decide what to do with that.
>
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the the app that owns the data won't ever get a
> chance to correct the error.
>
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can yo imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong

Since when does grep write to a file on error?

>
>> >> > Where does the application find the data that was lost to be able to
>> >> > rewrite it?
>> >>
>> >> The data that was lost is gone -- this assumes the application has some
>> >> ability to recover using a journal/log or other redundancy - yes, at the
>> >> application layer. If it doesn't have this sort of capability, the only
>> >> option is to restore files from a backup/mirror.
>> >
>> > So the architecture has a built in assumption that only userspace
>> > can handle data loss?
>>
>> Remember that the proposed programming model completely bypasses the
>> kernel, so yes, it is expected that user-space will have to deal with
>> the problem.
>
> No, it doesn't completely bypass the kernel - the kernel is the
> infrastructure that catches the errors in the first place, and it
> owns and controls all the metadata that corresponds to the physical
> location of that error. The only thing the kernel doesn't own is the
> *contents* of that location.
>
>> > What about filesytsems like NOVA, that use log structured design to
>> > provide DAX w/ update atomicity and can potentially also provide
>> > redundancy/repair through the same mechanisms? Won't pmem native
>> > filesystems with built in data protection features like this remove
>> > the need for adding all this to userspace applications?
>>
>> I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
>> with this for existing file systems, right?
>
> Yes, but that misses my point that it seems that the design is only
> focussed on userspace and existing filesystems and there is no
> consideration of kernel side functionality that could do transparent
> recovery
>
>> > If so, shouldn't that be the focus of development rahter than
>> > placing the burden on userspace apps to handle storage repair
>> > situations?
>>
>> It really depends on the programming model.  In the model Vishal is
>> talking about, either the applications themselves or the libraries they
>> link to are expected to implement the redundancies where necessary.
>
> IOWs, filesystems no longer have any control over data integrity.
> Yet it's the filesystem developers who will still be responsible for
> data integrity and when the filesystem has a data coruption event
> we'll get blamed and the filesystem gets a bad name, even though
> it's entirely the applications fault. We've seen this time and time
> again - application developers cannot be trusted to guarantee data
> integrity. yes, some apps will be fine, but do you really expect
> application devs that refuse to use fsync because it's too slow are
> going to have a different approach to integrity when it comes to
> DAX?

Yes, completely agree.  The applications that will implement competent
error recovery with these mechanisms will be vanishingly small, and
there is definite room for a kernel data-redundancy solution that
builds on these patches.

>
>> >> > There's an implicit assumption that applications will keep redundant
>> >> > copies of their data at the /application layer/ and be able to
>> >> > automatically repair it?
>>
>> That's one way to do things.  It really depends on the application what
>> it will do for recovery.
>>
>> >> > And then there's the implicit assumption that it will unlink and
>> >> > free the entire file before writing a new copy
>>
>> I think Vishal was referring to restoring from backup.  cp itself will
>> truncate the file before overwriting, iirc.
>
> Which version of cp? what happens if they use --sparse and the error
> is in a zeroed region? There's so many assumptions 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dan Williams
On Mon, May 2, 2016 at 4:04 PM, Dave Chinner  wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
>> Dave Chinner  writes:
>>
>> > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
>> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
>> > You're assuming that only the DAX aware application accesses it's
>> > files.  users, backup programs, data replicators, fileystem
>> > re-organisers (e.g.  defragmenters) etc all may access the files and
>> > they may throw errors. What then?
>>
>> I'm not sure how this is any different from regular storage.  If an
>> application gets EIO, it's up to the app to decide what to do with that.
>
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the the app that owns the data won't ever get a
> chance to correct the error.
>
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can yo imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong

Since when does grep write to a file on error?

>
>> >> > Where does the application find the data that was lost to be able to
>> >> > rewrite it?
>> >>
>> >> The data that was lost is gone -- this assumes the application has some
>> >> ability to recover using a journal/log or other redundancy - yes, at the
>> >> application layer. If it doesn't have this sort of capability, the only
>> >> option is to restore files from a backup/mirror.
>> >
>> > So the architecture has a built in assumption that only userspace
>> > can handle data loss?
>>
>> Remember that the proposed programming model completely bypasses the
>> kernel, so yes, it is expected that user-space will have to deal with
>> the problem.
>
> No, it doesn't completely bypass the kernel - the kernel is the
> infrastructure that catches the errors in the first place, and it
> owns and controls all the metadata that corresponds to the physical
> location of that error. The only thing the kernel doesn't own is the
> *contents* of that location.
>
>> > What about filesytsems like NOVA, that use log structured design to
>> > provide DAX w/ update atomicity and can potentially also provide
>> > redundancy/repair through the same mechanisms? Won't pmem native
>> > filesystems with built in data protection features like this remove
>> > the need for adding all this to userspace applications?
>>
>> I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
>> with this for existing file systems, right?
>
> Yes, but that misses my point that it seems that the design is only
> focussed on userspace and existing filesystems and there is no
> consideration of kernel side functionality that could do transparent
> recovery
>
>> > If so, shouldn't that be the focus of development rahter than
>> > placing the burden on userspace apps to handle storage repair
>> > situations?
>>
>> It really depends on the programming model.  In the model Vishal is
>> talking about, either the applications themselves or the libraries they
>> link to are expected to implement the redundancies where necessary.
>
> IOWs, filesystems no longer have any control over data integrity.
> Yet it's the filesystem developers who will still be responsible for
> data integrity and when the filesystem has a data coruption event
> we'll get blamed and the filesystem gets a bad name, even though
> it's entirely the applications fault. We've seen this time and time
> again - application developers cannot be trusted to guarantee data
> integrity. yes, some apps will be fine, but do you really expect
> application devs that refuse to use fsync because it's too slow are
> going to have a different approach to integrity when it comes to
> DAX?

Yes, completely agree.  The applications that will implement competent
error recovery with these mechanisms will be vanishingly small, and
there is definite room for a kernel data-redundancy solution that
builds on these patches.

>
>> >> > There's an implicit assumption that applications will keep redundant
>> >> > copies of their data at the /application layer/ and be able to
>> >> > automatically repair it?
>>
>> That's one way to do things.  It really depends on the application what
>> it will do for recovery.
>>
>> >> > And then there's the implicit assumption that it will unlink and
>> >> > free the entire file before writing a new copy
>>
>> I think Vishal was referring to restoring from backup.  cp itself will
>> truncate the file before overwriting, iirc.
>
> Which version of cp? what happens if they use --sparse and the error
> is in a zeroed region? There's so many assumptions about undefined userspace
> environment, 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Verma, Vishal L
On Tue, 2016-05-03 at 09:04 +1000, Dave Chinner wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
> > 
> > Dave Chinner  writes:
> > 
> > > 
> > > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> > > > 
> > > > On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> > > You're assuming that only the DAX aware application accesses it's
> > > files.  users, backup programs, data replicators, fileystem
> > > re-organisers (e.g.  defragmenters) etc all may access the files
> > > and
> > > they may throw errors. What then?
> > I'm not sure how this is any different from regular storage.  If an
> > application gets EIO, it's up to the app to decide what to do with
> > that.
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the the app that owns the data won't ever get a
> chance to correct the error.
> 
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can yo imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong
> 

Just to address this bit - No. Any number of backup/3rd party
application can hit the error and _fail_ but surely they won't try to
_write_ the bad location? Only a write to the bad sector will clear it
in this model - and until such time, all reads will just keep erroring
out. This works for DAX/mmap based reads/writes too - mmap-stores
won't/can't clear errors - you have to go through the block path, and in
the altest version of my patch set, that has to be explicitly through
O_DIRECT.

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Verma, Vishal L
On Tue, 2016-05-03 at 09:04 +1000, Dave Chinner wrote:
> On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
> > 
> > Dave Chinner  writes:
> > 
> > > 
> > > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> > > > 
> > > > On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> > > You're assuming that only the DAX aware application accesses it's
> > > files.  users, backup programs, data replicators, fileystem
> > > re-organisers (e.g.  defragmenters) etc all may access the files
> > > and
> > > they may throw errors. What then?
> > I'm not sure how this is any different from regular storage.  If an
> > application gets EIO, it's up to the app to decide what to do with
> > that.
> Sure - they'll fail. But the question I'm asking is that if the
> application that owns the data is supposed to do error recovery,
> what happens when a 3rd party application hits an error? If that
> consumes the error, the the app that owns the data won't ever get a
> chance to correct the error.
> 
> This is a minefield - a 3rd party app that swallows and clears DAX
> based IO errors is a data corruption vector. can yo imagine if
> *grep* did this? The model that is being promoted here effectively
> allows this sort of behaviour - I don't really think we
> should be architecting an error recovery strategy that has the
> capability to go this wrong
> 

Just to address this bit - No. Any number of backup/3rd party
application can hit the error and _fail_ but surely they won't try to
_write_ the bad location? Only a write to the bad sector will clear it
in this model - and until such time, all reads will just keep erroring
out. This works for DAX/mmap based reads/writes too - mmap-stores
won't/can't clear errors - you have to go through the block path, and in
the altest version of my patch set, that has to be explicitly through
O_DIRECT.

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
> Dave Chinner  writes:
> 
> > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> > You're assuming that only the DAX aware application accesses it's
> > files.  users, backup programs, data replicators, fileystem
> > re-organisers (e.g.  defragmenters) etc all may access the files and
> > they may throw errors. What then?
> 
> I'm not sure how this is any different from regular storage.  If an
> application gets EIO, it's up to the app to decide what to do with that.

Sure - they'll fail. But the question I'm asking is that if the
application that owns the data is supposed to do error recovery,
what happens when a 3rd party application hits an error? If that
consumes the error, the the app that owns the data won't ever get a
chance to correct the error.

This is a minefield - a 3rd party app that swallows and clears DAX
based IO errors is a data corruption vector. can yo imagine if
*grep* did this? The model that is being promoted here effectively
allows this sort of behaviour - I don't really think we
should be architecting an error recovery strategy that has the
capability to go this wrong

> >> > Where does the application find the data that was lost to be able to
> >> > rewrite it?
> >> 
> >> The data that was lost is gone -- this assumes the application has some
> >> ability to recover using a journal/log or other redundancy - yes, at the
> >> application layer. If it doesn't have this sort of capability, the only
> >> option is to restore files from a backup/mirror.
> >
> > So the architecture has a built in assumption that only userspace
> > can handle data loss?
> 
> Remember that the proposed programming model completely bypasses the
> kernel, so yes, it is expected that user-space will have to deal with
> the problem.

No, it doesn't completely bypass the kernel - the kernel is the
infrastructure that catches the errors in the first place, and it
owns and controls all the metadata that corresponds to the physical
location of that error. The only thing the kernel doesn't own is the
*contents* of that location.

> > What about filesytsems like NOVA, that use log structured design to
> > provide DAX w/ update atomicity and can potentially also provide
> > redundancy/repair through the same mechanisms? Won't pmem native
> > filesystems with built in data protection features like this remove
> > the need for adding all this to userspace applications?
> 
> I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
> with this for existing file systems, right?

Yes, but that misses my point that it seems that the design is only
focussed on userspace and existing filesystems and there is no
consideration of kernel side functionality that could do transparent
recovery

> > If so, shouldn't that be the focus of development rahter than
> > placing the burden on userspace apps to handle storage repair
> > situations?
> 
> It really depends on the programming model.  In the model Vishal is
> talking about, either the applications themselves or the libraries they
> link to are expected to implement the redundancies where necessary.

IOWs, filesystems no longer have any control over data integrity.
Yet it's the filesystem developers who will still be responsible for
data integrity and when the filesystem has a data coruption event
we'll get blamed and the filesystem gets a bad name, even though
it's entirely the applications fault. We've seen this time and time
again - application developers cannot be trusted to guarantee data
integrity. yes, some apps will be fine, but do you really expect
application devs that refuse to use fsync because it's too slow are
going to have a different approach to integrity when it comes to
DAX?

> >> > There's an implicit assumption that applications will keep redundant
> >> > copies of their data at the /application layer/ and be able to
> >> > automatically repair it?
> 
> That's one way to do things.  It really depends on the application what
> it will do for recovery.
> 
> >> > And then there's the implicit assumption that it will unlink and
> >> > free the entire file before writing a new copy
> 
> I think Vishal was referring to restoring from backup.  cp itself will
> truncate the file before overwriting, iirc.

Which version of cp? what happens if they use --sparse and the error
is in a zeroed region? There's so many assumptions about undefined userspace
environment, application and user behaviour being made here, and
it's all being handwaved away.

I'm asking for this to be defined, demonstrated and documented as a
working model that cannot be abused and doesn't have holes the size
of trucks in it, not handwaving...

> >> To summarize, the two cases we want to handle are:
> >> 1. Application has inbuilt recovery:
> >>   - hits badblock
> >>   - figures out it is able to recover the data
> >>   

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dave Chinner
On Mon, May 02, 2016 at 11:18:36AM -0400, Jeff Moyer wrote:
> Dave Chinner  writes:
> 
> > On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> >> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> > You're assuming that only the DAX aware application accesses it's
> > files.  users, backup programs, data replicators, fileystem
> > re-organisers (e.g.  defragmenters) etc all may access the files and
> > they may throw errors. What then?
> 
> I'm not sure how this is any different from regular storage.  If an
> application gets EIO, it's up to the app to decide what to do with that.

Sure - they'll fail. But the question I'm asking is that if the
application that owns the data is supposed to do error recovery,
what happens when a 3rd party application hits an error? If that
consumes the error, the the app that owns the data won't ever get a
chance to correct the error.

This is a minefield - a 3rd party app that swallows and clears DAX
based IO errors is a data corruption vector. can yo imagine if
*grep* did this? The model that is being promoted here effectively
allows this sort of behaviour - I don't really think we
should be architecting an error recovery strategy that has the
capability to go this wrong

> >> > Where does the application find the data that was lost to be able to
> >> > rewrite it?
> >> 
> >> The data that was lost is gone -- this assumes the application has some
> >> ability to recover using a journal/log or other redundancy - yes, at the
> >> application layer. If it doesn't have this sort of capability, the only
> >> option is to restore files from a backup/mirror.
> >
> > So the architecture has a built in assumption that only userspace
> > can handle data loss?
> 
> Remember that the proposed programming model completely bypasses the
> kernel, so yes, it is expected that user-space will have to deal with
> the problem.

No, it doesn't completely bypass the kernel - the kernel is the
infrastructure that catches the errors in the first place, and it
owns and controls all the metadata that corresponds to the physical
location of that error. The only thing the kernel doesn't own is the
*contents* of that location.

> > What about filesytsems like NOVA, that use log structured design to
> > provide DAX w/ update atomicity and can potentially also provide
> > redundancy/repair through the same mechanisms? Won't pmem native
> > filesystems with built in data protection features like this remove
> > the need for adding all this to userspace applications?
> 
> I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
> with this for existing file systems, right?

Yes, but that misses my point that it seems that the design is only
focussed on userspace and existing filesystems and there is no
consideration of kernel side functionality that could do transparent
recovery

> > If so, shouldn't that be the focus of development rahter than
> > placing the burden on userspace apps to handle storage repair
> > situations?
> 
> It really depends on the programming model.  In the model Vishal is
> talking about, either the applications themselves or the libraries they
> link to are expected to implement the redundancies where necessary.

IOWs, filesystems no longer have any control over data integrity.
Yet it's the filesystem developers who will still be responsible for
data integrity and when the filesystem has a data coruption event
we'll get blamed and the filesystem gets a bad name, even though
it's entirely the applications fault. We've seen this time and time
again - application developers cannot be trusted to guarantee data
integrity. yes, some apps will be fine, but do you really expect
application devs that refuse to use fsync because it's too slow are
going to have a different approach to integrity when it comes to
DAX?

> >> > There's an implicit assumption that applications will keep redundant
> >> > copies of their data at the /application layer/ and be able to
> >> > automatically repair it?
> 
> That's one way to do things.  It really depends on the application what
> it will do for recovery.
> 
> >> > And then there's the implicit assumption that it will unlink and
> >> > free the entire file before writing a new copy
> 
> I think Vishal was referring to restoring from backup.  cp itself will
> truncate the file before overwriting, iirc.

Which version of cp? what happens if they use --sparse and the error
is in a zeroed region? There's so many assumptions about undefined userspace
environment, application and user behaviour being made here, and
it's all being handwaved away.

I'm asking for this to be defined, demonstrated and documented as a
working model that cannot be abused and doesn't have holes the size
of trucks in it, not handwaving...

> >> To summarize, the two cases we want to handle are:
> >> 1. Application has inbuilt recovery:
> >>   - hits badblock
> >>   - figures out it is able to recover the data
> >>   - handles SIGBUS or 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dan Williams
On Mon, May 2, 2016 at 8:18 AM, Jeff Moyer  wrote:
> Dave Chinner  writes:
[..]
>> We need some form of redundancy and correction in the PMEM stack to
>> prevent single sector errors from taking down services until an
>> administrator can correct the problem. I'm trying to understand
>> where this is supposed to fit into the picture - at this point I
>> really don't think userspace applications are going to be able to do
>> this reliably
>
> Not all storage is configured into a RAID volume, and in some instances,
> the application is better positioned to recover the data (gluster/ceph,
> for example).  It really comes down to whether applications or libraries
> will want to implement redundancy themselves in order to get a bump in
> performance by not going through the kernel.  And I think I know what
> your opinion is on that front.  :-)
>
> Speaking of which, did you see the numbers Dan shared at LSF on how much
> overhead there is in calling into the kernel for syncing?  Dan, can/did
> you publish that spreadsheet somewhere?

Here it is:

https://docs.google.com/spreadsheets/d/1pwr9psy6vtB9DOsc2bUdXevJRz5Guf6laZ4DaZlkhoo/edit?usp=sharing

On the "Filtered" tab I have some of the comparisons where:

noop => don't call msync and don't flush caches in userspace

persist => cache flushing only in userspace and only on individual cache lines

persist_4k => cache flushing only in userspace, but flushing is
performed in 4K aligned units

msync => same granularity flushing as the 'persist' case, but the
kernel internally promotes this to a 4K sized / aligned flush

msync_0 => synthetic case where msync() returns immediately and does
no other work

The takeaway is that msync() is 9-10x slower than userspace cache management.

Let me know if there are any questions and I can add an NVML developer
to this thread...


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Dan Williams
On Mon, May 2, 2016 at 8:18 AM, Jeff Moyer  wrote:
> Dave Chinner  writes:
[..]
>> We need some form of redundancy and correction in the PMEM stack to
>> prevent single sector errors from taking down services until an
>> administrator can correct the problem. I'm trying to understand
>> where this is supposed to fit into the picture - at this point I
>> really don't think userspace applications are going to be able to do
>> this reliably
>
> Not all storage is configured into a RAID volume, and in some instances,
> the application is better positioned to recover the data (gluster/ceph,
> for example).  It really comes down to whether applications or libraries
> will want to implement redundancy themselves in order to get a bump in
> performance by not going through the kernel.  And I think I know what
> your opinion is on that front.  :-)
>
> Speaking of which, did you see the numbers Dan shared at LSF on how much
> overhead there is in calling into the kernel for syncing?  Dan, can/did
> you publish that spreadsheet somewhere?

Here it is:

https://docs.google.com/spreadsheets/d/1pwr9psy6vtB9DOsc2bUdXevJRz5Guf6laZ4DaZlkhoo/edit?usp=sharing

On the "Filtered" tab I have some of the comparisons where:

noop => don't call msync and don't flush caches in userspace

persist => cache flushing only in userspace and only on individual cache lines

persist_4k => cache flushing only in userspace, but flushing is
performed in 4K aligned units

msync => same granularity flushing as the 'persist' case, but the
kernel internally promotes this to a 4K sized / aligned flush

msync_0 => synthetic case where msync() returns immediately and does
no other work

The takeaway is that msync() is 9-10x slower than userspace cache management.

Let me know if there are any questions and I can add an NVML developer
to this thread...


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Jeff Moyer
Dave Chinner  writes:

> On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
>> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> You're assuming that only the DAX aware application accesses it's
> files.  users, backup programs, data replicators, fileystem
> re-organisers (e.g.  defragmenters) etc all may access the files and
> they may throw errors. What then?

I'm not sure how this is any different from regular storage.  If an
application gets EIO, it's up to the app to decide what to do with that.

>> > Where does the application find the data that was lost to be able to
>> > rewrite it?
>> 
>> The data that was lost is gone -- this assumes the application has some
>> ability to recover using a journal/log or other redundancy - yes, at the
>> application layer. If it doesn't have this sort of capability, the only
>> option is to restore files from a backup/mirror.
>
> So the architecture has a built in assumption that only userspace
> can handle data loss?

Remember that the proposed programming model completely bypasses the
kernel, so yes, it is expected that user-space will have to deal with
the problem.

> What about filesytsems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?

I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
with this for existing file systems, right?

> If so, shouldn't that be the focus of development rahter than
> placing the burden on userspace apps to handle storage repair
> situations?

It really depends on the programming model.  In the model Vishal is
talking about, either the applications themselves or the libraries they
link to are expected to implement the redundancies where necessary.

>> > There's an implicit assumption that applications will keep redundant
>> > copies of their data at the /application layer/ and be able to
>> > automatically repair it?

That's one way to do things.  It really depends on the application what
it will do for recovery.

>> > And then there's the implicit assumption that it will unlink and
>> > free the entire file before writing a new copy

I think Vishal was referring to restoring from backup.  cp itself will
truncate the file before overwriting, iirc.

>> To summarize, the two cases we want to handle are:
>> 1. Application has inbuilt recovery:
>>   - hits badblock
>>   - figures out it is able to recover the data
>>   - handles SIGBUS or EIO
>>   - does a (sector aligned) write() to restore the data
>
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it  can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

I read that "figures out" step as the application determining whether or
not it had a redundant copy.

> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.

Do we even need that?  What if we added an FIEMAP flag for determining
bad blocks.  The file system could simply walk the list of extents for
the file and check the corresponding disk blocks.  No reverse mapping
required.  Also note that DM/MD don't support direct_access(), either,
so I don't think they're relevant for this discussion.

>> 2. Application doesn't have any inbuilt recovery mechanism
>>   - hits badblock
>>   - gets SIGBUS (or EIO) and crashes
>>   - Sysadmin restores file from backup
>
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
>
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g.  multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
>
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably

Not all storage is configured into a RAID volume, and in some instances,
the application is better positioned to recover the data (gluster/ceph,
for example).  It really comes down to whether applications or libraries
will want to 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-05-02 Thread Jeff Moyer
Dave Chinner  writes:

> On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
>> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> You're assuming that only the DAX aware application accesses it's
> files.  users, backup programs, data replicators, fileystem
> re-organisers (e.g.  defragmenters) etc all may access the files and
> they may throw errors. What then?

I'm not sure how this is any different from regular storage.  If an
application gets EIO, it's up to the app to decide what to do with that.

>> > Where does the application find the data that was lost to be able to
>> > rewrite it?
>> 
>> The data that was lost is gone -- this assumes the application has some
>> ability to recover using a journal/log or other redundancy - yes, at the
>> application layer. If it doesn't have this sort of capability, the only
>> option is to restore files from a backup/mirror.
>
> So the architecture has a built in assumption that only userspace
> can handle data loss?

Remember that the proposed programming model completely bypasses the
kernel, so yes, it is expected that user-space will have to deal with
the problem.

> What about filesytsems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?

I don't think we'll /only/ support NOVA for pmem.  So we'll have to deal
with this for existing file systems, right?

> If so, shouldn't that be the focus of development rahter than
> placing the burden on userspace apps to handle storage repair
> situations?

It really depends on the programming model.  In the model Vishal is
talking about, either the applications themselves or the libraries they
link to are expected to implement the redundancies where necessary.

>> > There's an implicit assumption that applications will keep redundant
>> > copies of their data at the /application layer/ and be able to
>> > automatically repair it?

That's one way to do things.  It really depends on the application what
it will do for recovery.

>> > And then there's the implicit assumption that it will unlink and
>> > free the entire file before writing a new copy

I think Vishal was referring to restoring from backup.  cp itself will
truncate the file before overwriting, iirc.

>> To summarize, the two cases we want to handle are:
>> 1. Application has inbuilt recovery:
>>   - hits badblock
>>   - figures out it is able to recover the data
>>   - handles SIGBUS or EIO
>>   - does a (sector aligned) write() to restore the data
>
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it  can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

I read that "figures out" step as the application determining whether or
not it had a redundant copy.

> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.

Do we even need that?  What if we added an FIEMAP flag for determining
bad blocks.  The file system could simply walk the list of extents for
the file and check the corresponding disk blocks.  No reverse mapping
required.  Also note that DM/MD don't support direct_access(), either,
so I don't think they're relevant for this discussion.

>> 2. Application doesn't have any inbuilt recovery mechanism
>>   - hits badblock
>>   - gets SIGBUS (or EIO) and crashes
>>   - Sysadmin restores file from backup
>
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
>
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g.  multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
>
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably

Not all storage is configured into a RAID volume, and in some instances,
the application is better positioned to recover the data (gluster/ceph,
for example).  It really comes down to whether applications or libraries
will want to implement redundancy 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Dan Williams
On Tue, Apr 26, 2016 at 8:31 AM, Jan Kara  wrote:
> On Tue 26-04-16 07:59:10, Dan Williams wrote:
>> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner  wrote:
>> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
>> [..]
>> > It seems to me you are focussing on code/technologies that exist
>> > today instead of trying to define an architecture that is more
>> > optimal for pmem storage systems. Yes, working code is great, but if
>> > you can't tell people how things like robust error handling and
>> > redundancy are going to work in future then it's going to take
>> > forever for everyone else to handle such errors robustly through the
>> > storage stack...
>>
>> Precisely because higher order redundancy is built on top this baseline.
>>
>> MD-RAID can't do it's error recovery if we don't have -EIO and
>> clear-error-on-write.  On the other hand, you're absolutely right that
>> we have a gaping hole on top of the SIGBUS recovery model, and don't
>> have a kernel layer we can interpose on top of DAX to provide some
>> semblance of redundancy.
>>
>> In the meantime, a handful of applications with a team of full-time
>> site-reliability-engineers may be able to plug in external redundancy
>> infrastructure on top of what is defined in these patches.  For
>> everyone else, the hard problem, we need to do a lot more thinking
>> about a trap and recover solution.
>
> So we could actually implement some kind of redundancy with DAX with
> reasonable effort. We already do track dirty storage PFNs in the radix
> tree. After DAX locking patches get merged we also have a reliable way to
> write-protect them when we decide to do 'writeback' (translates to flushing
> CPU caches) for them. When we do that, we have all the infrastructure in
> place to provide 'stable pages' while some mirroring or other redundancy
> mechanism in kernel works with the data.
>
> But as Dave said, we should do some writeup of how this is all supposed to
> work and e.g. which layer is going to be responsible for the redundancy. Do
> we want to have that in DAX code? Or just provide stable page guarantees
> from DAX and do the redundancy from device mapper? This needs more
> thought...
>

[ adding Mike, since his ears are likely burning by this point ]

If we had the ability to specify a range or list of ranges to
blkdev_issue_flush() that would allow the driver level to implement
redundancy at sync time.  And no, before someone flies off the handle,
this isn't rehashing the same argument I lost about where to track
dirty pfns.  Rather this relies on the radix to track dirty pfns, but
asks the driver to do the flush operation.  In the nominal case this
is a clflush / clwb loop or wbinvd in the pmem driver, in the
redundancy case the pmem driver is swapped out for a driver that uses
the flush request as a trigger point to synchronize redundant data.

We want this at the driver level to take advantage of standard
asynchronous completions, and make it administratively equivalent to
the dm/md layering people are used to using.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Dan Williams
On Tue, Apr 26, 2016 at 8:31 AM, Jan Kara  wrote:
> On Tue 26-04-16 07:59:10, Dan Williams wrote:
>> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner  wrote:
>> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
>> [..]
>> > It seems to me you are focussing on code/technologies that exist
>> > today instead of trying to define an architecture that is more
>> > optimal for pmem storage systems. Yes, working code is great, but if
>> > you can't tell people how things like robust error handling and
>> > redundancy are going to work in future then it's going to take
>> > forever for everyone else to handle such errors robustly through the
>> > storage stack...
>>
>> Precisely because higher order redundancy is built on top this baseline.
>>
>> MD-RAID can't do it's error recovery if we don't have -EIO and
>> clear-error-on-write.  On the other hand, you're absolutely right that
>> we have a gaping hole on top of the SIGBUS recovery model, and don't
>> have a kernel layer we can interpose on top of DAX to provide some
>> semblance of redundancy.
>>
>> In the meantime, a handful of applications with a team of full-time
>> site-reliability-engineers may be able to plug in external redundancy
>> infrastructure on top of what is defined in these patches.  For
>> everyone else, the hard problem, we need to do a lot more thinking
>> about a trap and recover solution.
>
> So we could actually implement some kind of redundancy with DAX with
> reasonable effort. We already do track dirty storage PFNs in the radix
> tree. After DAX locking patches get merged we also have a reliable way to
> write-protect them when we decide to do 'writeback' (translates to flushing
> CPU caches) for them. When we do that, we have all the infrastructure in
> place to provide 'stable pages' while some mirroring or other redundancy
> mechanism in kernel works with the data.
>
> But as Dave said, we should do some writeup of how this is all supposed to
> work and e.g. which layer is going to be responsible for the redundancy. Do
> we want to have that in DAX code? Or just provide stable page guarantees
> from DAX and do the redundancy from device mapper? This needs more
> thought...
>

[ adding Mike, since his ears are likely burning by this point ]

If we had the ability to specify a range or list of ranges to
blkdev_issue_flush() that would allow the driver level to implement
redundancy at sync time.  And no, before someone flies off the handle,
this isn't rehashing the same argument I lost about where to track
dirty pfns.  Rather this relies on the radix to track dirty pfns, but
asks the driver to do the flush operation.  In the nominal case this
is a clflush / clwb loop or wbinvd in the pmem driver, in the
redundancy case the pmem driver is swapped out for a driver that uses
the flush request as a trigger point to synchronize redundant data.

We want this at the driver level to take advantage of standard
asynchronous completions, and make it administratively equivalent to
the dm/md layering people are used to using.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Jan Kara
On Tue 26-04-16 07:59:10, Dan Williams wrote:
> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> [..]
> > It seems to me you are focussing on code/technologies that exist
> > today instead of trying to define an architecture that is more
> > optimal for pmem storage systems. Yes, working code is great, but if
> > you can't tell people how things like robust error handling and
> > redundancy are going to work in future then it's going to take
> > forever for everyone else to handle such errors robustly through the
> > storage stack...
> 
> Precisely because higher order redundancy is built on top this baseline.
> 
> MD-RAID can't do it's error recovery if we don't have -EIO and
> clear-error-on-write.  On the other hand, you're absolutely right that
> we have a gaping hole on top of the SIGBUS recovery model, and don't
> have a kernel layer we can interpose on top of DAX to provide some
> semblance of redundancy.
> 
> In the meantime, a handful of applications with a team of full-time
> site-reliability-engineers may be able to plug in external redundancy
> infrastructure on top of what is defined in these patches.  For
> everyone else, the hard problem, we need to do a lot more thinking
> about a trap and recover solution.

So we could actually implement some kind of redundancy with DAX with
reasonable effort. We already do track dirty storage PFNs in the radix
tree. After DAX locking patches get merged we also have a reliable way to
write-protect them when we decide to do 'writeback' (translates to flushing
CPU caches) for them. When we do that, we have all the infrastructure in
place to provide 'stable pages' while some mirroring or other redundancy
mechanism in kernel works with the data.

But as Dave said, we should do some writeup of how this is all supposed to
work and e.g. which layer is going to be responsible for the redundancy. Do
we want to have that in DAX code? Or just provide stable page guarantees
from DAX and do the redundancy from device mapper? This needs more
thought...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Jan Kara
On Tue 26-04-16 07:59:10, Dan Williams wrote:
> On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> [..]
> > It seems to me you are focussing on code/technologies that exist
> > today instead of trying to define an architecture that is more
> > optimal for pmem storage systems. Yes, working code is great, but if
> > you can't tell people how things like robust error handling and
> > redundancy are going to work in future then it's going to take
> > forever for everyone else to handle such errors robustly through the
> > storage stack...
> 
> Precisely because higher order redundancy is built on top this baseline.
> 
> MD-RAID can't do it's error recovery if we don't have -EIO and
> clear-error-on-write.  On the other hand, you're absolutely right that
> we have a gaping hole on top of the SIGBUS recovery model, and don't
> have a kernel layer we can interpose on top of DAX to provide some
> semblance of redundancy.
> 
> In the meantime, a handful of applications with a team of full-time
> site-reliability-engineers may be able to plug in external redundancy
> infrastructure on top of what is defined in these patches.  For
> everyone else, the hard problem, we need to do a lot more thinking
> about a trap and recover solution.

So we could actually implement some kind of redundancy with DAX with
reasonable effort. We already do track dirty storage PFNs in the radix
tree. After DAX locking patches get merged we also have a reliable way to
write-protect them when we decide to do 'writeback' (translates to flushing
CPU caches) for them. When we do that, we have all the infrastructure in
place to provide 'stable pages' while some mirroring or other redundancy
mechanism in kernel works with the data.

But as Dave said, we should do some writeup of how this is all supposed to
work and e.g. which layer is going to be responsible for the redundancy. Do
we want to have that in DAX code? Or just provide stable page guarantees
from DAX and do the redundancy from device mapper? This needs more
thought...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Dan Williams
On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
[..]
> It seems to me you are focussing on code/technologies that exist
> today instead of trying to define an architecture that is more
> optimal for pmem storage systems. Yes, working code is great, but if
> you can't tell people how things like robust error handling and
> redundancy are going to work in future then it's going to take
> forever for everyone else to handle such errors robustly through the
> storage stack...

Precisely because higher order redundancy is built on top this baseline.

MD-RAID can't do it's error recovery if we don't have -EIO and
clear-error-on-write.  On the other hand, you're absolutely right that
we have a gaping hole on top of the SIGBUS recovery model, and don't
have a kernel layer we can interpose on top of DAX to provide some
semblance of redundancy.

In the meantime, a handful of applications with a team of full-time
site-reliability-engineers may be able to plug in external redundancy
infrastructure on top of what is defined in these patches.  For
everyone else, the hard problem, we need to do a lot more thinking
about a trap and recover solution.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Dan Williams
On Tue, Apr 26, 2016 at 1:27 AM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
[..]
> It seems to me you are focussing on code/technologies that exist
> today instead of trying to define an architecture that is more
> optimal for pmem storage systems. Yes, working code is great, but if
> you can't tell people how things like robust error handling and
> redundancy are going to work in future then it's going to take
> forever for everyone else to handle such errors robustly through the
> storage stack...

Precisely because higher order redundancy is built on top this baseline.

MD-RAID can't do it's error recovery if we don't have -EIO and
clear-error-on-write.  On the other hand, you're absolutely right that
we have a gaping hole on top of the SIGBUS recovery model, and don't
have a kernel layer we can interpose on top of DAX to provide some
semblance of redundancy.

In the meantime, a handful of applications with a team of full-time
site-reliability-engineers may be able to plug in external redundancy
infrastructure on top of what is defined in these patches.  For
everyone else, the hard problem, we need to do a lot more thinking
about a trap and recover solution.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Vishal Verma
On Tue, 2016-04-26 at 01:33 -0700, h...@infradead.org wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> > 
> > - Application hits EIO doing dax_IO or load/store io
> > 
> > - It checks badblocks and discovers it's files have lost data
> > 
> > - It write()s those sectors (possibly converted to file offsets
> > using
> > fiemap)
> > ?? ?? * This triggers the fallback path, but if the application is
> > doing
> > this level of recovery, it will know the sector is bad, and write
> > the
> > entire sector
> This sounds like a mess.
> 
> > 
> > I think if we want to keep allowing arbitrary alignments for the
> > dax_do_io path, we'd need:
> > 1. To represent badblocks at a finer granularity (likely cache
> > lines)
> > 2. To allow the driver to do IO to a *block device* at sub-sector
> > granularity
> It's not a block device if it supports DAX.  It's byte addressable
> memory masquerading as a block device.

Yes but we made that decision a while back with pmem :)
Are you saying it should stop being a block device anymore?

> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Vishal Verma
On Tue, 2016-04-26 at 01:33 -0700, h...@infradead.org wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> > 
> > - Application hits EIO doing dax_IO or load/store io
> > 
> > - It checks badblocks and discovers it's files have lost data
> > 
> > - It write()s those sectors (possibly converted to file offsets
> > using
> > fiemap)
> > ?? ?? * This triggers the fallback path, but if the application is
> > doing
> > this level of recovery, it will know the sector is bad, and write
> > the
> > entire sector
> This sounds like a mess.
> 
> > 
> > I think if we want to keep allowing arbitrary alignments for the
> > dax_do_io path, we'd need:
> > 1. To represent badblocks at a finer granularity (likely cache
> > lines)
> > 2. To allow the driver to do IO to a *block device* at sub-sector
> > granularity
> It's not a block device if it supports DAX.  It's byte addressable
> memory masquerading as a block device.

Yes but we made that decision a while back with pmem :)
Are you saying it should stop being a block device anymore?

> --
> To unsubscribe from this list: send the line "unsubscribe linux-
> block" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Vishal Verma
On Tue, 2016-04-26 at 10:41 +1000, Dave Chinner wrote:
> <>

> > The application doesn't have to scan the entire filesystem, but
> > presumably it knows what files it 'owns', and does a fiemap for
> > those.
> You're assuming that only the DAX aware application accesses it's
> files.  users, backup programs, data replicators, fileystem
> re-organisers (e.g.  defragmenters) etc all may access the files and
> they may throw errors. What then?

In this scenario, backup applications etc that try to read that data
before it has been replaced will just hit the errors and fail..

> 

<>

> > The data that was lost is gone -- this assumes the application has
> > some
> > ability to recover using a journal/log or other redundancy - yes,
> > at the
> > application layer. If it doesn't have this sort of capability, the
> > only
> > option is to restore files from a backup/mirror.
> So the architecture has a built in assumption that only userspace
> can handle data loss?
> 
> What about filesytsems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?
> 
> If so, shouldn't that be the focus of development rahter than
> placing the burden on userspace apps to handle storage repair
> situations?

Agreed that file systems like NOVA can be designed to handle this
better, but haven't you said in the past that it may take years for a
new file system to become production ready, and that DAX is the until-
then solution that gets us most of the way there.. I think we just want
to ensure that current-DAX has some way to deal with errors, and these
patches provide an admin-intervention recovery path and possibly
another if the app wants to try something fancy for recovery.

<>
> 
> > 
> > To summarize, the two cases we want to handle are:
> > 1. Application has inbuilt recovery:
> >   - hits badblock
> >   - figures out it is able to recover the data
> >   - handles SIGBUS or EIO
> >   - does a (sector aligned) write() to restore the data
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it  can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

Ah when I said "figures out" I was only thinking if the application has
some redundancy/jouranlling, and if it can recover using that -- not
additional recovery mechanisms at the block/fs layer.

> 
> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.
> 
> > 
> > 2. Application doesn't have any inbuilt recovery mechanism
> >   - hits badblock
> >   - gets SIGBUS (or EIO) and crashes
> >   - Sysadmin restores file from backup
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
> 
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g.  multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
> 
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably

Agreed that the pmem stack could use more redundancy and error
correction, perhaps enabling md-raid to raid pmem devices and then
enable DAX on top of that and we'll have a better chance to handle
errors, but that level of recovery isn't what these patches are aiming
for -- that is obviously a longer term effort. These simply aim to
provide that disaster recovery path when a single sector failure does
take down the service.

Today, on a dax enabled filesystem, if/when the app hits an error and
crashes, dax is simply disabled till the errors are gone. This is
obviously less than ideal. (This was done because there is currently no
way for a DAX file system to send any IO - mmap or otherwise - through
the driver, including zeroing of new fs blocks). These patches enable
the DR path by allowing some non-mmap IO (most importantly zeroing) to
go through the driver which can tell the device to do some remapping
etc.

So, yes, this is very much a DR case in our current pmem+dax
architecture, 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Vishal Verma
On Tue, 2016-04-26 at 10:41 +1000, Dave Chinner wrote:
> <>

> > The application doesn't have to scan the entire filesystem, but
> > presumably it knows what files it 'owns', and does a fiemap for
> > those.
> You're assuming that only the DAX aware application accesses it's
> files.  users, backup programs, data replicators, fileystem
> re-organisers (e.g.  defragmenters) etc all may access the files and
> they may throw errors. What then?

In this scenario, backup applications etc that try to read that data
before it has been replaced will just hit the errors and fail..

> 

<>

> > The data that was lost is gone -- this assumes the application has
> > some
> > ability to recover using a journal/log or other redundancy - yes,
> > at the
> > application layer. If it doesn't have this sort of capability, the
> > only
> > option is to restore files from a backup/mirror.
> So the architecture has a built in assumption that only userspace
> can handle data loss?
> 
> What about filesytsems like NOVA, that use log structured design to
> provide DAX w/ update atomicity and can potentially also provide
> redundancy/repair through the same mechanisms? Won't pmem native
> filesystems with built in data protection features like this remove
> the need for adding all this to userspace applications?
> 
> If so, shouldn't that be the focus of development rahter than
> placing the burden on userspace apps to handle storage repair
> situations?

Agreed that file systems like NOVA can be designed to handle this
better, but haven't you said in the past that it may take years for a
new file system to become production ready, and that DAX is the until-
then solution that gets us most of the way there.. I think we just want
to ensure that current-DAX has some way to deal with errors, and these
patches provide an admin-intervention recovery path and possibly
another if the app wants to try something fancy for recovery.

<>
> 
> > 
> > To summarize, the two cases we want to handle are:
> > 1. Application has inbuilt recovery:
> >   - hits badblock
> >   - figures out it is able to recover the data
> >   - handles SIGBUS or EIO
> >   - does a (sector aligned) write() to restore the data
> The "figures out" step here is where >95% of the work we'd have to
> do is. And that's in filesystem and block layer code, not
> userspace, and userspace can't do that work in a signal handler.
> And it  can still fall down to the second case when the application
> doesn't have another copy of the data somewhere.

Ah when I said "figures out" I was only thinking if the application has
some redundancy/jouranlling, and if it can recover using that -- not
additional recovery mechanisms at the block/fs layer.

> 
> FWIW, we don't have a DAX enabled filesystem that can do
> reverse block mapping, so we're a year or two away from this being a
> workable production solution from the filesystem perspective. And
> AFAICT, it's not even on the roadmap for dm/md layers.
> 
> > 
> > 2. Application doesn't have any inbuilt recovery mechanism
> >   - hits badblock
> >   - gets SIGBUS (or EIO) and crashes
> >   - Sysadmin restores file from backup
> Which is no different to an existing non-DAX application getting an
> EIO/sigbus from current storage technologies.
> 
> Except: in the existing storage stack, redundancy and correction has
> already had to have failed for the application to see such an error.
> Hence this is normally considered a DR case as there's had to be
> cascading failures (e.g.  multiple disk failures in a RAID) to get
> to this stage, not a single error in a single sector in
> non-redundant storage.
> 
> We need some form of redundancy and correction in the PMEM stack to
> prevent single sector errors from taking down services until an
> administrator can correct the problem. I'm trying to understand
> where this is supposed to fit into the picture - at this point I
> really don't think userspace applications are going to be able to do
> this reliably

Agreed that the pmem stack could use more redundancy and error
correction, perhaps enabling md-raid to raid pmem devices and then
enable DAX on top of that and we'll have a better chance to handle
errors, but that level of recovery isn't what these patches are aiming
for -- that is obviously a longer term effort. These simply aim to
provide that disaster recovery path when a single sector failure does
take down the service.

Today, on a dax enabled filesystem, if/when the app hits an error and
crashes, dax is simply disabled till the errors are gone. This is
obviously less than ideal. (This was done because there is currently no
way for a DAX file system to send any IO - mmap or otherwise - through
the driver, including zeroing of new fs blocks). These patches enable
the DR path by allowing some non-mmap IO (most importantly zeroing) to
go through the driver which can tell the device to do some remapping
etc.

So, yes, this is very much a DR case in our current pmem+dax
architecture, 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread h...@infradead.org
On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> - Application hits EIO doing dax_IO or load/store io
> 
> - It checks badblocks and discovers it's files have lost data
> 
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> ?? ?? * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector

This sounds like a mess.

> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity

It's not a block device if it supports DAX.  It's byte addressable
memory masquerading as a block device.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread h...@infradead.org
On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> - Application hits EIO doing dax_IO or load/store io
> 
> - It checks badblocks and discovers it's files have lost data
> 
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> ?? ?? * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector

This sounds like a mess.

> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity

It's not a block device if it supports DAX.  It's byte addressable
memory masquerading as a block device.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread h...@infradead.org
On Mon, Apr 25, 2016 at 11:32:08AM -0400, Jeff Moyer wrote:
> > EINVAL is a concern here.  Not due to the right error reported, but
> > because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignbment concernes for a highly specific degraded case.
> >
> > I think this whole series need to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
> 
> The only callers of dax_do_io are direct_IO methods.

They are because the DAX I/O pass is a mess, but that doesn't mean
the user specific O_DIRECT on the open nessecarily.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread h...@infradead.org
On Mon, Apr 25, 2016 at 11:32:08AM -0400, Jeff Moyer wrote:
> > EINVAL is a concern here.  Not due to the right error reported, but
> > because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignbment concernes for a highly specific degraded case.
> >
> > I think this whole series need to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
> 
> The only callers of dax_do_io are direct_IO methods.

They are because the DAX I/O pass is a mess, but that doesn't mean
the user specific O_DIRECT on the open nessecarily.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Dave Chinner
On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 7:56 PM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
> >> > I haven't seen any design/documentation for infrastructure at the
> >> > application layer to handle redundant data and correctly
> >> > transparently so I don't have any idea what the technical
> >> > requirements this different IO stack places on filesystems may be.
> >> > Hence I'm asking for some kind of architecture/design documentation
> >> > that I can read to understand exactly what is being proposed here...
> >>
> >> I think this is a discussion for a solution that would build on top of
> >> this basic "here are the errors, re-write them with good data if you
> >> can; otherwise, best of luck" foundation.  Something like a DAX-aware
> >> device mapper layer that duplicates data tagged with REQ_META so at
> >> least we have a recovery path when a sector error lands in critical
> >> filesystem-metadata.
> >
> > Filesytsem metadata is not the topic of discussion here - it's
> > user data that throws an error on a DAX load/store that is the
> > issue.
> 
> Which is not a new problem since volatile DRAM in the non-DAX case can
> throw the exact same error.

They are not the same class of error, not by a long shot.

The "bad page in page cache" error on traditional storage means data
is not lost - the original copy still in whatever storage medium
that the cached page was filled from. i.e. Re-read the file and the
data is still there, which is no different to crashing and
restarting that machine and losing whatever writes had not been
committed to stable storage..

In the pmem case, a "bad page" is a permanent loss of data - it's
unrecoverable without some form data recovery operation being
performed on the storage.

> The current recovery model there is crash
> the kernel (without MCE recovery),

Ouch. Permanent data loss and a system wide DoS.

> or crash the application and hope
> the kernel maps out the page or the application knows how to restart
> after SIGBUS. 

Not much better - neither provide a mechanism for recovery.

> Memory mirroring is meant to make this a bit less
> harsh, but there's no mechanism to make this available outside the
> kernel.

Which implies that we need a DM module that interfaces with the
hardware memory mirroring to perform recovery and remapping
operations. i.e. in the traditional storage stack location.

> >> However, anything we come up with to make NVDIMM
> >> errors more survivable should be directly applicable to traditional
> >> disk storage as well.
> >
> > I'm not sure it does. DAX implies that traditional block layer RAID
> > infrastructure is not possible, nor are data CRCs, nor are any other
> > sort of data transformations that are needed for redundancy at the
> > device layers. Anything that relies on copying/modifying/stable data to
> > provide redundancies needs to do such work at a place where it can
> > stall userspace page faults.
> >
> > This is where pmem native filesystem designs like NOVA take over
> > from traditional block based filesystems - they are designed around
> > the ability to do atomic page-based operations for data protection
> > and recovery operations. It is this mechanism that allows stable
> > pages to be committed to permanent storage and as such, allow
> > redundancy operations such as mirroring to be performed before
> > operations are marked as "stable".
> >
> > I'm missing the bigger picture that is being aimed at here - what's the
> > point of DAX if we have to turn it off if we want any sort of
> > failure protection? What's the big plan for fully enabling DAX with
> > robust error correction? Where is this all supposed to be leading
> > to?
> >
> 
> NOVA and other solutions are free and encouraged to do a coherent
> bottoms-up rethink of error handling on top of persistent memory
> devices, in the meantime applications can only expect the legacy
> SIGBUS and -EIO mechanisms are available.  So I'm still trying to
> connect how the "What would NOVA do?" discussion is anything but
> orthogonal to hooking up SIGBUS and -EIO for traditional-filesystem
> DAX.  It's the only error model an application can expect because it's
> the only one that currently exists.



Yes, I get that. I'm not interested in the resultant fatal error
delivery - I'm asking about what happens between the memory error
and the delivery of the fatal "we've lost your data forever" error
that gets delivered to userspace. i.e. I'm after  a description of
how error correction/recovery is supposed to be applied to DAX
*before we report SIGBUS or EIO* to the application.

What is the plan/model/vision for intercepting MCEs and recovering
from them? e.g. how do we going to pull the good copy from
hardware/software memory mirrors? What layer is supposed to be
responsible for that? Is it different for hardware mirroring
compared to a more traditional 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-26 Thread Dave Chinner
On Mon, Apr 25, 2016 at 09:18:42PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 7:56 PM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
> >> > I haven't seen any design/documentation for infrastructure at the
> >> > application layer to handle redundant data and correctly
> >> > transparently so I don't have any idea what the technical
> >> > requirements this different IO stack places on filesystems may be.
> >> > Hence I'm asking for some kind of architecture/design documentation
> >> > that I can read to understand exactly what is being proposed here...
> >>
> >> I think this is a discussion for a solution that would build on top of
> >> this basic "here are the errors, re-write them with good data if you
> >> can; otherwise, best of luck" foundation.  Something like a DAX-aware
> >> device mapper layer that duplicates data tagged with REQ_META so at
> >> least we have a recovery path when a sector error lands in critical
> >> filesystem-metadata.
> >
> > Filesytsem metadata is not the topic of discussion here - it's
> > user data that throws an error on a DAX load/store that is the
> > issue.
> 
> Which is not a new problem since volatile DRAM in the non-DAX case can
> throw the exact same error.

They are not the same class of error, not by a long shot.

The "bad page in page cache" error on traditional storage means data
is not lost - the original copy still in whatever storage medium
that the cached page was filled from. i.e. Re-read the file and the
data is still there, which is no different to crashing and
restarting that machine and losing whatever writes had not been
committed to stable storage..

In the pmem case, a "bad page" is a permanent loss of data - it's
unrecoverable without some form data recovery operation being
performed on the storage.

> The current recovery model there is crash
> the kernel (without MCE recovery),

Ouch. Permanent data loss and a system wide DoS.

> or crash the application and hope
> the kernel maps out the page or the application knows how to restart
> after SIGBUS. 

Not much better - neither provide a mechanism for recovery.

> Memory mirroring is meant to make this a bit less
> harsh, but there's no mechanism to make this available outside the
> kernel.

Which implies that we need a DM module that interfaces with the
hardware memory mirroring to perform recovery and remapping
operations. i.e. in the traditional storage stack location.

> >> However, anything we come up with to make NVDIMM
> >> errors more survivable should be directly applicable to traditional
> >> disk storage as well.
> >
> > I'm not sure it does. DAX implies that traditional block layer RAID
> > infrastructure is not possible, nor are data CRCs, nor are any other
> > sort of data transformations that are needed for redundancy at the
> > device layers. Anything that relies on copying/modifying/stable data to
> > provide redundancies needs to do such work at a place where it can
> > stall userspace page faults.
> >
> > This is where pmem native filesystem designs like NOVA take over
> > from traditional block based filesystems - they are designed around
> > the ability to do atomic page-based operations for data protection
> > and recovery operations. It is this mechanism that allows stable
> > pages to be committed to permanent storage and as such, allow
> > redundancy operations such as mirroring to be performed before
> > operations are marked as "stable".
> >
> > I'm missing the bigger picture that is being aimed at here - what's the
> > point of DAX if we have to turn it off if we want any sort of
> > failure protection? What's the big plan for fully enabling DAX with
> > robust error correction? Where is this all supposed to be leading
> > to?
> >
> 
> NOVA and other solutions are free and encouraged to do a coherent
> bottoms-up rethink of error handling on top of persistent memory
> devices, in the meantime applications can only expect the legacy
> SIGBUS and -EIO mechanisms are available.  So I'm still trying to
> connect how the "What would NOVA do?" discussion is anything but
> orthogonal to hooking up SIGBUS and -EIO for traditional-filesystem
> DAX.  It's the only error model an application can expect because it's
> the only one that currently exists.



Yes, I get that. I'm not interested in the resultant fatal error
delivery - I'm asking about what happens between the memory error
and the delivery of the fatal "we've lost your data forever" error
that gets delivered to userspace. i.e. I'm after  a description of
how error correction/recovery is supposed to be applied to DAX
*before we report SIGBUS or EIO* to the application.

What is the plan/model/vision for intercepting MCEs and recovering
from them? e.g. how do we going to pull the good copy from
hardware/software memory mirrors? What layer is supposed to be
responsible for that? Is it different for hardware mirroring
compared to a more traditional software dm-RAID1 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 7:56 PM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
[..]
>> Otherwise, if an application wants to use DAX then it might
>> need to be prepared to handle media errors itself same as the
>> un-RAIDed disk case.  Yes, at an administrative level without
>> reverse-mapping support from a filesystem there's presently no way to
>> ask "which files on this fs are impacted by media errors", and we're
>> aware that reverse-mapping capabilities are nascent for current
>> DAX-aware filesystems.
>
> Precisely my point - suggestions are being proposed which assume
> use of infrastructure that *does not exist yet* and has not been
> discussed or documented. If we're expecting such infrastructure to
> be implemented in the filesystems and block device drivers, then we
> need to determine that the error model actually works first...

These patches only assume the clear-error-on write-model, and that
*maybe* the sysfs bad blocks list is useful if the filesystem has a
reverse-map, or if the application can compare the list against the
results of fiemap().  Beyond that, this is the same perennial "we
should really have better error coordination between block device and
filesystems" discussions that we have at LSF.

>
>> The forward lookup path, as impractical as it
>> is for large numbers of files, is available if an application wanted
>> to know if a specific file was impacted.  We've discussed possibly
>> extending fiemap() to return bad blocks in a file rather than
>> consulting sysfs, or extending lseek() with something like SEEK_ERROR
>> to return offsets of bad areas in a file.
>
> Via what infrastructure will the filesystem use for finding out
> whether a file has bad blocks in it? And if the file does have bad
> blocks, what are you expecting the filesystem to do with that
> information?

We currently have no expectation that the filesystem does anything
with the bad blocks list.  However, if a filesystem had btrfs-like
capabilities to recover data from a redundant location we'd be looking
to plug into that infrastructure.

>> > I haven't seen any design/documentation for infrastructure at the
>> > application layer to handle redundant data and correctly
>> > transparently so I don't have any idea what the technical
>> > requirements this different IO stack places on filesystems may be.
>> > Hence I'm asking for some kind of architecture/design documentation
>> > that I can read to understand exactly what is being proposed here...
>>
>> I think this is a discussion for a solution that would build on top of
>> this basic "here are the errors, re-write them with good data if you
>> can; otherwise, best of luck" foundation.  Something like a DAX-aware
>> device mapper layer that duplicates data tagged with REQ_META so at
>> least we have a recovery path when a sector error lands in critical
>> filesystem-metadata.
>
> Filesytsem metadata is not the topic of discussion here - it's
> user data that throws an error on a DAX load/store that is the
> issue.

Which is not a new problem since volatile DRAM in the non-DAX case can
throw the exact same error.  The current recovery model there is crash
the kernel (without MCE recovery), or crash the application and hope
the kernel maps out the page or the application knows how to restart
after SIGBUS.  Memory mirroring is meant to make this a bit less
harsh, but there's no mechanism to make this available outside the
kernel.

>> However, anything we come up with to make NVDIMM
>> errors more survivable should be directly applicable to traditional
>> disk storage as well.
>
> I'm not sure it does. DAX implies that traditional block layer RAID
> infrastructure is not possible, nor are data CRCs, nor are any other
> sort of data transformations that are needed for redundancy at the
> device layers. Anything that relies on copying/modifying/stable data to
> provide redundancies needs to do such work at a place where it can
> stall userspace page faults.
>
> This is where pmem native filesystem designs like NOVA take over
> from traditional block based filesystems - they are designed around
> the ability to do atomic page-based operations for data protection
> and recovery operations. It is this mechanism that allows stable
> pages to be committed to permanent storage and as such, allow
> redundancy operations such as mirroring to be performed before
> operations are marked as "stable".
>
> I'm missing the bigger picture that is being aimed at here - what's the
> point of DAX if we have to turn it off if we want any sort of
> failure protection? What's the big plan for fully enabling DAX with
> robust error correction? Where is this all supposed to be leading
> to?
>

NOVA and other solutions are free and encouraged to do a coherent
bottoms-up rethink of error handling on top of persistent memory
devices, in the meantime applications can only expect the legacy
SIGBUS and -EIO mechanisms are 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 7:56 PM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
[..]
>> Otherwise, if an application wants to use DAX then it might
>> need to be prepared to handle media errors itself same as the
>> un-RAIDed disk case.  Yes, at an administrative level without
>> reverse-mapping support from a filesystem there's presently no way to
>> ask "which files on this fs are impacted by media errors", and we're
>> aware that reverse-mapping capabilities are nascent for current
>> DAX-aware filesystems.
>
> Precisely my point - suggestions are being proposed which assume
> use of infrastructure that *does not exist yet* and has not been
> discussed or documented. If we're expecting such infrastructure to
> be implemented in the filesystems and block device drivers, then we
> need to determine that the error model actually works first...

These patches only assume the clear-error-on write-model, and that
*maybe* the sysfs bad blocks list is useful if the filesystem has a
reverse-map, or if the application can compare the list against the
results of fiemap().  Beyond that, this is the same perennial "we
should really have better error coordination between block device and
filesystems" discussions that we have at LSF.

>
>> The forward lookup path, as impractical as it
>> is for large numbers of files, is available if an application wanted
>> to know if a specific file was impacted.  We've discussed possibly
>> extending fiemap() to return bad blocks in a file rather than
>> consulting sysfs, or extending lseek() with something like SEEK_ERROR
>> to return offsets of bad areas in a file.
>
> Via what infrastructure will the filesystem use for finding out
> whether a file has bad blocks in it? And if the file does have bad
> blocks, what are you expecting the filesystem to do with that
> information?

We currently have no expectation that the filesystem does anything
with the bad blocks list.  However, if a filesystem had btrfs-like
capabilities to recover data from a redundant location we'd be looking
to plug into that infrastructure.

>> > I haven't seen any design/documentation for infrastructure at the
>> > application layer to handle redundant data and correctly
>> > transparently so I don't have any idea what the technical
>> > requirements this different IO stack places on filesystems may be.
>> > Hence I'm asking for some kind of architecture/design documentation
>> > that I can read to understand exactly what is being proposed here...
>>
>> I think this is a discussion for a solution that would build on top of
>> this basic "here are the errors, re-write them with good data if you
>> can; otherwise, best of luck" foundation.  Something like a DAX-aware
>> device mapper layer that duplicates data tagged with REQ_META so at
>> least we have a recovery path when a sector error lands in critical
>> filesystem-metadata.
>
> Filesytsem metadata is not the topic of discussion here - it's
> user data that throws an error on a DAX load/store that is the
> issue.

Which is not a new problem since volatile DRAM in the non-DAX case can
throw the exact same error.  The current recovery model there is crash
the kernel (without MCE recovery), or crash the application and hope
the kernel maps out the page or the application knows how to restart
after SIGBUS.  Memory mirroring is meant to make this a bit less
harsh, but there's no mechanism to make this available outside the
kernel.

>> However, anything we come up with to make NVDIMM
>> errors more survivable should be directly applicable to traditional
>> disk storage as well.
>
> I'm not sure it does. DAX implies that traditional block layer RAID
> infrastructure is not possible, nor are data CRCs, nor are any other
> sort of data transformations that are needed for redundancy at the
> device layers. Anything that relies on copying/modifying/stable data to
> provide redundancies needs to do such work at a place where it can
> stall userspace page faults.
>
> This is where pmem native filesystem designs like NOVA take over
> from traditional block based filesystems - they are designed around
> the ability to do atomic page-based operations for data protection
> and recovery operations. It is this mechanism that allows stable
> pages to be committed to permanent storage and as such, allow
> redundancy operations such as mirroring to be performed before
> operations are marked as "stable".
>
> I'm missing the bigger picture that is being aimed at here - what's the
> point of DAX if we have to turn it off if we want any sort of
> failure protection? What's the big plan for fully enabling DAX with
> robust error correction? Where is this all supposed to be leading
> to?
>

NOVA and other solutions are free and encouraged to do a coherent
bottoms-up rethink of error handling on top of persistent memory
devices, in the meantime applications can only expect the legacy
SIGBUS and -EIO mechanisms are available.  So I'm still 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 5:11 PM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
> [..]
> >> Maybe I missed something, but all these assumptions are already
> >> present for typical block devices, i.e. sectors may go bad and a write
> >> may make the sector usable again.
> >
> > The assumption we make about sectors going bad on SSDs or SRDs is
> > that the device is about to die and needs replacing ASAP.
> 
> Similar assumptions here.  Storage media is experiencing errors and
> past a certain threshold it may be time to decommission the device.
> 
> You can see definitions for SMART / media health commands from various
> vendors at these links, and yes, hopefully these are standardized /
> unified at some point down the road:
> 
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/NFIT_DSM_DDR4_NVDIMM-N_v84s.pdf
> https://msdn.microsoft.com/en-us/library/windows/hardware/mt604717(v=vs.85).aspx
> 
> 
> > Then
> > RAID takes care of the rebuild completely transparently. i.e.
> > handling and correcting bad sectors is typically done completely
> > transparently /below/ the filesytem like so:
> 
> Again, same for an NVDIMM.  Use the pmem block-device as a RAID-member device.

Which means we're not using DAX and so the existing storage model
applies. I understand how this works.

What I'm asking about the redundancy/error correction model /when
using DAX/ and a userspace DAX load/store throws the MCE.

> > And somehow all the error information from the hardware layer needs
> > to be propagated up to the application layer, along with all the
> > mapping information from the filesystem and block layers for the
> > application to make sense of the hardware reported errors.
> >
> > I see assumptions this this "just works" but we don't have any of
> > the relevant APIs or infrastructure to enable the application to do
> > the hardware error->file+offset namespace mapping (i.e. filesystem
> > reverse mapping for for file offsets and directory paths, and
> > reverse mapping for the the block layer remapping drivers).
> 
> If an application expects errors to be handled beneath the filesystem
> then it should forgo DAX and arrange for the NVDIMM devices to be
> RAIDed.

See above: I'm asking about the DAX-enabled error handling model,
not the traditional error handling model.

> Otherwise, if an application wants to use DAX then it might
> need to be prepared to handle media errors itself same as the
> un-RAIDed disk case.  Yes, at an administrative level without
> reverse-mapping support from a filesystem there's presently no way to
> ask "which files on this fs are impacted by media errors", and we're
> aware that reverse-mapping capabilities are nascent for current
> DAX-aware filesystems.

Precisely my point - suggestions are being proposed which assume
use of infrastructure that *does not exist yet* and has not been
discussed or documented. If we're expecting such infrastructure to
be implemented in the filesystems and block device drivers, then we
need to determine that the error model actually works first...

> The forward lookup path, as impractical as it
> is for large numbers of files, is available if an application wanted
> to know if a specific file was impacted.  We've discussed possibly
> extending fiemap() to return bad blocks in a file rather than
> consulting sysfs, or extending lseek() with something like SEEK_ERROR
> to return offsets of bad areas in a file.

Via what infrastructure will the filesystem use for finding out
whether a file has bad blocks in it? And if the file does have bad
blocks, what are you expecting the filesystem to do with that
information?

> > I haven't seen any design/documentation for infrastructure at the
> > application layer to handle redundant data and correctly
> > transparently so I don't have any idea what the technical
> > requirements this different IO stack places on filesystems may be.
> > Hence I'm asking for some kind of architecture/design documentation
> > that I can read to understand exactly what is being proposed here...
> 
> I think this is a discussion for a solution that would build on top of
> this basic "here are the errors, re-write them with good data if you
> can; otherwise, best of luck" foundation.  Something like a DAX-aware
> device mapper layer that duplicates data tagged with REQ_META so at
> least we have a recovery path when a sector error lands in critical
> filesystem-metadata. 

Filesytsem metadata is not the topic of discussion here - it's
user data that throws an error on a DAX load/store that is the
issue.

> However, anything we come up with to make NVDIMM
> errors more survivable should be directly applicable to traditional
> disk storage as well.

I'm not sure it does. DAX implies that traditional block layer RAID
infrastructure is not 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 06:45:08PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 5:11 PM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
> [..]
> >> Maybe I missed something, but all these assumptions are already
> >> present for typical block devices, i.e. sectors may go bad and a write
> >> may make the sector usable again.
> >
> > The assumption we make about sectors going bad on SSDs or SRDs is
> > that the device is about to die and needs replacing ASAP.
> 
> Similar assumptions here.  Storage media is experiencing errors and
> past a certain threshold it may be time to decommission the device.
> 
> You can see definitions for SMART / media health commands from various
> vendors at these links, and yes, hopefully these are standardized /
> unified at some point down the road:
> 
> http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/NFIT_DSM_DDR4_NVDIMM-N_v84s.pdf
> https://msdn.microsoft.com/en-us/library/windows/hardware/mt604717(v=vs.85).aspx
> 
> 
> > Then
> > RAID takes care of the rebuild completely transparently. i.e.
> > handling and correcting bad sectors is typically done completely
> > transparently /below/ the filesytem like so:
> 
> Again, same for an NVDIMM.  Use the pmem block-device as a RAID-member device.

Which means we're not using DAX and so the existing storage model
applies. I understand how this works.

What I'm asking about the redundancy/error correction model /when
using DAX/ and a userspace DAX load/store throws the MCE.

> > And somehow all the error information from the hardware layer needs
> > to be propagated up to the application layer, along with all the
> > mapping information from the filesystem and block layers for the
> > application to make sense of the hardware reported errors.
> >
> > I see assumptions this this "just works" but we don't have any of
> > the relevant APIs or infrastructure to enable the application to do
> > the hardware error->file+offset namespace mapping (i.e. filesystem
> > reverse mapping for for file offsets and directory paths, and
> > reverse mapping for the the block layer remapping drivers).
> 
> If an application expects errors to be handled beneath the filesystem
> then it should forgo DAX and arrange for the NVDIMM devices to be
> RAIDed.

See above: I'm asking about the DAX-enabled error handling model,
not the traditional error handling model.

> Otherwise, if an application wants to use DAX then it might
> need to be prepared to handle media errors itself same as the
> un-RAIDed disk case.  Yes, at an administrative level without
> reverse-mapping support from a filesystem there's presently no way to
> ask "which files on this fs are impacted by media errors", and we're
> aware that reverse-mapping capabilities are nascent for current
> DAX-aware filesystems.

Precisely my point - suggestions are being proposed which assume
use of infrastructure that *does not exist yet* and has not been
discussed or documented. If we're expecting such infrastructure to
be implemented in the filesystems and block device drivers, then we
need to determine that the error model actually works first...

> The forward lookup path, as impractical as it
> is for large numbers of files, is available if an application wanted
> to know if a specific file was impacted.  We've discussed possibly
> extending fiemap() to return bad blocks in a file rather than
> consulting sysfs, or extending lseek() with something like SEEK_ERROR
> to return offsets of bad areas in a file.

Via what infrastructure will the filesystem use for finding out
whether a file has bad blocks in it? And if the file does have bad
blocks, what are you expecting the filesystem to do with that
information?

> > I haven't seen any design/documentation for infrastructure at the
> > application layer to handle redundant data and correctly
> > transparently so I don't have any idea what the technical
> > requirements this different IO stack places on filesystems may be.
> > Hence I'm asking for some kind of architecture/design documentation
> > that I can read to understand exactly what is being proposed here...
> 
> I think this is a discussion for a solution that would build on top of
> this basic "here are the errors, re-write them with good data if you
> can; otherwise, best of luck" foundation.  Something like a DAX-aware
> device mapper layer that duplicates data tagged with REQ_META so at
> least we have a recovery path when a sector error lands in critical
> filesystem-metadata. 

Filesytsem metadata is not the topic of discussion here - it's
user data that throws an error on a DAX load/store that is the
issue.

> However, anything we come up with to make NVDIMM
> errors more survivable should be directly applicable to traditional
> disk storage as well.

I'm not sure it does. DAX implies that traditional block layer RAID
infrastructure is not possible, nor are 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 5:11 PM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
[..]
>> Maybe I missed something, but all these assumptions are already
>> present for typical block devices, i.e. sectors may go bad and a write
>> may make the sector usable again.
>
> The assumption we make about sectors going bad on SSDs or SRDs is
> that the device is about to die and needs replacing ASAP.

Similar assumptions here.  Storage media is experiencing errors and
past a certain threshold it may be time to decommission the device.

You can see definitions for SMART / media health commands from various
vendors at these links, and yes, hopefully these are standardized /
unified at some point down the road:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/NFIT_DSM_DDR4_NVDIMM-N_v84s.pdf
https://msdn.microsoft.com/en-us/library/windows/hardware/mt604717(v=vs.85).aspx


> Then
> RAID takes care of the rebuild completely transparently. i.e.
> handling and correcting bad sectors is typically done completely
> transparently /below/ the filesytem like so:

Again, same for an NVDIMM.  Use the pmem block-device as a RAID-member device.

>> This patch series is extending that
>> out to the DAX-mmap case, but it's the same principle of "write to
>> clear error" that we live with in the block-I/O path.  What
>> clarification are you looking for beyond that point?
>
> I'm asking for an actual design document that explains how moving
> all the redundancy and bad sector correction stuff from the LBA
> layer up into application space is supposed to work when
> applications have no clue about LBA mappings, nor tend to keep
> redundant data around. i.e. you're proposing this:

These patches are not proposing *new* / general infrastructure for
moving redundancy and bad sector correction handling to userspace.  If
an existing app is somehow dealing with raw (without RAID) device
errors on disk storage media today it should not need to change to
handle errors on an NVDIMM.  My expectation is that very few if any
applications handle this today and just fail in the presence of media
errors.

> Application
> Application data redundancy/correction
> Filesystem
> Block
> [LBA mapping/redundancy/correction driver e.g. md/dm]
> driver
> hardware
>
> And somehow all the error information from the hardware layer needs
> to be propagated up to the application layer, along with all the
> mapping information from the filesystem and block layers for the
> application to make sense of the hardware reported errors.
>
> I see assumptions this this "just works" but we don't have any of
> the relevant APIs or infrastructure to enable the application to do
> the hardware error->file+offset namespace mapping (i.e. filesystem
> reverse mapping for for file offsets and directory paths, and
> reverse mapping for the the block layer remapping drivers).

If an application expects errors to be handled beneath the filesystem
then it should forgo DAX and arrange for the NVDIMM devices to be
RAIDed.  Otherwise, if an application wants to use DAX then it might
need to be prepared to handle media errors itself same as the
un-RAIDed disk case.  Yes, at an administrative level without
reverse-mapping support from a filesystem there's presently no way to
ask "which files on this fs are impacted by media errors", and we're
aware that reverse-mapping capabilities are nascent for current
DAX-aware filesystems.  The forward lookup path, as impractical as it
is for large numbers of files, is available if an application wanted
to know if a specific file was impacted.  We've discussed possibly
extending fiemap() to return bad blocks in a file rather than
consulting sysfs, or extending lseek() with something like SEEK_ERROR
to return offsets of bad areas in a file.

> I haven't seen any design/documentation for infrastructure at the
> application layer to handle redundant data and correctly
> transparently so I don't have any idea what the technical
> requirements this different IO stack places on filesystems may be.
> Hence I'm asking for some kind of architecture/design documentation
> that I can read to understand exactly what is being proposed here...

I think this is a discussion for a solution that would build on top of
this basic "here are the errors, re-write them with good data if you
can; otherwise, best of luck" foundation.  Something like a DAX-aware
device mapper layer that duplicates data tagged with REQ_META so at
least we have a recovery path when a sector error lands in critical
filesystem-metadata.  However, anything we come up with to make NVDIMM
errors more survivable should be directly applicable to traditional
disk storage as well.  Along these lines we had a BoF session at Vault
where drive vendors we're wondering if the sysfs bad sectors list
could help software recover from the loss of a disk-head, or other
errors 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 5:11 PM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
[..]
>> Maybe I missed something, but all these assumptions are already
>> present for typical block devices, i.e. sectors may go bad and a write
>> may make the sector usable again.
>
> The assumption we make about sectors going bad on SSDs or SRDs is
> that the device is about to die and needs replacing ASAP.

Similar assumptions here.  Storage media is experiencing errors and
past a certain threshold it may be time to decommission the device.

You can see definitions for SMART / media health commands from various
vendors at these links, and yes, hopefully these are standardized /
unified at some point down the road:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
https://github.com/HewlettPackard/hpe-nvm/blob/master/Documentation/NFIT_DSM_DDR4_NVDIMM-N_v84s.pdf
https://msdn.microsoft.com/en-us/library/windows/hardware/mt604717(v=vs.85).aspx


> Then
> RAID takes care of the rebuild completely transparently. i.e.
> handling and correcting bad sectors is typically done completely
> transparently /below/ the filesytem like so:

Again, same for an NVDIMM.  Use the pmem block-device as a RAID-member device.

>> This patch series is extending that
>> out to the DAX-mmap case, but it's the same principle of "write to
>> clear error" that we live with in the block-I/O path.  What
>> clarification are you looking for beyond that point?
>
> I'm asking for an actual design document that explains how moving
> all the redundancy and bad sector correction stuff from the LBA
> layer up into application space is supposed to work when
> applications have no clue about LBA mappings, nor tend to keep
> redundant data around. i.e. you're proposing this:

These patches are not proposing *new* / general infrastructure for
moving redundancy and bad sector correction handling to userspace.  If
an existing app is somehow dealing with raw (without RAID) device
errors on disk storage media today it should not need to change to
handle errors on an NVDIMM.  My expectation is that very few if any
applications handle this today and just fail in the presence of media
errors.

> Application
> Application data redundancy/correction
> Filesystem
> Block
> [LBA mapping/redundancy/correction driver e.g. md/dm]
> driver
> hardware
>
> And somehow all the error information from the hardware layer needs
> to be propagated up to the application layer, along with all the
> mapping information from the filesystem and block layers for the
> application to make sense of the hardware reported errors.
>
> I see assumptions this this "just works" but we don't have any of
> the relevant APIs or infrastructure to enable the application to do
> the hardware error->file+offset namespace mapping (i.e. filesystem
> reverse mapping for for file offsets and directory paths, and
> reverse mapping for the the block layer remapping drivers).

If an application expects errors to be handled beneath the filesystem
then it should forgo DAX and arrange for the NVDIMM devices to be
RAIDed.  Otherwise, if an application wants to use DAX then it might
need to be prepared to handle media errors itself same as the
un-RAIDed disk case.  Yes, at an administrative level without
reverse-mapping support from a filesystem there's presently no way to
ask "which files on this fs are impacted by media errors", and we're
aware that reverse-mapping capabilities are nascent for current
DAX-aware filesystems.  The forward lookup path, as impractical as it
is for large numbers of files, is available if an application wanted
to know if a specific file was impacted.  We've discussed possibly
extending fiemap() to return bad blocks in a file rather than
consulting sysfs, or extending lseek() with something like SEEK_ERROR
to return offsets of bad areas in a file.

> I haven't seen any design/documentation for infrastructure at the
> application layer to handle redundant data and correctly
> transparently so I don't have any idea what the technical
> requirements this different IO stack places on filesystems may be.
> Hence I'm asking for some kind of architecture/design documentation
> that I can read to understand exactly what is being proposed here...

I think this is a discussion for a solution that would build on top of
this basic "here are the errors, re-write them with good data if you
can; otherwise, best of luck" foundation.  Something like a DAX-aware
device mapper layer that duplicates data tagged with REQ_META so at
least we have a recovery path when a sector error lands in critical
filesystem-metadata.  However, anything we come up with to make NVDIMM
errors more survivable should be directly applicable to traditional
disk storage as well.  Along these lines we had a BoF session at Vault
where drive vendors we're wondering if the sysfs bad sectors list
could help software recover from the loss of a disk-head, or other
errors that only take down 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> > 
> <>
> 
> > > 
> > > - It checks badblocks and discovers it's files have lost data
> > Lots of hand-waving here. How does the application map a bad
> > "sector" to a file without scanning the entire filesystem to find
> > the owner of the bad sector?
> 
> Yes this was hand-wavey, but we talked about this a bit at LSF..
> The idea is that a per-block-device badblocks list is available at
> /sys/block//badblocks. The application (or a suitable yet-to-be-
> written library function) does a fiemap to figure out the sectors its
> files are using, and correlates the two lists.
> We can also look into providing an easier-to-use interface from the
> kernel, in the form of an fiemap flag to report only the bad sectors, or
> a SEEK_BAD flag..
> The application doesn't have to scan the entire filesystem, but
> presumably it knows what files it 'owns', and does a fiemap for those.

You're assuming that only the DAX aware application accesses it's
files.  users, backup programs, data replicators, fileystem
re-organisers (e.g.  defragmenters) etc all may access the files and
they may throw errors. What then?

> > > - It write()s those sectors (possibly converted to file offsets
> > > using
> > > fiemap)
> > >     * This triggers the fallback path, but if the application is
> > > doing
> > > this level of recovery, it will know the sector is bad, and write
> > > the
> > > entire sector
> > Where does the application find the data that was lost to be able to
> > rewrite it?
> 
> The data that was lost is gone -- this assumes the application has some
> ability to recover using a journal/log or other redundancy - yes, at the
> application layer. If it doesn't have this sort of capability, the only
> option is to restore files from a backup/mirror.

So the architecture has a built in assumption that only userspace
can handle data loss?

What about filesytsems like NOVA, that use log structured design to
provide DAX w/ update atomicity and can potentially also provide
redundancy/repair through the same mechanisms? Won't pmem native
filesystems with built in data protection features like this remove
the need for adding all this to userspace applications?

If so, shouldn't that be the focus of development rahter than
placing the burden on userspace apps to handle storage repair
situations?

> > > - Or it replaces the entire file from backup also using write() (not
> > > mmap+stores)
> > >     * This just frees the fs block, and the next time the block is
> > > reallocated by the fs, it will likely be zeroed first, and that will
> > > be
> > > done through the driver and will clear errors
> > There's an implicit assumption that applications will keep redundant
> > copies of their data at the /application layer/ and be able to
> > automatically repair it? And then there's the implicit assumption
> > that it will unlink and free the entire file before writing a new
> > copy, and that then assumes the the filesystem will zero blocks if
> > they get reused to clear errors on that LBA sector mapping before
> > they are accessible again to userspace..
> > 
> > It seems to me that there are a number of assumptions being made
> > across multiple layers here. Maybe I've missed something - can you
> > point me to the design/architecture description so I can see how
> > "app does data recovery itself" dance is supposed to work?
> 
> There isn't a document other than the flow in my head :) - but maybe I
> could write one up..
> I wasn't thinking the application itself maintains and restores from
> backup copy of the file.. The application hits either a SIGBUS or EIO
> depending on how it accesses the data, and crashes or raises some alarm.
> The recovery is then done out-of-band, by a sysadmin or such (i.e.
> delete the file, replace with a known good copy, restart application).
> 
> To summarize, the two cases we want to handle are:
> 1. Application has inbuilt recovery:
>   - hits badblock
>   - figures out it is able to recover the data
>   - handles SIGBUS or EIO
>   - does a (sector aligned) write() to restore the data

The "figures out" step here is where >95% of the work we'd have to
do is. And that's in filesystem and block layer code, not
userspace, and userspace can't do that work in a signal handler.
And it  can still fall down to the second case when the application
doesn't have another copy of the data somewhere.

FWIW, we don't have a DAX enabled filesystem that can do
reverse block mapping, so we're a year or two away from this being a
workable production solution from the filesystem perspective. And
AFAICT, it's not even on the roadmap for dm/md layers.

> 2. Application doesn't have any inbuilt recovery mechanism
>   - hits badblock
>   - gets SIGBUS (or EIO) and crashes
>   - Sysadmin restores file from backup

Which is no different to an existing non-DAX application getting an
EIO/sigbus from 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 11:53:13PM +, Verma, Vishal L wrote:
> On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> > 
> <>
> 
> > > 
> > > - It checks badblocks and discovers it's files have lost data
> > Lots of hand-waving here. How does the application map a bad
> > "sector" to a file without scanning the entire filesystem to find
> > the owner of the bad sector?
> 
> Yes this was hand-wavey, but we talked about this a bit at LSF..
> The idea is that a per-block-device badblocks list is available at
> /sys/block//badblocks. The application (or a suitable yet-to-be-
> written library function) does a fiemap to figure out the sectors its
> files are using, and correlates the two lists.
> We can also look into providing an easier-to-use interface from the
> kernel, in the form of an fiemap flag to report only the bad sectors, or
> a SEEK_BAD flag..
> The application doesn't have to scan the entire filesystem, but
> presumably it knows what files it 'owns', and does a fiemap for those.

You're assuming that only the DAX aware application accesses it's
files.  users, backup programs, data replicators, fileystem
re-organisers (e.g.  defragmenters) etc all may access the files and
they may throw errors. What then?

> > > - It write()s those sectors (possibly converted to file offsets
> > > using
> > > fiemap)
> > >     * This triggers the fallback path, but if the application is
> > > doing
> > > this level of recovery, it will know the sector is bad, and write
> > > the
> > > entire sector
> > Where does the application find the data that was lost to be able to
> > rewrite it?
> 
> The data that was lost is gone -- this assumes the application has some
> ability to recover using a journal/log or other redundancy - yes, at the
> application layer. If it doesn't have this sort of capability, the only
> option is to restore files from a backup/mirror.

So the architecture has a built in assumption that only userspace
can handle data loss?

What about filesytsems like NOVA, that use log structured design to
provide DAX w/ update atomicity and can potentially also provide
redundancy/repair through the same mechanisms? Won't pmem native
filesystems with built in data protection features like this remove
the need for adding all this to userspace applications?

If so, shouldn't that be the focus of development rahter than
placing the burden on userspace apps to handle storage repair
situations?

> > > - Or it replaces the entire file from backup also using write() (not
> > > mmap+stores)
> > >     * This just frees the fs block, and the next time the block is
> > > reallocated by the fs, it will likely be zeroed first, and that will
> > > be
> > > done through the driver and will clear errors
> > There's an implicit assumption that applications will keep redundant
> > copies of their data at the /application layer/ and be able to
> > automatically repair it? And then there's the implicit assumption
> > that it will unlink and free the entire file before writing a new
> > copy, and that then assumes the the filesystem will zero blocks if
> > they get reused to clear errors on that LBA sector mapping before
> > they are accessible again to userspace..
> > 
> > It seems to me that there are a number of assumptions being made
> > across multiple layers here. Maybe I've missed something - can you
> > point me to the design/architecture description so I can see how
> > "app does data recovery itself" dance is supposed to work?
> 
> There isn't a document other than the flow in my head :) - but maybe I
> could write one up..
> I wasn't thinking the application itself maintains and restores from
> backup copy of the file.. The application hits either a SIGBUS or EIO
> depending on how it accesses the data, and crashes or raises some alarm.
> The recovery is then done out-of-band, by a sysadmin or such (i.e.
> delete the file, replace with a known good copy, restart application).
> 
> To summarize, the two cases we want to handle are:
> 1. Application has inbuilt recovery:
>   - hits badblock
>   - figures out it is able to recover the data
>   - handles SIGBUS or EIO
>   - does a (sector aligned) write() to restore the data

The "figures out" step here is where >95% of the work we'd have to
do is. And that's in filesystem and block layer code, not
userspace, and userspace can't do that work in a signal handler.
And it  can still fall down to the second case when the application
doesn't have another copy of the data somewhere.

FWIW, we don't have a DAX enabled filesystem that can do
reverse block mapping, so we're a year or two away from this being a
workable production solution from the filesystem perspective. And
AFAICT, it's not even on the roadmap for dm/md layers.

> 2. Application doesn't have any inbuilt recovery mechanism
>   - hits badblock
>   - gets SIGBUS (or EIO) and crashes
>   - Sysadmin restores file from backup

Which is no different to an existing non-DAX application getting an
EIO/sigbus from 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 4:25 PM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> >> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> >> > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> >> > >
> >> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> >> > > due
> >> > > to some allocation failing, and I thought we should return the
> >> > > original
> >> > > -EIO in such cases so that the application doesn't lose the
> >> > > information
> >> > > that the bad block is actually causing the error.
> >> > EINVAL is a concern here.  Not due to the right error reported, but
> >> > because it means your current scheme is fundamentally broken - we
> >> > need to support I/O at any alignment for DAX I/O, and not fail due to
> >> > alignbment concernes for a highly specific degraded case.
> >> >
> >> > I think this whole series need to go back to the drawing board as I
> >> > don't think it can actually rely on using direct I/O as the EIO
> >> > fallback.
> >> >
> >> Agreed that DAX I/O can happen with any size/alignment, but how else do
> >> we send an IO through the driver without alignment restrictions? Also,
> >> the granularity at which we store badblocks is 512B sectors, so it
> >> seems natural that to clear such a sector, you'd expect to send a write
> >> to the whole sector.
> >>
> >> The expected usage flow is:
> >>
> >> - Application hits EIO doing dax_IO or load/store io
> >>
> >> - It checks badblocks and discovers it's files have lost data
> >
> > Lots of hand-waving here. How does the application map a bad
> > "sector" to a file without scanning the entire filesystem to find
> > the owner of the bad sector?
> >
> >> - It write()s those sectors (possibly converted to file offsets using
> >> fiemap)
> >> * This triggers the fallback path, but if the application is doing
> >> this level of recovery, it will know the sector is bad, and write the
> >> entire sector
> >
> > Where does the application find the data that was lost to be able to
> > rewrite it?
> >
> >> - Or it replaces the entire file from backup also using write() (not
> >> mmap+stores)
> >> * This just frees the fs block, and the next time the block is
> >> reallocated by the fs, it will likely be zeroed first, and that will be
> >> done through the driver and will clear errors
> >
> > There's an implicit assumption that applications will keep redundant
> > copies of their data at the /application layer/ and be able to
> > automatically repair it? And then there's the implicit assumption
> > that it will unlink and free the entire file before writing a new
> > copy, and that then assumes the the filesystem will zero blocks if
> > they get reused to clear errors on that LBA sector mapping before
> > they are accessible again to userspace..
> >
> > It seems to me that there are a number of assumptions being made
> > across multiple layers here. Maybe I've missed something - can you
> > point me to the design/architecture description so I can see how
> > "app does data recovery itself" dance is supposed to work?
> >
> 
> Maybe I missed something, but all these assumptions are already
> present for typical block devices, i.e. sectors may go bad and a write
> may make the sector usable again.

The assumption we make about sectors going bad on SSDs or SRDs is
that the device is about to die and needs replacing ASAP. Then
RAID takes care of the rebuild completely transparently. i.e.
handling and correcting bad sectors is typically done completely
transparently /below/ the filesytem like so:

Application
Filesystem
block
[LBA mapping/redundancy/correction driver e.g. md/dm]
driver
hardware
[LBA redundancy/correction e.g h/w RAID]

In the case of filesystems with their own RAID/redundancy code (e.g.
btrfs), then it looks like this:

Application
Filesystem
mapping/redundancy/correction driver
block
driver
hardware
[LBA redundancy/correction e.g h/w RAID]

> This patch series is extending that
> out to the DAX-mmap case, but it's the same principle of "write to
> clear error" that we live with in the block-I/O path.  What
> clarification are you looking for beyond that point?

I'm asking for an actual design document that explains how moving
all the redundancy and bad sector correction stuff from the LBA
layer up into application space is supposed to work when
applications have no clue about LBA mappings, nor tend to keep
redundant data around. i.e. you're proposing this:

Application
Application data redundancy/correction
Filesystem
Block
[LBA mapping/redundancy/correction driver e.g. md/dm]
driver
hardware

And somehow all the error information from the hardware layer needs
to be propagated up to the application layer, along with all the
mapping information from the filesystem and block layers for the
application to make sense of the hardware reported errors.

I 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 04:43:14PM -0700, Dan Williams wrote:
> On Mon, Apr 25, 2016 at 4:25 PM, Dave Chinner  wrote:
> > On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> >> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> >> > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> >> > >
> >> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> >> > > due
> >> > > to some allocation failing, and I thought we should return the
> >> > > original
> >> > > -EIO in such cases so that the application doesn't lose the
> >> > > information
> >> > > that the bad block is actually causing the error.
> >> > EINVAL is a concern here.  Not due to the right error reported, but
> >> > because it means your current scheme is fundamentally broken - we
> >> > need to support I/O at any alignment for DAX I/O, and not fail due to
> >> > alignbment concernes for a highly specific degraded case.
> >> >
> >> > I think this whole series need to go back to the drawing board as I
> >> > don't think it can actually rely on using direct I/O as the EIO
> >> > fallback.
> >> >
> >> Agreed that DAX I/O can happen with any size/alignment, but how else do
> >> we send an IO through the driver without alignment restrictions? Also,
> >> the granularity at which we store badblocks is 512B sectors, so it
> >> seems natural that to clear such a sector, you'd expect to send a write
> >> to the whole sector.
> >>
> >> The expected usage flow is:
> >>
> >> - Application hits EIO doing dax_IO or load/store io
> >>
> >> - It checks badblocks and discovers it's files have lost data
> >
> > Lots of hand-waving here. How does the application map a bad
> > "sector" to a file without scanning the entire filesystem to find
> > the owner of the bad sector?
> >
> >> - It write()s those sectors (possibly converted to file offsets using
> >> fiemap)
> >> * This triggers the fallback path, but if the application is doing
> >> this level of recovery, it will know the sector is bad, and write the
> >> entire sector
> >
> > Where does the application find the data that was lost to be able to
> > rewrite it?
> >
> >> - Or it replaces the entire file from backup also using write() (not
> >> mmap+stores)
> >> * This just frees the fs block, and the next time the block is
> >> reallocated by the fs, it will likely be zeroed first, and that will be
> >> done through the driver and will clear errors
> >
> > There's an implicit assumption that applications will keep redundant
> > copies of their data at the /application layer/ and be able to
> > automatically repair it? And then there's the implicit assumption
> > that it will unlink and free the entire file before writing a new
> > copy, and that then assumes the the filesystem will zero blocks if
> > they get reused to clear errors on that LBA sector mapping before
> > they are accessible again to userspace..
> >
> > It seems to me that there are a number of assumptions being made
> > across multiple layers here. Maybe I've missed something - can you
> > point me to the design/architecture description so I can see how
> > "app does data recovery itself" dance is supposed to work?
> >
> 
> Maybe I missed something, but all these assumptions are already
> present for typical block devices, i.e. sectors may go bad and a write
> may make the sector usable again.

The assumption we make about sectors going bad on SSDs or SRDs is
that the device is about to die and needs replacing ASAP. Then
RAID takes care of the rebuild completely transparently. i.e.
handling and correcting bad sectors is typically done completely
transparently /below/ the filesytem like so:

Application
Filesystem
block
[LBA mapping/redundancy/correction driver e.g. md/dm]
driver
hardware
[LBA redundancy/correction e.g h/w RAID]

In the case of filesystems with their own RAID/redundancy code (e.g.
btrfs), then it looks like this:

Application
Filesystem
mapping/redundancy/correction driver
block
driver
hardware
[LBA redundancy/correction e.g h/w RAID]

> This patch series is extending that
> out to the DAX-mmap case, but it's the same principle of "write to
> clear error" that we live with in the block-I/O path.  What
> clarification are you looking for beyond that point?

I'm asking for an actual design document that explains how moving
all the redundancy and bad sector correction stuff from the LBA
layer up into application space is supposed to work when
applications have no clue about LBA mappings, nor tend to keep
redundant data around. i.e. you're proposing this:

Application
Application data redundancy/correction
Filesystem
Block
[LBA mapping/redundancy/correction driver e.g. md/dm]
driver
hardware

And somehow all the error information from the hardware layer needs
to be propagated up to the application layer, along with all the
mapping information from the filesystem and block layers for the
application to make sense of the hardware reported errors.

I see assumptions this 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Verma, Vishal L
On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> 
<>

> > 
> > - It checks badblocks and discovers it's files have lost data
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?

Yes this was hand-wavey, but we talked about this a bit at LSF..
The idea is that a per-block-device badblocks list is available at
/sys/block//badblocks. The application (or a suitable yet-to-be-
written library function) does a fiemap to figure out the sectors its
files are using, and correlates the two lists.
We can also look into providing an easier-to-use interface from the
kernel, in the form of an fiemap flag to report only the bad sectors, or
a SEEK_BAD flag..
The application doesn't have to scan the entire filesystem, but
presumably it knows what files it 'owns', and does a fiemap for those.

> 
> > 
> > - It write()s those sectors (possibly converted to file offsets
> > using
> > fiemap)
> >     * This triggers the fallback path, but if the application is
> > doing
> > this level of recovery, it will know the sector is bad, and write
> > the
> > entire sector
> Where does the application find the data that was lost to be able to
> rewrite it?

The data that was lost is gone -- this assumes the application has some
ability to recover using a journal/log or other redundancy - yes, at the
application layer. If it doesn't have this sort of capability, the only
option is to restore files from a backup/mirror.

> 
> > 
> > - Or it replaces the entire file from backup also using write() (not
> > mmap+stores)
> >     * This just frees the fs block, and the next time the block is
> > reallocated by the fs, it will likely be zeroed first, and that will
> > be
> > done through the driver and will clear errors
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
> 
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?

There isn't a document other than the flow in my head :) - but maybe I
could write one up..
I wasn't thinking the application itself maintains and restores from
backup copy of the file.. The application hits either a SIGBUS or EIO
depending on how it accesses the data, and crashes or raises some alarm.
The recovery is then done out-of-band, by a sysadmin or such (i.e.
delete the file, replace with a known good copy, restart application).

To summarize, the two cases we want to handle are:
1. Application has inbuilt recovery:
  - hits badblock
  - figures out it is able to recover the data
  - handles SIGBUS or EIO
  - does a (sector aligned) write() to restore the data
2. Application doesn't have any inbuilt recovery mechanism
  - hits badblock
  - gets SIGBUS (or EIO) and crashes
  - Sysadmin restores file from backup

Case 1 is handled by either a fallback to direct_IO from dax_do_io, or
always _actually_ doing direct_IO when we're opened with O_DIRECT in
spite of dax (what Dan suggested). Currently if we're mounted with dax,
all IO O_DIRECT or otherwise will go through dax_do_io.
Case 2 is handled by patch 4 of the series:
    dax: use sb_issue_zerout instead of calling dax_clear_sectors

> 
> Cheers,
> 
> Dave.

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Verma, Vishal L
On Tue, 2016-04-26 at 09:25 +1000, Dave Chinner wrote:
> 
<>

> > 
> > - It checks badblocks and discovers it's files have lost data
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?

Yes this was hand-wavey, but we talked about this a bit at LSF..
The idea is that a per-block-device badblocks list is available at
/sys/block//badblocks. The application (or a suitable yet-to-be-
written library function) does a fiemap to figure out the sectors its
files are using, and correlates the two lists.
We can also look into providing an easier-to-use interface from the
kernel, in the form of an fiemap flag to report only the bad sectors, or
a SEEK_BAD flag..
The application doesn't have to scan the entire filesystem, but
presumably it knows what files it 'owns', and does a fiemap for those.

> 
> > 
> > - It write()s those sectors (possibly converted to file offsets
> > using
> > fiemap)
> >     * This triggers the fallback path, but if the application is
> > doing
> > this level of recovery, it will know the sector is bad, and write
> > the
> > entire sector
> Where does the application find the data that was lost to be able to
> rewrite it?

The data that was lost is gone -- this assumes the application has some
ability to recover using a journal/log or other redundancy - yes, at the
application layer. If it doesn't have this sort of capability, the only
option is to restore files from a backup/mirror.

> 
> > 
> > - Or it replaces the entire file from backup also using write() (not
> > mmap+stores)
> >     * This just frees the fs block, and the next time the block is
> > reallocated by the fs, it will likely be zeroed first, and that will
> > be
> > done through the driver and will clear errors
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
> 
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?

There isn't a document other than the flow in my head :) - but maybe I
could write one up..
I wasn't thinking the application itself maintains and restores from
backup copy of the file.. The application hits either a SIGBUS or EIO
depending on how it accesses the data, and crashes or raises some alarm.
The recovery is then done out-of-band, by a sysadmin or such (i.e.
delete the file, replace with a known good copy, restart application).

To summarize, the two cases we want to handle are:
1. Application has inbuilt recovery:
  - hits badblock
  - figures out it is able to recover the data
  - handles SIGBUS or EIO
  - does a (sector aligned) write() to restore the data
2. Application doesn't have any inbuilt recovery mechanism
  - hits badblock
  - gets SIGBUS (or EIO) and crashes
  - Sysadmin restores file from backup

Case 1 is handled by either a fallback to direct_IO from dax_do_io, or
always _actually_ doing direct_IO when we're opened with O_DIRECT in
spite of dax (what Dan suggested). Currently if we're mounted with dax,
all IO O_DIRECT or otherwise will go through dax_do_io.
Case 2 is handled by patch 4 of the series:
    dax: use sb_issue_zerout instead of calling dax_clear_sectors

> 
> Cheers,
> 
> Dave.

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 4:25 PM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
>> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
>> > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
>> > >
>> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
>> > > due
>> > > to some allocation failing, and I thought we should return the
>> > > original
>> > > -EIO in such cases so that the application doesn't lose the
>> > > information
>> > > that the bad block is actually causing the error.
>> > EINVAL is a concern here.  Not due to the right error reported, but
>> > because it means your current scheme is fundamentally broken - we
>> > need to support I/O at any alignment for DAX I/O, and not fail due to
>> > alignbment concernes for a highly specific degraded case.
>> >
>> > I think this whole series need to go back to the drawing board as I
>> > don't think it can actually rely on using direct I/O as the EIO
>> > fallback.
>> >
>> Agreed that DAX I/O can happen with any size/alignment, but how else do
>> we send an IO through the driver without alignment restrictions? Also,
>> the granularity at which we store badblocks is 512B sectors, so it
>> seems natural that to clear such a sector, you'd expect to send a write
>> to the whole sector.
>>
>> The expected usage flow is:
>>
>> - Application hits EIO doing dax_IO or load/store io
>>
>> - It checks badblocks and discovers it's files have lost data
>
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?
>
>> - It write()s those sectors (possibly converted to file offsets using
>> fiemap)
>> * This triggers the fallback path, but if the application is doing
>> this level of recovery, it will know the sector is bad, and write the
>> entire sector
>
> Where does the application find the data that was lost to be able to
> rewrite it?
>
>> - Or it replaces the entire file from backup also using write() (not
>> mmap+stores)
>> * This just frees the fs block, and the next time the block is
>> reallocated by the fs, it will likely be zeroed first, and that will be
>> done through the driver and will clear errors
>
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
>
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
>

Maybe I missed something, but all these assumptions are already
present for typical block devices, i.e. sectors may go bad and a write
may make the sector usable again.  This patch series is extending that
out to the DAX-mmap case, but it's the same principle of "write to
clear error" that we live with in the block-I/O path.  What
clarification are you looking for beyond that point?


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 4:25 PM, Dave Chinner  wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
>> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
>> > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
>> > >
>> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
>> > > due
>> > > to some allocation failing, and I thought we should return the
>> > > original
>> > > -EIO in such cases so that the application doesn't lose the
>> > > information
>> > > that the bad block is actually causing the error.
>> > EINVAL is a concern here.  Not due to the right error reported, but
>> > because it means your current scheme is fundamentally broken - we
>> > need to support I/O at any alignment for DAX I/O, and not fail due to
>> > alignbment concernes for a highly specific degraded case.
>> >
>> > I think this whole series need to go back to the drawing board as I
>> > don't think it can actually rely on using direct I/O as the EIO
>> > fallback.
>> >
>> Agreed that DAX I/O can happen with any size/alignment, but how else do
>> we send an IO through the driver without alignment restrictions? Also,
>> the granularity at which we store badblocks is 512B sectors, so it
>> seems natural that to clear such a sector, you'd expect to send a write
>> to the whole sector.
>>
>> The expected usage flow is:
>>
>> - Application hits EIO doing dax_IO or load/store io
>>
>> - It checks badblocks and discovers it's files have lost data
>
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?
>
>> - It write()s those sectors (possibly converted to file offsets using
>> fiemap)
>> * This triggers the fallback path, but if the application is doing
>> this level of recovery, it will know the sector is bad, and write the
>> entire sector
>
> Where does the application find the data that was lost to be able to
> rewrite it?
>
>> - Or it replaces the entire file from backup also using write() (not
>> mmap+stores)
>> * This just frees the fs block, and the next time the block is
>> reallocated by the fs, it will likely be zeroed first, and that will be
>> done through the driver and will clear errors
>
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
>
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
>

Maybe I missed something, but all these assumptions are already
present for typical block devices, i.e. sectors may go bad and a write
may make the sector usable again.  This patch series is extending that
out to the DAX-mmap case, but it's the same principle of "write to
clear error" that we live with in the block-I/O path.  What
clarification are you looking for beyond that point?


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Darrick J. Wong
On Tue, Apr 26, 2016 at 09:25:52AM +1000, Dave Chinner wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> > On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> > > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> > > > 
> > > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > > > due
> > > > to some allocation failing, and I thought we should return the
> > > > original
> > > > -EIO in such cases so that the application doesn't lose the
> > > > information
> > > > that the bad block is actually causing the error.
> > > EINVAL is a concern here.  Not due to the right error reported, but
> > > because it means your current scheme is fundamentally broken - we
> > > need to support I/O at any alignment for DAX I/O, and not fail due to
> > > alignbment concernes for a highly specific degraded case.
> > > 
> > > I think this whole series need to go back to the drawing board as I
> > > don't think it can actually rely on using direct I/O as the EIO
> > > fallback.
> > > 
> > Agreed that DAX I/O can happen with any size/alignment, but how else do
> > we send an IO through the driver without alignment restrictions? Also,
> > the granularity at which we store badblocks is 512B sectors, so it
> > seems natural that to clear such a sector, you'd expect to send a write
> > to the whole sector.
> > 
> > The expected usage flow is:
> > 
> > - Application hits EIO doing dax_IO or load/store io
> > 
> > - It checks badblocks and discovers it's files have lost data
> 
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?

FWIW there was some discussion @ LSF about using (XFS) rmap to figure out
which parts of a file (on XFS) have gone bad.  Chris Mason said that he'd
like to collaborate on having a common getfsmap ioctl between btrfs and
XFS since they have a backref index that could be hooked up to it for them.

Obviously the app still has to coordinate stopping file IO and calling
GETFSMAP since the fs won't do that on its own.  There's also the question
of how to handle LBA translation if there's other stuff like dm in the way.
I don't think device-mapper or md do reverse mapping, so things get murky
from here.

Guess I should get on pushing out a getfsmap patch for review. :)

--D

(/me doesn't have answers to any of your other questions.)

> > - It write()s those sectors (possibly converted to file offsets using
> > fiemap)
> >     * This triggers the fallback path, but if the application is doing
> > this level of recovery, it will know the sector is bad, and write the
> > entire sector
> 
> Where does the application find the data that was lost to be able to
> rewrite it?
> 
> > - Or it replaces the entire file from backup also using write() (not
> > mmap+stores)
> >     * This just frees the fs block, and the next time the block is
> > reallocated by the fs, it will likely be zeroed first, and that will be
> > done through the driver and will clear errors
> 
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
> 
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> ___
> xfs mailing list
> x...@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Darrick J. Wong
On Tue, Apr 26, 2016 at 09:25:52AM +1000, Dave Chinner wrote:
> On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> > On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> > > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> > > > 
> > > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > > > due
> > > > to some allocation failing, and I thought we should return the
> > > > original
> > > > -EIO in such cases so that the application doesn't lose the
> > > > information
> > > > that the bad block is actually causing the error.
> > > EINVAL is a concern here.  Not due to the right error reported, but
> > > because it means your current scheme is fundamentally broken - we
> > > need to support I/O at any alignment for DAX I/O, and not fail due to
> > > alignbment concernes for a highly specific degraded case.
> > > 
> > > I think this whole series need to go back to the drawing board as I
> > > don't think it can actually rely on using direct I/O as the EIO
> > > fallback.
> > > 
> > Agreed that DAX I/O can happen with any size/alignment, but how else do
> > we send an IO through the driver without alignment restrictions? Also,
> > the granularity at which we store badblocks is 512B sectors, so it
> > seems natural that to clear such a sector, you'd expect to send a write
> > to the whole sector.
> > 
> > The expected usage flow is:
> > 
> > - Application hits EIO doing dax_IO or load/store io
> > 
> > - It checks badblocks and discovers it's files have lost data
> 
> Lots of hand-waving here. How does the application map a bad
> "sector" to a file without scanning the entire filesystem to find
> the owner of the bad sector?

FWIW there was some discussion @ LSF about using (XFS) rmap to figure out
which parts of a file (on XFS) have gone bad.  Chris Mason said that he'd
like to collaborate on having a common getfsmap ioctl between btrfs and
XFS since they have a backref index that could be hooked up to it for them.

Obviously the app still has to coordinate stopping file IO and calling
GETFSMAP since the fs won't do that on its own.  There's also the question
of how to handle LBA translation if there's other stuff like dm in the way.
I don't think device-mapper or md do reverse mapping, so things get murky
from here.

Guess I should get on pushing out a getfsmap patch for review. :)

--D

(/me doesn't have answers to any of your other questions.)

> > - It write()s those sectors (possibly converted to file offsets using
> > fiemap)
> >     * This triggers the fallback path, but if the application is doing
> > this level of recovery, it will know the sector is bad, and write the
> > entire sector
> 
> Where does the application find the data that was lost to be able to
> rewrite it?
> 
> > - Or it replaces the entire file from backup also using write() (not
> > mmap+stores)
> >     * This just frees the fs block, and the next time the block is
> > reallocated by the fs, it will likely be zeroed first, and that will be
> > done through the driver and will clear errors
> 
> There's an implicit assumption that applications will keep redundant
> copies of their data at the /application layer/ and be able to
> automatically repair it? And then there's the implicit assumption
> that it will unlink and free the entire file before writing a new
> copy, and that then assumes the the filesystem will zero blocks if
> they get reused to clear errors on that LBA sector mapping before
> they are accessible again to userspace..
> 
> It seems to me that there are a number of assumptions being made
> across multiple layers here. Maybe I've missed something - can you
> point me to the design/architecture description so I can see how
> "app does data recovery itself" dance is supposed to work?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> ___
> xfs mailing list
> x...@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> > > 
> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > > due
> > > to some allocation failing, and I thought we should return the
> > > original
> > > -EIO in such cases so that the application doesn't lose the
> > > information
> > > that the bad block is actually causing the error.
> > EINVAL is a concern here.  Not due to the right error reported, but
> > because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignbment concernes for a highly specific degraded case.
> > 
> > I think this whole series need to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
> > 
> Agreed that DAX I/O can happen with any size/alignment, but how else do
> we send an IO through the driver without alignment restrictions? Also,
> the granularity at which we store badblocks is 512B sectors, so it
> seems natural that to clear such a sector, you'd expect to send a write
> to the whole sector.
> 
> The expected usage flow is:
> 
> - Application hits EIO doing dax_IO or load/store io
> 
> - It checks badblocks and discovers it's files have lost data

Lots of hand-waving here. How does the application map a bad
"sector" to a file without scanning the entire filesystem to find
the owner of the bad sector?

> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
>     * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector

Where does the application find the data that was lost to be able to
rewrite it?

> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
>     * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors

There's an implicit assumption that applications will keep redundant
copies of their data at the /application layer/ and be able to
automatically repair it? And then there's the implicit assumption
that it will unlink and free the entire file before writing a new
copy, and that then assumes the the filesystem will zero blocks if
they get reused to clear errors on that LBA sector mapping before
they are accessible again to userspace..

It seems to me that there are a number of assumptions being made
across multiple layers here. Maybe I've missed something - can you
point me to the design/architecture description so I can see how
"app does data recovery itself" dance is supposed to work?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dave Chinner
On Mon, Apr 25, 2016 at 05:14:36PM +, Verma, Vishal L wrote:
> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> > On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> > > 
> > > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > > due
> > > to some allocation failing, and I thought we should return the
> > > original
> > > -EIO in such cases so that the application doesn't lose the
> > > information
> > > that the bad block is actually causing the error.
> > EINVAL is a concern here.  Not due to the right error reported, but
> > because it means your current scheme is fundamentally broken - we
> > need to support I/O at any alignment for DAX I/O, and not fail due to
> > alignbment concernes for a highly specific degraded case.
> > 
> > I think this whole series need to go back to the drawing board as I
> > don't think it can actually rely on using direct I/O as the EIO
> > fallback.
> > 
> Agreed that DAX I/O can happen with any size/alignment, but how else do
> we send an IO through the driver without alignment restrictions? Also,
> the granularity at which we store badblocks is 512B sectors, so it
> seems natural that to clear such a sector, you'd expect to send a write
> to the whole sector.
> 
> The expected usage flow is:
> 
> - Application hits EIO doing dax_IO or load/store io
> 
> - It checks badblocks and discovers it's files have lost data

Lots of hand-waving here. How does the application map a bad
"sector" to a file without scanning the entire filesystem to find
the owner of the bad sector?

> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
>     * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector

Where does the application find the data that was lost to be able to
rewrite it?

> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
>     * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors

There's an implicit assumption that applications will keep redundant
copies of their data at the /application layer/ and be able to
automatically repair it? And then there's the implicit assumption
that it will unlink and free the entire file before writing a new
copy, and that then assumes the the filesystem will zero blocks if
they get reused to clear errors on that LBA sector mapping before
they are accessible again to userspace..

It seems to me that there are a number of assumptions being made
across multiple layers here. Maybe I've missed something - can you
point me to the design/architecture description so I can see how
"app does data recovery itself" dance is supposed to work?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 10:14 AM, Verma, Vishal L
 wrote:
> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
>> On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
>> >
>> > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
>> > due
>> > to some allocation failing, and I thought we should return the
>> > original
>> > -EIO in such cases so that the application doesn't lose the
>> > information
>> > that the bad block is actually causing the error.
>> EINVAL is a concern here.  Not due to the right error reported, but
>> because it means your current scheme is fundamentally broken - we
>> need to support I/O at any alignment for DAX I/O, and not fail due to
>> alignbment concernes for a highly specific degraded case.
>>
>> I think this whole series need to go back to the drawing board as I
>> don't think it can actually rely on using direct I/O as the EIO
>> fallback.
>>
> Agreed that DAX I/O can happen with any size/alignment, but how else do
> we send an IO through the driver without alignment restrictions? Also,
> the granularity at which we store badblocks is 512B sectors, so it
> seems natural that to clear such a sector, you'd expect to send a write
> to the whole sector.
>
> The expected usage flow is:
>
> - Application hits EIO doing dax_IO or load/store io
>
> - It checks badblocks and discovers it's files have lost data
>
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector
>
> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
> * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors
>
>
> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity

3. Arrange for O_DIRECT to bypass dax_do_io(), and leave the
optimization only for the dax "buffered I/O" case.

4. Skip dax_do_io() entirely in the presence of errors

I think 3 is the most closely aligned with the typical block device
model.  In the typical case a buffered write may fail due to a
badblock read when filling the page cache, but an O_DIRECT write would
bypass the page cache and potentially clear the error / cause the
block to be reallocated internally to the drive.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Dan Williams
On Mon, Apr 25, 2016 at 10:14 AM, Verma, Vishal L
 wrote:
> On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
>> On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
>> >
>> > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
>> > due
>> > to some allocation failing, and I thought we should return the
>> > original
>> > -EIO in such cases so that the application doesn't lose the
>> > information
>> > that the bad block is actually causing the error.
>> EINVAL is a concern here.  Not due to the right error reported, but
>> because it means your current scheme is fundamentally broken - we
>> need to support I/O at any alignment for DAX I/O, and not fail due to
>> alignbment concernes for a highly specific degraded case.
>>
>> I think this whole series need to go back to the drawing board as I
>> don't think it can actually rely on using direct I/O as the EIO
>> fallback.
>>
> Agreed that DAX I/O can happen with any size/alignment, but how else do
> we send an IO through the driver without alignment restrictions? Also,
> the granularity at which we store badblocks is 512B sectors, so it
> seems natural that to clear such a sector, you'd expect to send a write
> to the whole sector.
>
> The expected usage flow is:
>
> - Application hits EIO doing dax_IO or load/store io
>
> - It checks badblocks and discovers it's files have lost data
>
> - It write()s those sectors (possibly converted to file offsets using
> fiemap)
> * This triggers the fallback path, but if the application is doing
> this level of recovery, it will know the sector is bad, and write the
> entire sector
>
> - Or it replaces the entire file from backup also using write() (not
> mmap+stores)
> * This just frees the fs block, and the next time the block is
> reallocated by the fs, it will likely be zeroed first, and that will be
> done through the driver and will clear errors
>
>
> I think if we want to keep allowing arbitrary alignments for the
> dax_do_io path, we'd need:
> 1. To represent badblocks at a finer granularity (likely cache lines)
> 2. To allow the driver to do IO to a *block device* at sub-sector
> granularity

3. Arrange for O_DIRECT to bypass dax_do_io(), and leave the
optimization only for the dax "buffered I/O" case.

4. Skip dax_do_io() entirely in the presence of errors

I think 3 is the most closely aligned with the typical block device
model.  In the typical case a buffered write may fail due to a
badblock read when filling the page cache, but an O_DIRECT write would
bypass the page cache and potentially clear the error / cause the
block to be reallocated internally to the drive.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Verma, Vishal L
On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> > 
> > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > due
> > to some allocation failing, and I thought we should return the
> > original
> > -EIO in such cases so that the application doesn't lose the
> > information
> > that the bad block is actually causing the error.
> EINVAL is a concern here.  Not due to the right error reported, but
> because it means your current scheme is fundamentally broken - we
> need to support I/O at any alignment for DAX I/O, and not fail due to
> alignbment concernes for a highly specific degraded case.
> 
> I think this whole series need to go back to the drawing board as I
> don't think it can actually rely on using direct I/O as the EIO
> fallback.
> 
Agreed that DAX I/O can happen with any size/alignment, but how else do
we send an IO through the driver without alignment restrictions? Also,
the granularity at which we store badblocks is 512B sectors, so it
seems natural that to clear such a sector, you'd expect to send a write
to the whole sector.

The expected usage flow is:

- Application hits EIO doing dax_IO or load/store io

- It checks badblocks and discovers it's files have lost data

- It write()s those sectors (possibly converted to file offsets using
fiemap)
    * This triggers the fallback path, but if the application is doing
this level of recovery, it will know the sector is bad, and write the
entire sector

- Or it replaces the entire file from backup also using write() (not
mmap+stores)
    * This just frees the fs block, and the next time the block is
reallocated by the fs, it will likely be zeroed first, and that will be
done through the driver and will clear errors


I think if we want to keep allowing arbitrary alignments for the
dax_do_io path, we'd need:
1. To represent badblocks at a finer granularity (likely cache lines)
2. To allow the driver to do IO to a *block device* at sub-sector
granularity

Can we do that?

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Verma, Vishal L
On Mon, 2016-04-25 at 01:31 -0700, h...@infradead.org wrote:
> On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> > 
> > direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM
> > due
> > to some allocation failing, and I thought we should return the
> > original
> > -EIO in such cases so that the application doesn't lose the
> > information
> > that the bad block is actually causing the error.
> EINVAL is a concern here.  Not due to the right error reported, but
> because it means your current scheme is fundamentally broken - we
> need to support I/O at any alignment for DAX I/O, and not fail due to
> alignbment concernes for a highly specific degraded case.
> 
> I think this whole series need to go back to the drawing board as I
> don't think it can actually rely on using direct I/O as the EIO
> fallback.
> 
Agreed that DAX I/O can happen with any size/alignment, but how else do
we send an IO through the driver without alignment restrictions? Also,
the granularity at which we store badblocks is 512B sectors, so it
seems natural that to clear such a sector, you'd expect to send a write
to the whole sector.

The expected usage flow is:

- Application hits EIO doing dax_IO or load/store io

- It checks badblocks and discovers it's files have lost data

- It write()s those sectors (possibly converted to file offsets using
fiemap)
    * This triggers the fallback path, but if the application is doing
this level of recovery, it will know the sector is bad, and write the
entire sector

- Or it replaces the entire file from backup also using write() (not
mmap+stores)
    * This just frees the fs block, and the next time the block is
reallocated by the fs, it will likely be zeroed first, and that will be
done through the driver and will clear errors


I think if we want to keep allowing arbitrary alignments for the
dax_do_io path, we'd need:
1. To represent badblocks at a finer granularity (likely cache lines)
2. To allow the driver to do IO to a *block device* at sub-sector
granularity

Can we do that?

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Jeff Moyer
"h...@infradead.org"  writes:

> On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
>> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
>> to some allocation failing, and I thought we should return the original
>> -EIO in such cases so that the application doesn't lose the information
>> that the bad block is actually causing the error.
>
> EINVAL is a concern here.  Not due to the right error reported, but
> because it means your current scheme is fundamentally broken - we
> need to support I/O at any alignment for DAX I/O, and not fail due to
> alignbment concernes for a highly specific degraded case.
>
> I think this whole series need to go back to the drawing board as I
> don't think it can actually rely on using direct I/O as the EIO
> fallback.

The only callers of dax_do_io are direct_IO methods.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread Jeff Moyer
"h...@infradead.org"  writes:

> On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
>> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
>> to some allocation failing, and I thought we should return the original
>> -EIO in such cases so that the application doesn't lose the information
>> that the bad block is actually causing the error.
>
> EINVAL is a concern here.  Not due to the right error reported, but
> because it means your current scheme is fundamentally broken - we
> need to support I/O at any alignment for DAX I/O, and not fail due to
> alignbment concernes for a highly specific degraded case.
>
> I think this whole series need to go back to the drawing board as I
> don't think it can actually rely on using direct I/O as the EIO
> fallback.

The only callers of dax_do_io are direct_IO methods.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread h...@infradead.org
On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
> to some allocation failing, and I thought we should return the original
> -EIO in such cases so that the application doesn't lose the information
> that the bad block is actually causing the error.

EINVAL is a concern here.  Not due to the right error reported, but
because it means your current scheme is fundamentally broken - we
need to support I/O at any alignment for DAX I/O, and not fail due to
alignbment concernes for a highly specific degraded case.

I think this whole series need to go back to the drawing board as I
don't think it can actually rely on using direct I/O as the EIO
fallback.



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-25 Thread h...@infradead.org
On Sat, Apr 23, 2016 at 06:08:37PM +, Verma, Vishal L wrote:
> direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
> to some allocation failing, and I thought we should return the original
> -EIO in such cases so that the application doesn't lose the information
> that the bad block is actually causing the error.

EINVAL is a concern here.  Not due to the right error reported, but
because it means your current scheme is fundamentally broken - we
need to support I/O at any alignment for DAX I/O, and not fail due to
alignbment concernes for a highly specific degraded case.

I think this whole series need to go back to the drawing board as I
don't think it can actually rely on using direct I/O as the EIO
fallback.



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-23 Thread Verma, Vishal L
On Wed, 2016-04-20 at 13:59 -0700, Christoph Hellwig wrote:
> On Fri, Apr 15, 2016 at 12:11:36PM -0400, Jeff Moyer wrote:
> > 
> > > 
> > > + if (IS_DAX(inode)) {
> > > + ret = dax_do_io(iocb, inode, iter, offset,
> > > blkdev_get_block,
> > >   NULL, DIO_SKIP_DIO_COUNT);
> > > + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > > + ret_saved = ret;
> > > + else
> > > + return ret;
> > > + }
> > > +
> > > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > > iter, offset,
> > >   blkdev_get_block, NULL,
> > > NULL,
> > >   DIO_SKIP_DIO_COUNT);
> > > + if (ret < 0 && ret_saved)
> > > + return ret_saved;
> > > +
> > Hmm, did you just break async DIO?  I think you did!  :)
> > __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
> > that
> > into -EIO.  Really, I don't see a reason to save that first
> > -EIO.  The
> > same applies to all instances in this patch.
> Yes, there is no point in saving the earlier error - just return the
> second error all the time.

Is it ok to do that?

direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
to some allocation failing, and I thought we should return the original
-EIO in such cases so that the application doesn't lose the information
that the bad block is actually causing the error.

> 
> E.g.
> 
>   ret = dax_io();
>   if (dax_need_dio_retry(ret))
>   ret = direct_IO();
> 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-23 Thread Verma, Vishal L
On Wed, 2016-04-20 at 13:59 -0700, Christoph Hellwig wrote:
> On Fri, Apr 15, 2016 at 12:11:36PM -0400, Jeff Moyer wrote:
> > 
> > > 
> > > + if (IS_DAX(inode)) {
> > > + ret = dax_do_io(iocb, inode, iter, offset,
> > > blkdev_get_block,
> > >   NULL, DIO_SKIP_DIO_COUNT);
> > > + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > > + ret_saved = ret;
> > > + else
> > > + return ret;
> > > + }
> > > +
> > > + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > > iter, offset,
> > >   blkdev_get_block, NULL,
> > > NULL,
> > >   DIO_SKIP_DIO_COUNT);
> > > + if (ret < 0 && ret_saved)
> > > + return ret_saved;
> > > +
> > Hmm, did you just break async DIO?  I think you did!  :)
> > __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
> > that
> > into -EIO.  Really, I don't see a reason to save that first
> > -EIO.  The
> > same applies to all instances in this patch.
> Yes, there is no point in saving the earlier error - just return the
> second error all the time.

Is it ok to do that?

direct_IO might fail with -EINVAL due to misalignment, or -ENOMEM due
to some allocation failing, and I thought we should return the original
-EIO in such cases so that the application doesn't lose the information
that the bad block is actually causing the error.

> 
> E.g.
> 
>   ret = dax_io();
>   if (dax_need_dio_retry(ret))
>   ret = direct_IO();
> 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-20 Thread Christoph Hellwig
On Fri, Apr 15, 2016 at 12:11:36PM -0400, Jeff Moyer wrote:
> > +   if (IS_DAX(inode)) {
> > +   ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> > NULL, DIO_SKIP_DIO_COUNT);
> > +   if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > +   ret_saved = ret;
> > +   else
> > +   return ret;
> > +   }
> > +
> > +   ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> > blkdev_get_block, NULL, NULL,
> > DIO_SKIP_DIO_COUNT);
> > +   if (ret < 0 && ret_saved)
> > +   return ret_saved;
> > +
> 
> Hmm, did you just break async DIO?  I think you did!  :)
> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned that
> into -EIO.  Really, I don't see a reason to save that first -EIO.  The
> same applies to all instances in this patch.

Yes, there is no point in saving the earlier error - just return the
second error all the time.

E.g.

ret = dax_io();
if (dax_need_dio_retry(ret))
ret = direct_IO();



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-20 Thread Christoph Hellwig
On Fri, Apr 15, 2016 at 12:11:36PM -0400, Jeff Moyer wrote:
> > +   if (IS_DAX(inode)) {
> > +   ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> > NULL, DIO_SKIP_DIO_COUNT);
> > +   if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > +   ret_saved = ret;
> > +   else
> > +   return ret;
> > +   }
> > +
> > +   ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> > blkdev_get_block, NULL, NULL,
> > DIO_SKIP_DIO_COUNT);
> > +   if (ret < 0 && ret_saved)
> > +   return ret_saved;
> > +
> 
> Hmm, did you just break async DIO?  I think you did!  :)
> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned that
> into -EIO.  Really, I don't see a reason to save that first -EIO.  The
> same applies to all instances in this patch.

Yes, there is no point in saving the earlier error - just return the
second error all the time.

E.g.

ret = dax_io();
if (dax_need_dio_retry(ret))
ret = direct_IO();



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Toshi Kani
On Fri, 2016-04-15 at 13:01 -0600, Toshi Kani wrote:
> On Fri, 2016-04-15 at 11:17 -0700, Dan Williams wrote:
> > 
> > On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
> > > 
> > > Dan Williams  writes:
> > >  
> > > > > > There's a lot of special casing here, so you might consider
> > > > > > adding comments.
> > > > > Correct - maybe we should reconsider wrapper-izing this? :)
> > > > Another option is just to skip dax_do_io() and this special casing
> > > > fallback entirely if errors are present.  I.e. only attempt
> > > > dax_do_io when: IS_DAX() && gendisk->bb && bb->count == 0.
> > >
> > > So, if there's an error anywhere on the device, penalize all I/O (not
> > > just writes, and not just on sectors that are bad)?  I'm not sure
> > > that's a great plan, either.
> > > 
> > If errors are rare how much are we actually losing in practice?
> > Moreover, we're going to do the full badblocks lookup anyway when we
> > call ->direct_access().  If we had that information earlier we can
> > avoid this fallback dance.
>
> A system running with DAX may have active data set in NVDIMM lager than
> RAM size.  In this case, falling back to non-DAX will allocate page cache
> for the data, which will saturate the system with memory pressure.

Oh, sorry, we are still in DIO path.  Falling back to DIO should not cause
this issue.

-Toshi


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Toshi Kani
On Fri, 2016-04-15 at 13:01 -0600, Toshi Kani wrote:
> On Fri, 2016-04-15 at 11:17 -0700, Dan Williams wrote:
> > 
> > On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
> > > 
> > > Dan Williams  writes:
> > >  
> > > > > > There's a lot of special casing here, so you might consider
> > > > > > adding comments.
> > > > > Correct - maybe we should reconsider wrapper-izing this? :)
> > > > Another option is just to skip dax_do_io() and this special casing
> > > > fallback entirely if errors are present.  I.e. only attempt
> > > > dax_do_io when: IS_DAX() && gendisk->bb && bb->count == 0.
> > >
> > > So, if there's an error anywhere on the device, penalize all I/O (not
> > > just writes, and not just on sectors that are bad)?  I'm not sure
> > > that's a great plan, either.
> > > 
> > If errors are rare how much are we actually losing in practice?
> > Moreover, we're going to do the full badblocks lookup anyway when we
> > call ->direct_access().  If we had that information earlier we can
> > avoid this fallback dance.
>
> A system running with DAX may have active data set in NVDIMM lager than
> RAM size.  In this case, falling back to non-DAX will allocate page cache
> for the data, which will saturate the system with memory pressure.

Oh, sorry, we are still in DIO path.  Falling back to DIO should not cause
this issue.

-Toshi


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Apr 15, 2016 at 11:24 AM, Jeff Moyer  wrote:
>>> Moreover, we're going to do the full badblocks lookup anyway when we
>>> call ->direct_access().  If we had that information earlier we can
>>> avoid this fallback dance.
>>
>> None of the proposed approaches looks clean to me.  I'll go along with
>> whatever you guys think is best.  I am in favor of wrapping up all that
>> duplicated code, though.
>
> Christoph originally pushed for open coding this fallback decision
> per-filesystem.  I agree with you on the "none the above" options are
> clean.

I don't recall him saying "open code".  Rather, the sentiment was to
leave the fallback to the callers.  That doesn't mean you can't wrap it
up in a convenience function.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Apr 15, 2016 at 11:24 AM, Jeff Moyer  wrote:
>>> Moreover, we're going to do the full badblocks lookup anyway when we
>>> call ->direct_access().  If we had that information earlier we can
>>> avoid this fallback dance.
>>
>> None of the proposed approaches looks clean to me.  I'll go along with
>> whatever you guys think is best.  I am in favor of wrapping up all that
>> duplicated code, though.
>
> Christoph originally pushed for open coding this fallback decision
> per-filesystem.  I agree with you on the "none the above" options are
> clean.

I don't recall him saying "open code".  Rather, the sentiment was to
leave the fallback to the callers.  That doesn't mean you can't wrap it
up in a convenience function.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Toshi Kani
On Fri, 2016-04-15 at 11:17 -0700, Dan Williams wrote:
> On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
> > 
> > Dan Williams  writes:
> > 
> > > > > There's a lot of special casing here, so you might consider
> > > > > adding comments.
> > > > Correct - maybe we should reconsider wrapper-izing this? :)
> > > Another option is just to skip dax_do_io() and this special casing
> > > fallback entirely if errors are present.  I.e. only attempt dax_do_io
> > > when: IS_DAX() && gendisk->bb && bb->count == 0.
> >
> > So, if there's an error anywhere on the device, penalize all I/O (not
> > just writes, and not just on sectors that are bad)?  I'm not sure
> > that's a great plan, either.
> > 
> If errors are rare how much are we actually losing in practice?
> Moreover, we're going to do the full badblocks lookup anyway when we
> call ->direct_access().  If we had that information earlier we can
> avoid this fallback dance.

A system running with DAX may have active data set in NVDIMM lager than RAM
size.  In this case, falling back to non-DAX will allocate page cache for
the data, which will saturate the system with memory pressure.

Thanks,
-Toshi  



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Toshi Kani
On Fri, 2016-04-15 at 11:17 -0700, Dan Williams wrote:
> On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
> > 
> > Dan Williams  writes:
> > 
> > > > > There's a lot of special casing here, so you might consider
> > > > > adding comments.
> > > > Correct - maybe we should reconsider wrapper-izing this? :)
> > > Another option is just to skip dax_do_io() and this special casing
> > > fallback entirely if errors are present.  I.e. only attempt dax_do_io
> > > when: IS_DAX() && gendisk->bb && bb->count == 0.
> >
> > So, if there's an error anywhere on the device, penalize all I/O (not
> > just writes, and not just on sectors that are bad)?  I'm not sure
> > that's a great plan, either.
> > 
> If errors are rare how much are we actually losing in practice?
> Moreover, we're going to do the full badblocks lookup anyway when we
> call ->direct_access().  If we had that information earlier we can
> avoid this fallback dance.

A system running with DAX may have active data set in NVDIMM lager than RAM
size.  In this case, falling back to non-DAX will allocate page cache for
the data, which will saturate the system with memory pressure.

Thanks,
-Toshi  



Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Dan Williams
On Fri, Apr 15, 2016 at 11:24 AM, Jeff Moyer  wrote:
>> Moreover, we're going to do the full badblocks lookup anyway when we
>> call ->direct_access().  If we had that information earlier we can
>> avoid this fallback dance.
>
> None of the proposed approaches looks clean to me.  I'll go along with
> whatever you guys think is best.  I am in favor of wrapping up all that
> duplicated code, though.

Christoph originally pushed for open coding this fallback decision
per-filesystem.  I agree with you on the "none the above" options are
clean.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Dan Williams
On Fri, Apr 15, 2016 at 11:24 AM, Jeff Moyer  wrote:
>> Moreover, we're going to do the full badblocks lookup anyway when we
>> call ->direct_access().  If we had that information earlier we can
>> avoid this fallback dance.
>
> None of the proposed approaches looks clean to me.  I'll go along with
> whatever you guys think is best.  I am in favor of wrapping up all that
> duplicated code, though.

Christoph originally pushed for open coding this fallback decision
per-filesystem.  I agree with you on the "none the above" options are
clean.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
>> Dan Williams  writes:
>>
> There's a lot of special casing here, so you might consider adding
> comments.

 Correct - maybe we should reconsider wrapper-izing this? :)
>>>
>>> Another option is just to skip dax_do_io() and this special casing
>>> fallback entirely if errors are present.  I.e. only attempt dax_do_io
>>> when: IS_DAX() && gendisk->bb && bb->count == 0.
>>
>> So, if there's an error anywhere on the device, penalize all I/O (not
>> just writes, and not just on sectors that are bad)?  I'm not sure that's
>> a great plan, either.
>>
>
> If errors are rare how much are we actually losing in practice?

How long is a piece of string?

> Moreover, we're going to do the full badblocks lookup anyway when we
> call ->direct_access().  If we had that information earlier we can
> avoid this fallback dance.

None of the proposed approaches looks clean to me.  I'll go along with
whatever you guys think is best.  I am in favor of wrapping up all that
duplicated code, though.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Dan Williams  writes:

> On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
>> Dan Williams  writes:
>>
> There's a lot of special casing here, so you might consider adding
> comments.

 Correct - maybe we should reconsider wrapper-izing this? :)
>>>
>>> Another option is just to skip dax_do_io() and this special casing
>>> fallback entirely if errors are present.  I.e. only attempt dax_do_io
>>> when: IS_DAX() && gendisk->bb && bb->count == 0.
>>
>> So, if there's an error anywhere on the device, penalize all I/O (not
>> just writes, and not just on sectors that are bad)?  I'm not sure that's
>> a great plan, either.
>>
>
> If errors are rare how much are we actually losing in practice?

How long is a piece of string?

> Moreover, we're going to do the full badblocks lookup anyway when we
> call ->direct_access().  If we had that information earlier we can
> avoid this fallback dance.

None of the proposed approaches looks clean to me.  I'll go along with
whatever you guys think is best.  I am in favor of wrapping up all that
duplicated code, though.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Dan Williams
On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
> Dan Williams  writes:
>
 There's a lot of special casing here, so you might consider adding
 comments.
>>>
>>> Correct - maybe we should reconsider wrapper-izing this? :)
>>
>> Another option is just to skip dax_do_io() and this special casing
>> fallback entirely if errors are present.  I.e. only attempt dax_do_io
>> when: IS_DAX() && gendisk->bb && bb->count == 0.
>
> So, if there's an error anywhere on the device, penalize all I/O (not
> just writes, and not just on sectors that are bad)?  I'm not sure that's
> a great plan, either.
>

If errors are rare how much are we actually losing in practice?
Moreover, we're going to do the full badblocks lookup anyway when we
call ->direct_access().  If we had that information earlier we can
avoid this fallback dance.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Dan Williams
On Fri, Apr 15, 2016 at 11:06 AM, Jeff Moyer  wrote:
> Dan Williams  writes:
>
 There's a lot of special casing here, so you might consider adding
 comments.
>>>
>>> Correct - maybe we should reconsider wrapper-izing this? :)
>>
>> Another option is just to skip dax_do_io() and this special casing
>> fallback entirely if errors are present.  I.e. only attempt dax_do_io
>> when: IS_DAX() && gendisk->bb && bb->count == 0.
>
> So, if there's an error anywhere on the device, penalize all I/O (not
> just writes, and not just on sectors that are bad)?  I'm not sure that's
> a great plan, either.
>

If errors are rare how much are we actually losing in practice?
Moreover, we're going to do the full badblocks lookup anyway when we
call ->direct_access().  If we had that information earlier we can
avoid this fallback dance.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Dan Williams  writes:

>>> There's a lot of special casing here, so you might consider adding
>>> comments.
>>
>> Correct - maybe we should reconsider wrapper-izing this? :)
>
> Another option is just to skip dax_do_io() and this special casing
> fallback entirely if errors are present.  I.e. only attempt dax_do_io
> when: IS_DAX() && gendisk->bb && bb->count == 0.

So, if there's an error anywhere on the device, penalize all I/O (not
just writes, and not just on sectors that are bad)?  I'm not sure that's
a great plan, either.

-Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Dan Williams  writes:

>>> There's a lot of special casing here, so you might consider adding
>>> comments.
>>
>> Correct - maybe we should reconsider wrapper-izing this? :)
>
> Another option is just to skip dax_do_io() and this special casing
> fallback entirely if errors are present.  I.e. only attempt dax_do_io
> when: IS_DAX() && gendisk->bb && bb->count == 0.

So, if there's an error anywhere on the device, penalize all I/O (not
just writes, and not just on sectors that are bad)?  I'm not sure that's
a great plan, either.

-Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Dan Williams
On Fri, Apr 15, 2016 at 10:37 AM, Verma, Vishal L
 wrote:
> On Fri, 2016-04-15 at 13:11 -0400, Jeff Moyer wrote:
[..]
>> >
>> > But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
>> For async direct I/O, only the setup phase of the I/O is performed
>> and
>> then we return to the caller.  -EIOCBQUEUED signifies this.
>>
>> You're heading towards code that looks like this:
>>
>> if (IS_DAX(inode)) {
>> ret = dax_do_io(iocb, inode, iter, offset,
>> blkdev_get_block,
>> NULL, DIO_SKIP_DIO_COUNT);
>> if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
>> ret_saved = ret;
>> else
>> return ret;
>> }
>>
>> ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
>> offset,
>> blkdev_get_block, NULL, NULL,
>> DIO_SKIP_DIO_COUNT);
>> if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
>> return ret_saved;
>>
>> There's a lot of special casing here, so you might consider adding
>> comments.
>
> Correct - maybe we should reconsider wrapper-izing this? :)

Another option is just to skip dax_do_io() and this special casing
fallback entirely if errors are present.  I.e. only attempt dax_do_io
when: IS_DAX() && gendisk->bb && bb->count == 0.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Dan Williams
On Fri, Apr 15, 2016 at 10:37 AM, Verma, Vishal L
 wrote:
> On Fri, 2016-04-15 at 13:11 -0400, Jeff Moyer wrote:
[..]
>> >
>> > But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
>> For async direct I/O, only the setup phase of the I/O is performed
>> and
>> then we return to the caller.  -EIOCBQUEUED signifies this.
>>
>> You're heading towards code that looks like this:
>>
>> if (IS_DAX(inode)) {
>> ret = dax_do_io(iocb, inode, iter, offset,
>> blkdev_get_block,
>> NULL, DIO_SKIP_DIO_COUNT);
>> if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
>> ret_saved = ret;
>> else
>> return ret;
>> }
>>
>> ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
>> offset,
>> blkdev_get_block, NULL, NULL,
>> DIO_SKIP_DIO_COUNT);
>> if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
>> return ret_saved;
>>
>> There's a lot of special casing here, so you might consider adding
>> comments.
>
> Correct - maybe we should reconsider wrapper-izing this? :)

Another option is just to skip dax_do_io() and this special casing
fallback entirely if errors are present.  I.e. only attempt dax_do_io
when: IS_DAX() && gendisk->bb && bb->count == 0.


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Verma, Vishal L
On Fri, 2016-04-15 at 13:11 -0400, Jeff Moyer wrote:
> "Verma, Vishal L"  writes:
> 
> > 
> > On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
> > > 
> > > Vishal Verma  writes:
> > > > 
> > > > +   if (IS_DAX(inode)) {
> > > > +   ret = dax_do_io(iocb, inode, iter, offset,
> > > > blkdev_get_block,
> > > >     NULL, DIO_SKIP_DIO_COUNT);
> > > > -   return __blockdev_direct_IO(iocb, inode,
> > > > I_BDEV(inode),
> > > > iter, offset,
> > > > +   if (ret == -EIO && (iov_iter_rw(iter) ==
> > > > WRITE))
> > > > +   ret_saved = ret;
> > > > +   else
> > > > +   return ret;
> > > > +   }
> > > > +
> > > > +   ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > > > iter, offset,
> > > >     blkdev_get_block, NULL,
> > > > NULL,
> > > >     DIO_SKIP_DIO_COUNT);
> > > > +   if (ret < 0 && ret_saved)
> > > > +   return ret_saved;
> > > > +
> > > Hmm, did you just break async DIO?  I think you did!  :)
> > > __blockdev_direct_IO can return -EIOCBQUEUED, and you've now
> > > turned
> > > that
> > > into -EIO.  Really, I don't see a reason to save that first
> > > -EIO.  The
> > > same applies to all instances in this patch.
> > The reason I saved it was if __blockdev_direct_IO fails for some
> > reason, we should return the original cause o the error, which was
> > an
> > EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails
> > with
> > something else..
> OK.
> 
> > 
> > But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
> For async direct I/O, only the setup phase of the I/O is performed
> and
> then we return to the caller.  -EIOCBQUEUED signifies this.
> 
> You're heading towards code that looks like this:
> 
> if (IS_DAX(inode)) {
> ret = dax_do_io(iocb, inode, iter, offset,
> blkdev_get_block,
> NULL, DIO_SKIP_DIO_COUNT);
> if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> ret_saved = ret;
> else
> return ret;
> }
> 
> ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
> offset,
> blkdev_get_block, NULL, NULL,
> DIO_SKIP_DIO_COUNT);
> if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
> return ret_saved;
> 
> There's a lot of special casing here, so you might consider adding
> comments.

Correct - maybe we should reconsider wrapper-izing this? :)

Thanks for the explanation and for catching this. I'll fix it for the
next revision.

> 
> Cheers,
> Jeff

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Verma, Vishal L
On Fri, 2016-04-15 at 13:11 -0400, Jeff Moyer wrote:
> "Verma, Vishal L"  writes:
> 
> > 
> > On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
> > > 
> > > Vishal Verma  writes:
> > > > 
> > > > +   if (IS_DAX(inode)) {
> > > > +   ret = dax_do_io(iocb, inode, iter, offset,
> > > > blkdev_get_block,
> > > >     NULL, DIO_SKIP_DIO_COUNT);
> > > > -   return __blockdev_direct_IO(iocb, inode,
> > > > I_BDEV(inode),
> > > > iter, offset,
> > > > +   if (ret == -EIO && (iov_iter_rw(iter) ==
> > > > WRITE))
> > > > +   ret_saved = ret;
> > > > +   else
> > > > +   return ret;
> > > > +   }
> > > > +
> > > > +   ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > > > iter, offset,
> > > >     blkdev_get_block, NULL,
> > > > NULL,
> > > >     DIO_SKIP_DIO_COUNT);
> > > > +   if (ret < 0 && ret_saved)
> > > > +   return ret_saved;
> > > > +
> > > Hmm, did you just break async DIO?  I think you did!  :)
> > > __blockdev_direct_IO can return -EIOCBQUEUED, and you've now
> > > turned
> > > that
> > > into -EIO.  Really, I don't see a reason to save that first
> > > -EIO.  The
> > > same applies to all instances in this patch.
> > The reason I saved it was if __blockdev_direct_IO fails for some
> > reason, we should return the original cause o the error, which was
> > an
> > EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails
> > with
> > something else..
> OK.
> 
> > 
> > But, how does _EIOCBQUEUED work? Maybe we need an exception for it?
> For async direct I/O, only the setup phase of the I/O is performed
> and
> then we return to the caller.  -EIOCBQUEUED signifies this.
> 
> You're heading towards code that looks like this:
> 
> if (IS_DAX(inode)) {
> ret = dax_do_io(iocb, inode, iter, offset,
> blkdev_get_block,
> NULL, DIO_SKIP_DIO_COUNT);
> if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> ret_saved = ret;
> else
> return ret;
> }
> 
> ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
> offset,
> blkdev_get_block, NULL, NULL,
> DIO_SKIP_DIO_COUNT);
> if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
> return ret_saved;
> 
> There's a lot of special casing here, so you might consider adding
> comments.

Correct - maybe we should reconsider wrapper-izing this? :)

Thanks for the explanation and for catching this. I'll fix it for the
next revision.

> 
> Cheers,
> Jeff

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
"Verma, Vishal L"  writes:

> On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
>> Vishal Verma  writes:
>> > +  if (IS_DAX(inode)) {
>> > +  ret = dax_do_io(iocb, inode, iter, offset,
>> > blkdev_get_block,
>> >    NULL, DIO_SKIP_DIO_COUNT);
>> > -  return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
>> > iter, offset,
>> > +  if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
>> > +  ret_saved = ret;
>> > +  else
>> > +  return ret;
>> > +  }
>> > +
>> > +  ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
>> > iter, offset,
>> >    blkdev_get_block, NULL, NULL,
>> >    DIO_SKIP_DIO_COUNT);
>> > +  if (ret < 0 && ret_saved)
>> > +  return ret_saved;
>> > +
>> Hmm, did you just break async DIO?  I think you did!  :)
>> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
>> that
>> into -EIO.  Really, I don't see a reason to save that first
>> -EIO.  The
>> same applies to all instances in this patch.
>
> The reason I saved it was if __blockdev_direct_IO fails for some
> reason, we should return the original cause o the error, which was an
> EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails with
> something else..

OK.

> But, how does _EIOCBQUEUED work? Maybe we need an exception for it?

For async direct I/O, only the setup phase of the I/O is performed and
then we return to the caller.  -EIOCBQUEUED signifies this.

You're heading towards code that looks like this:

if (IS_DAX(inode)) {
ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
NULL, DIO_SKIP_DIO_COUNT);
if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
ret_saved = ret;
else
return ret;
}

ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
blkdev_get_block, NULL, NULL,
DIO_SKIP_DIO_COUNT);
if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
return ret_saved;

There's a lot of special casing here, so you might consider adding
comments.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
"Verma, Vishal L"  writes:

> On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
>> Vishal Verma  writes:
>> > +  if (IS_DAX(inode)) {
>> > +  ret = dax_do_io(iocb, inode, iter, offset,
>> > blkdev_get_block,
>> >    NULL, DIO_SKIP_DIO_COUNT);
>> > -  return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
>> > iter, offset,
>> > +  if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
>> > +  ret_saved = ret;
>> > +  else
>> > +  return ret;
>> > +  }
>> > +
>> > +  ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
>> > iter, offset,
>> >    blkdev_get_block, NULL, NULL,
>> >    DIO_SKIP_DIO_COUNT);
>> > +  if (ret < 0 && ret_saved)
>> > +  return ret_saved;
>> > +
>> Hmm, did you just break async DIO?  I think you did!  :)
>> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
>> that
>> into -EIO.  Really, I don't see a reason to save that first
>> -EIO.  The
>> same applies to all instances in this patch.
>
> The reason I saved it was if __blockdev_direct_IO fails for some
> reason, we should return the original cause o the error, which was an
> EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails with
> something else..

OK.

> But, how does _EIOCBQUEUED work? Maybe we need an exception for it?

For async direct I/O, only the setup phase of the I/O is performed and
then we return to the caller.  -EIOCBQUEUED signifies this.

You're heading towards code that looks like this:

if (IS_DAX(inode)) {
ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
NULL, DIO_SKIP_DIO_COUNT);
if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
ret_saved = ret;
else
return ret;
}

ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
blkdev_get_block, NULL, NULL,
DIO_SKIP_DIO_COUNT);
if (ret < 0 && ret != -EIOCBQUEUED && ret_saved)
return ret_saved;

There's a lot of special casing here, so you might consider adding
comments.

Cheers,
Jeff


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Verma, Vishal L
On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
> Vishal Verma  writes:
> 
> > 
> > dax_do_io (called for read() or write() for a dax file system) may
> > fail
> > in the presence of bad blocks or media errors. Since we expect that
> > a
> > write should clear media errors on nvdimms, make dax_do_io fall
> > back to
> > the direct_IO path, which will send down a bio to the driver, which
> > can
> > then attempt to clear the error.
> [snip]
> 
> > 
> > +   if (IS_DAX(inode)) {
> > +   ret = dax_do_io(iocb, inode, iter, offset,
> > blkdev_get_block,
> >     NULL, DIO_SKIP_DIO_COUNT);
> > -   return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> > +   if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > +   ret_saved = ret;
> > +   else
> > +   return ret;
> > +   }
> > +
> > +   ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> >     blkdev_get_block, NULL, NULL,
> >     DIO_SKIP_DIO_COUNT);
> > +   if (ret < 0 && ret_saved)
> > +   return ret_saved;
> > +
> Hmm, did you just break async DIO?  I think you did!  :)
> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
> that
> into -EIO.  Really, I don't see a reason to save that first
> -EIO.  The
> same applies to all instances in this patch.

The reason I saved it was if __blockdev_direct_IO fails for some
reason, we should return the original cause o the error, which was an
EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails with
something else..
But, how does _EIOCBQUEUED work? Maybe we need an exception for it? 

Thanks,
-Vishal

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Verma, Vishal L
On Fri, 2016-04-15 at 12:11 -0400, Jeff Moyer wrote:
> Vishal Verma  writes:
> 
> > 
> > dax_do_io (called for read() or write() for a dax file system) may
> > fail
> > in the presence of bad blocks or media errors. Since we expect that
> > a
> > write should clear media errors on nvdimms, make dax_do_io fall
> > back to
> > the direct_IO path, which will send down a bio to the driver, which
> > can
> > then attempt to clear the error.
> [snip]
> 
> > 
> > +   if (IS_DAX(inode)) {
> > +   ret = dax_do_io(iocb, inode, iter, offset,
> > blkdev_get_block,
> >     NULL, DIO_SKIP_DIO_COUNT);
> > -   return __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> > +   if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> > +   ret_saved = ret;
> > +   else
> > +   return ret;
> > +   }
> > +
> > +   ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode),
> > iter, offset,
> >     blkdev_get_block, NULL, NULL,
> >     DIO_SKIP_DIO_COUNT);
> > +   if (ret < 0 && ret_saved)
> > +   return ret_saved;
> > +
> Hmm, did you just break async DIO?  I think you did!  :)
> __blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned
> that
> into -EIO.  Really, I don't see a reason to save that first
> -EIO.  The
> same applies to all instances in this patch.

The reason I saved it was if __blockdev_direct_IO fails for some
reason, we should return the original cause o the error, which was an
EIO.. i.e. we shouldn't be hiding the EIO if the direct_IO fails with
something else..
But, how does _EIOCBQUEUED work? Maybe we need an exception for it? 

Thanks,
-Vishal

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Vishal Verma  writes:

> dax_do_io (called for read() or write() for a dax file system) may fail
> in the presence of bad blocks or media errors. Since we expect that a
> write should clear media errors on nvdimms, make dax_do_io fall back to
> the direct_IO path, which will send down a bio to the driver, which can
> then attempt to clear the error.

[snip]

> + if (IS_DAX(inode)) {
> + ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
>   NULL, DIO_SKIP_DIO_COUNT);
> - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> + ret_saved = ret;
> + else
> + return ret;
> + }
> +
> + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
>   blkdev_get_block, NULL, NULL,
>   DIO_SKIP_DIO_COUNT);
> + if (ret < 0 && ret_saved)
> + return ret_saved;
> +

Hmm, did you just break async DIO?  I think you did!  :)
__blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned that
into -EIO.  Really, I don't see a reason to save that first -EIO.  The
same applies to all instances in this patch.

Cheers,
Jeff


> + return ret;
>  }
>  
>  int __sync_blockdev(struct block_device *bdev, int wait)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 824f249..64792c6 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -859,14 +859,22 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, loff_t offset)
>   struct address_space *mapping = file->f_mapping;
>   struct inode *inode = mapping->host;
>   size_t count = iov_iter_count(iter);
> - ssize_t ret;
> + ssize_t ret, ret_saved = 0;
>  
> - if (IS_DAX(inode))
> - ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
> - DIO_LOCKING);
> - else
> - ret = blockdev_direct_IO(iocb, inode, iter, offset,
> -  ext2_get_block);
> + if (IS_DAX(inode)) {
> + ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block,
> + NULL, DIO_LOCKING | DIO_SKIP_HOLES);
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto out;
> + }
> +
> + ret = blockdev_direct_IO(iocb, inode, iter, offset, ext2_get_block);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +
> + out:
>   if (ret < 0 && iov_iter_rw(iter) == WRITE)
>   ext2_write_failed(mapping, offset + count);
>   return ret;
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 3027fa6..798f341 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -716,14 +716,22 @@ retry:
>  NULL, NULL, 0);
>   inode_dio_end(inode);
>   } else {
> + ssize_t ret_saved = 0;
> +
>  locked:
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
>   ret = dax_do_io(iocb, inode, iter, offset,
>   ext4_dio_get_block, NULL, DIO_LOCKING);
> - else
> - ret = blockdev_direct_IO(iocb, inode, iter, offset,
> -  ext4_dio_get_block);
> -
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto skip_dio;
> + }
> + ret = blockdev_direct_IO(iocb, inode, iter, offset,
> +  ext4_get_block);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +skip_dio:
>   if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
>   loff_t isize = i_size_read(inode);
>   loff_t end = offset + count;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index dab84a2..27f07c2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3341,7 +3341,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter,
>  {
>   struct file *file = iocb->ki_filp;
>   struct inode *inode = file->f_mapping->host;
> - ssize_t ret;
> + ssize_t ret, ret_saved = 0;
>   size_t count = iov_iter_count(iter);
>   int overwrite = 0;
>   get_block_t *get_block_func = NULL;
> @@ -3401,15 +3401,22 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter,
>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>   BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
>  #endif
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
>   ret = dax_do_io(iocb, inode, iter, offset, 

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-04-15 Thread Jeff Moyer
Vishal Verma  writes:

> dax_do_io (called for read() or write() for a dax file system) may fail
> in the presence of bad blocks or media errors. Since we expect that a
> write should clear media errors on nvdimms, make dax_do_io fall back to
> the direct_IO path, which will send down a bio to the driver, which can
> then attempt to clear the error.

[snip]

> + if (IS_DAX(inode)) {
> + ret = dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
>   NULL, DIO_SKIP_DIO_COUNT);
> - return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> + if (ret == -EIO && (iov_iter_rw(iter) == WRITE))
> + ret_saved = ret;
> + else
> + return ret;
> + }
> +
> + ret = __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
>   blkdev_get_block, NULL, NULL,
>   DIO_SKIP_DIO_COUNT);
> + if (ret < 0 && ret_saved)
> + return ret_saved;
> +

Hmm, did you just break async DIO?  I think you did!  :)
__blockdev_direct_IO can return -EIOCBQUEUED, and you've now turned that
into -EIO.  Really, I don't see a reason to save that first -EIO.  The
same applies to all instances in this patch.

Cheers,
Jeff


> + return ret;
>  }
>  
>  int __sync_blockdev(struct block_device *bdev, int wait)
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 824f249..64792c6 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -859,14 +859,22 @@ ext2_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, loff_t offset)
>   struct address_space *mapping = file->f_mapping;
>   struct inode *inode = mapping->host;
>   size_t count = iov_iter_count(iter);
> - ssize_t ret;
> + ssize_t ret, ret_saved = 0;
>  
> - if (IS_DAX(inode))
> - ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block, NULL,
> - DIO_LOCKING);
> - else
> - ret = blockdev_direct_IO(iocb, inode, iter, offset,
> -  ext2_get_block);
> + if (IS_DAX(inode)) {
> + ret = dax_do_io(iocb, inode, iter, offset, ext2_get_block,
> + NULL, DIO_LOCKING | DIO_SKIP_HOLES);
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto out;
> + }
> +
> + ret = blockdev_direct_IO(iocb, inode, iter, offset, ext2_get_block);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +
> + out:
>   if (ret < 0 && iov_iter_rw(iter) == WRITE)
>   ext2_write_failed(mapping, offset + count);
>   return ret;
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 3027fa6..798f341 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -716,14 +716,22 @@ retry:
>  NULL, NULL, 0);
>   inode_dio_end(inode);
>   } else {
> + ssize_t ret_saved = 0;
> +
>  locked:
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
>   ret = dax_do_io(iocb, inode, iter, offset,
>   ext4_dio_get_block, NULL, DIO_LOCKING);
> - else
> - ret = blockdev_direct_IO(iocb, inode, iter, offset,
> -  ext4_dio_get_block);
> -
> + if (ret == -EIO && iov_iter_rw(iter) == WRITE)
> + ret_saved = ret;
> + else
> + goto skip_dio;
> + }
> + ret = blockdev_direct_IO(iocb, inode, iter, offset,
> +  ext4_get_block);
> + if (ret < 0 && ret_saved)
> + ret = ret_saved;
> +skip_dio:
>   if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
>   loff_t isize = i_size_read(inode);
>   loff_t end = offset + count;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index dab84a2..27f07c2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3341,7 +3341,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter,
>  {
>   struct file *file = iocb->ki_filp;
>   struct inode *inode = file->f_mapping->host;
> - ssize_t ret;
> + ssize_t ret, ret_saved = 0;
>   size_t count = iov_iter_count(iter);
>   int overwrite = 0;
>   get_block_t *get_block_func = NULL;
> @@ -3401,15 +3401,22 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter,
>  #ifdef CONFIG_EXT4_FS_ENCRYPTION
>   BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
>  #endif
> - if (IS_DAX(inode))
> + if (IS_DAX(inode)) {
>   ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
>   

Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-03-30 Thread Christoph Hellwig
On Wed, Mar 30, 2016 at 12:54:37AM -0600, Vishal Verma wrote:
> On Tue, 2016-03-29 at 23:34 -0700, Christoph Hellwig wrote:
> > Hi Vishal,
> > 
> > still NAK to calling the direct I/O code directly from the dax code.
> 
> Hm, I thought this was what you meant -- do the fallback/retry attempts
> at the callers of dax_do_io instead of the new dax wrapper function..
> Did I misunderstand you?

Sorry, it is.  I misread fs/block_dev.c as fs/dax.c before my first
coffee this morning.  I'll properly review the series in the afternoon..


Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

2016-03-30 Thread Christoph Hellwig
On Wed, Mar 30, 2016 at 12:54:37AM -0600, Vishal Verma wrote:
> On Tue, 2016-03-29 at 23:34 -0700, Christoph Hellwig wrote:
> > Hi Vishal,
> > 
> > still NAK to calling the direct I/O code directly from the dax code.
> 
> Hm, I thought this was what you meant -- do the fallback/retry attempts
> at the callers of dax_do_io instead of the new dax wrapper function..
> Did I misunderstand you?

Sorry, it is.  I misread fs/block_dev.c as fs/dax.c before my first
coffee this morning.  I'll properly review the series in the afternoon..


  1   2   >