Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-31 Thread Ross Zwisler
On Tue, Oct 31, 2017 at 02:50:01PM -0700, Dan Williams wrote:
> On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara  wrote:
> > On Fri 27-10-17 12:08:34, Jan Kara wrote:
> >> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> > > > > index f179bdf1644d..b43be199fbdf 100644
> >> > > > > --- a/fs/xfs/xfs_iomap.c
> >> > > > > +++ b/fs/xfs/xfs_iomap.c
> >> > > > > @@ -33,6 +33,7 @@
> >> > > > >  #include "xfs_error.h"
> >> > > > >  #include "xfs_trans.h"
> >> > > > >  #include "xfs_trans_space.h"
> >> > > > > +#include "xfs_inode_item.h"
> >> > > > >  #include "xfs_iomap.h"
> >> > > > >  #include "xfs_trace.h"
> >> > > > >  #include "xfs_icache.h"
> >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> >> > > > >   trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> >> > > > >   }
> >> > > > >
> >> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> >> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> >> > > > > + iomap->flags |= IOMAP_F_DIRTY;
> >> > > >
> >> > > > This is the very definition of an inode that is "fdatasync dirty".
> >> > > >
> >> > > > Hmmm, shouldn't this also be set for read faults, too?
> >> > >
> >> > > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot
> >> > > write any data to the page which he'd then like to be persistent. The
> >> > > only reason why I thought it could be useful for a while was that it
> >> > > would be nice to make a MAP_SYNC mapping provide the guarantee that the
> >> > > data you see now is the data you'll see after a crash
> >> >
> >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
> >> > from a page fault, the app knows that the data and its underlying
> >> > extent is on persistent storage?
> >> >
> >> > > but we cannot provide that guarantee for RO
> >> > > mapping anyway if someone else has the page mapped as well. So I just
> >> > > decided not to return IOMAP_F_DIRTY for read faults.
> >> >
> >> > If there are multiple MAP_SYNC mappings to the inode, I would have
> >> > expected that they all sync all of the data/metadata on every page
> >> > fault, regardless of who dirtied the inode. An RO mapping doesn't
> >>
> >> Well, they all do sync on every *write* fault, regardless of who dirtied
> >> the inode.
> >>
> >> > mean the data/metadata on the inode can't change, it just means it
> >> > can't change through that mapping.  Running fsync() to guarantee the
> >> > persistence of that data/metadata doesn't actually change any
> >> > data
> >> >
> >> > IOWs, if read faults don't guarantee the mapped range has stable
> >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> >> > because it's not giving consistent guarantees to userspace. Yes, it
> >> > works fine when only one MAP_SYNC mapping is modifying the inode,
> >> > but the moment we have concurrent operations on the inode that
> >> > aren't MAP_SYNC or O_SYNC this goes out the window
> >>
> >> MAP_SYNC as I've implemented it provides guarantees only for data the
> >> process has actually written. I agree with that and it was a conscious
> >> decision. In my opinion that covers most usecases, provides reasonably
> >> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
> >> persist it just using CPU instructions), and reasonable performance.
> >>
> >> Now you seem to suggest the semantics should be: "Data you have read from
> >> or written to a MAP_SYNC mapping can be persisted using CPU instructions."
> >> And from an implementation POV we can do that rather easily (just rip out
> >> the IOMAP_WRITE checks). But I'm unsure whether this additional guarantee
> >> would be useful enough to justify the slowdown of read faults. I was not
> >> able to come up with a good use case, and so I've settled on the current
> >> semantics. What do other people think?
> >
> > Nobody commented on this for a couple of days, so how do we proceed? I
> > would prefer to go with just a guarantee for written data, and we can
> > always make the guarantee stronger (i.e. apply it also to read data) when
> > some user comes up with a good use case.
> 
> I think it is easier to strengthen the guarantee than to loosen it later,
> especially since it is not yet clear that we have a use case for the
> stronger semantic. At least the initial motivation for MAP_SYNC was
> for writers.

I agree.  It seems like all threads/processes in a given application need to
use MAP_SYNC consistently so they can be sure that data that is written (and
then possibly read) will be durable on media.  I think what you have is a good
starting point, and we can adjust later if necessary.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-31 Thread Dan Williams
On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara  wrote:
> On Fri 27-10-17 12:08:34, Jan Kara wrote:
>> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
>> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
>> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> > > > > index f179bdf1644d..b43be199fbdf 100644
>> > > > > --- a/fs/xfs/xfs_iomap.c
>> > > > > +++ b/fs/xfs/xfs_iomap.c
>> > > > > @@ -33,6 +33,7 @@
>> > > > >  #include "xfs_error.h"
>> > > > >  #include "xfs_trans.h"
>> > > > >  #include "xfs_trans_space.h"
>> > > > > +#include "xfs_inode_item.h"
>> > > > >  #include "xfs_iomap.h"
>> > > > >  #include "xfs_trace.h"
>> > > > >  #include "xfs_icache.h"
>> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
>> > > > >   trace_xfs_iomap_found(ip, offset, length, 0, &imap);
>> > > > >   }
>> > > > >
>> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
>> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>> > > > > + iomap->flags |= IOMAP_F_DIRTY;
>> > > >
>> > > > This is the very definition of an inode that is "fdatasync dirty".
>> > > >
>> > > > Hmmm, shouldn't this also be set for read faults, too?
>> > >
>> > > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot
>> > > write any data to the page which he'd then like to be persistent. The
>> > > only reason why I thought it could be useful for a while was that it
>> > > would be nice to make a MAP_SYNC mapping provide the guarantee that the
>> > > data you see now is the data you'll see after a crash
>> >
>> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
>> > from a page fault, the app knows that the data and its underlying
>> > extent is on persistent storage?
>> >
>> > > but we cannot provide that guarantee for RO
>> > > mapping anyway if someone else has the page mapped as well. So I just
>> > > decided not to return IOMAP_F_DIRTY for read faults.
>> >
>> > If there are multiple MAP_SYNC mappings to the inode, I would have
>> > expected that they all sync all of the data/metadata on every page
>> > fault, regardless of who dirtied the inode. An RO mapping doesn't
>>
>> Well, they all do sync on every *write* fault, regardless of who dirtied
>> the inode.
>>
>> > mean the data/metadata on the inode can't change, it just means it
>> > can't change through that mapping.  Running fsync() to guarantee the
>> > persistence of that data/metadata doesn't actually change any
>> > data
>> >
>> > IOWs, if read faults don't guarantee the mapped range has stable
>> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
>> > because it's not giving consistent guarantees to userspace. Yes, it
>> > works fine when only one MAP_SYNC mapping is modifying the inode,
>> > but the moment we have concurrent operations on the inode that
>> > aren't MAP_SYNC or O_SYNC this goes out the window
>>
>> MAP_SYNC as I've implemented it provides guarantees only for data the
>> process has actually written. I agree with that and it was a conscious
>> decision. In my opinion that covers most usecases, provides reasonably
>> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
>> persist it just using CPU instructions), and reasonable performance.
>>
>> Now you seem to suggest the semantics should be: "Data you have read from or
>> written to a MAP_SYNC mapping can be persisted using CPU instructions." And
>> from an implementation POV we can do that rather easily (just rip out the
>> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
>> be useful enough to justify the slowdown of read faults. I was not able to
>> come up with a good use case, and so I've settled on the current semantics. What
>> do other people think?
>
> Nobody commented on this for a couple of days, so how do we proceed? I would
> prefer to go with just a guarantee for written data, and we can always make
> the guarantee stronger (i.e. apply it also to read data) when some user
> comes up with a good use case.

I think it is easier to strengthen the guarantee than to loosen it later,
especially since it is not yet clear that we have a use case for the
stronger semantic. At least the initial motivation for MAP_SYNC was
for writers.


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-31 Thread Jan Kara
On Fri 27-10-17 12:08:34, Jan Kara wrote:
> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index f179bdf1644d..b43be199fbdf 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -33,6 +33,7 @@
> > > > >  #include "xfs_error.h"
> > > > >  #include "xfs_trans.h"
> > > > >  #include "xfs_trans_space.h"
> > > > > +#include "xfs_inode_item.h"
> > > > >  #include "xfs_iomap.h"
> > > > >  #include "xfs_trace.h"
> > > > >  #include "xfs_icache.h"
> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > > >   trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > > > >   }
> > > > >  
> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > > > + iomap->flags |= IOMAP_F_DIRTY;
> > > > 
> > > > This is the very definition of an inode that is "fdatasync dirty".
> > > > 
> > > > Hmmm, shouldn't this also be set for read faults, too?
> > > 
> > > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot
> > > write any data to the page which he'd then like to be persistent. The
> > > only reason why I thought it could be useful for a while was that it
> > > would be nice to make a MAP_SYNC mapping provide the guarantee that the
> > > data you see now is the data you'll see after a crash
> > 
> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
> > from a page fault, the app knows that the data and its underlying
> > extent is on persistent storage?
> > 
> > > but we cannot provide that guarantee for RO
> > > mapping anyway if someone else has the page mapped as well. So I just
> > > decided not to return IOMAP_F_DIRTY for read faults.
> > 
> > If there are multiple MAP_SYNC mappings to the inode, I would have
> > expected that they all sync all of the data/metadata on every page
> > fault, regardless of who dirtied the inode. An RO mapping doesn't
> 
> Well, they all do sync on every *write* fault, regardless of who dirtied
> the inode.
> 
> > mean the data/metadata on the inode can't change, it just means it
> > can't change through that mapping.  Running fsync() to guarantee the
> > persistence of that data/metadata doesn't actually change any
> > data
> > 
> > IOWs, if read faults don't guarantee the mapped range has stable
> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> > because it's not giving consistent guarantees to userspace. Yes, it
> > works fine when only one MAP_SYNC mapping is modifying the inode,
> > but the moment we have concurrent operations on the inode that
> > aren't MAP_SYNC or O_SYNC this goes out the window
> 
> MAP_SYNC as I've implemented it provides guarantees only for data the
> process has actually written. I agree with that and it was a conscious
> decision. In my opinion that covers most usecases, provides reasonably
> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
> persist it just using CPU instructions), and reasonable performance.
> 
> Now you seem to suggest the semantics should be: "Data you have read from or
> written to a MAP_SYNC mapping can be persisted using CPU instructions." And
> from an implementation POV we can do that rather easily (just rip out the
> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
> be useful enough to justify the slowdown of read faults. I was not able to
> come up with a good use case, and so I've settled on the current semantics. What
> do other people think?

Nobody commented on this for a couple of days, so how do we proceed? I would
prefer to go with just a guarantee for written data, and we can always make
the guarantee stronger (i.e. apply it also to read data) when some user
comes up with a good use case.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-27 Thread Jan Kara
On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index f179bdf1644d..b43be199fbdf 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -33,6 +33,7 @@
> > > >  #include "xfs_error.h"
> > > >  #include "xfs_trans.h"
> > > >  #include "xfs_trans_space.h"
> > > > +#include "xfs_inode_item.h"
> > > >  #include "xfs_iomap.h"
> > > >  #include "xfs_trace.h"
> > > >  #include "xfs_icache.h"
> > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > > trace_xfs_iomap_found(ip, offset, length, 0, );
> > > > }
> > > >  
> > > > +   if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > > +   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > > +   iomap->flags |= IOMAP_F_DIRTY;
> > > 
> > > This is the very definition of an inode that is "fdatasync dirty".
> > > 
> > > Hmmm, shouldn't this also be set for read faults, too?
> > 
> > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot write any
> > data to the page which he'd then like to be persistent. The only reason why
> > I thought it could be useful for a while was that it would be nice to make
> > MAP_SYNC mapping provide the guarantee that data you see now is the data
> > you'll see after a crash
> 
> Isn't that the entire point of MAP_SYNC? i.e. That when we return
> from a page fault, the app knows that the data and its underlying
> extent is on persistent storage?
> 
> > but we cannot provide that guarantee for RO
> > mapping anyway if someone else has the page mapped as well. So I just
> > decided not to return IOMAP_F_DIRTY for read faults.
> 
> If there are multiple MAP_SYNC mappings to the inode, I would have
> expected that they all sync all of the data/metadata on every page
> fault, regardless of who dirtied the inode. An RO mapping doesn't

Well, they all do sync on every *write* fault, regardless of who dirtied
the inode.

> mean the data/metadata on the inode can't change, it just means it
> can't change through that mapping.  Running fsync() to guarantee the
> persistence of that data/metadata doesn't actually change any
> data
> 
> IOWs, if read faults don't guarantee the mapped range has stable
> extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> because it's not giving consistent guarantees to userspace. Yes, it
> works fine when only one MAP_SYNC mapping is modifying the inode,
> but the moment we have concurrent operations on the inode that
> aren't MAP_SYNC or O_SYNC this goes out the window

MAP_SYNC as I've implemented it provides guarantees only for data the
process has actually written. I agree with that and it was a conscious
decision. In my opinion that covers most usecases, provides reasonably
simple semantics (i.e., if you write data through MAP_SYNC mapping, you can
persist it just using CPU instructions), and reasonable performance.

Now you seem to suggest the semantics should be: "Data you have read from or
written to a MAP_SYNC mapping can be persisted using CPU instructions." And
from an implementation POV we can do that rather easily (just rip out the
IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
be useful enough to justify the slowdown of read faults. I was not able to
come up with a good use case, and so I've settled on the current semantics. What
do other people think?

And now that I've spelled out the exact semantics, I don't think your
comparison that you can fsync() data you didn't write quite holds - with
MAP_SYNC you will have to at least read the data to be able to persist it,
and fsync() has no such requirement...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-27 Thread Jan Kara
On Fri 27-10-17 08:43:01, Christoph Hellwig wrote:
> On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > But now that I look at XFS implementation again, it misses handling
> > of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
> > I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
> > as well since it mostly duplicates it anyway... Thanks for inquiring!
> 
> My first patches moved xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault,
> but that didn't work.  I wish I could remember why, though.

Maybe due to the additional check on IS_DAX(inode) in __xfs_filemap_fault()
which could do the wrong thing if the per-inode DAX flag is switched? Because
otherwise __xfs_filemap_fault(vmf, PE_SIZE_PTE, true) does exactly the same
thing as xfs_filemap_pfn_mkwrite() did.

If we do care about per-inode DAX flag switching, I can just fix up
xfs_filemap_pfn_mkwrite() but my understanding was that we ditched the idea
at least until someone comes up with a reliable way to implement that...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-26 Thread Dave Chinner
On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > > From: Christoph Hellwig 
> > > 
> > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > > blocks for writing and the inode is pinned, and has dirty fields other
> > > than the timestamps.
> > 
> > That's "fdatasync dirty", not "fsync dirty".
> 
> Correct.
> 
> > IOMAP_F_DIRTY needs a far better description of its semantics than
> > "/* block mapping is not yet on persistent storage */" so we know
> > exactly what filesystems are supposed to be implementing here. I
> > suspect that what it really is meant to say is:
> > 
> > /*
> >  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
> >  * written data and requires fdatasync to commit to persistent storage.
> >  */
> 
> I'll update the comment. Thanks!
> 
> > []
> > 
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index f179bdf1644d..b43be199fbdf 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -33,6 +33,7 @@
> > >  #include "xfs_error.h"
> > >  #include "xfs_trans.h"
> > >  #include "xfs_trans_space.h"
> > > +#include "xfs_inode_item.h"
> > >  #include "xfs_iomap.h"
> > >  #include "xfs_trace.h"
> > >  #include "xfs_icache.h"
> > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > >   trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > >   }
> > >  
> > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > + iomap->flags |= IOMAP_F_DIRTY;
> > 
> > This is the very definition of an inode that is "fdatasync dirty".
> > 
> > Hmmm, shouldn't this also be set for read faults, too?
> 
> No, read faults don't need to set IOMAP_F_DIRTY since the user cannot write any
> data to the page which he'd then like to be persistent. The only reason why
> I thought it could be useful for a while was that it would be nice to make
> MAP_SYNC mapping provide the guarantee that data you see now is the data
> you'll see after a crash

Isn't that the entire point of MAP_SYNC? i.e. That when we return
from a page fault, the app knows that the data and its underlying
extent is on persistent storage?

> but we cannot provide that guarantee for RO
> mapping anyway if someone else has the page mapped as well. So I just
> decided not to return IOMAP_F_DIRTY for read faults.

If there are multiple MAP_SYNC mappings to the inode, I would have
expected that they all sync all of the data/metadata on every page
fault, regardless of who dirtied the inode. An RO mapping doesn't
mean the data/metadata on the inode can't change, it just means it
can't change through that mapping.  Running fsync() to guarantee the
persistence of that data/metadata doesn't actually change any
data

IOWs, if read faults don't guarantee the mapped range has stable
extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
because it's not giving consistent guarantees to userspace. Yes, it
works fine when only one MAP_SYNC mapping is modifying the inode,
but the moment we have concurrent operations on the inode that
aren't MAP_SYNC or O_SYNC this goes out the window

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-26 Thread Jan Kara
On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > From: Christoph Hellwig 
> > 
> > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > blocks for writing and the inode is pinned, and has dirty fields other
> > than the timestamps.
> 
> That's "fdatasync dirty", not "fsync dirty".

Correct.

> IOMAP_F_DIRTY needs a far better description of its semantics than
> "/* block mapping is not yet on persistent storage */" so we know
> exactly what filesystems are supposed to be implementing here. I
> suspect that what it really is meant to say is:
> 
> /*
>  * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
>  * written data and requires fdatasync to commit to persistent storage.
>  */

I'll update the comment. Thanks!

> []
> 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index f179bdf1644d..b43be199fbdf 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -33,6 +33,7 @@
> >  #include "xfs_error.h"
> >  #include "xfs_trans.h"
> >  #include "xfs_trans_space.h"
> > +#include "xfs_inode_item.h"
> >  #include "xfs_iomap.h"
> >  #include "xfs_trace.h"
> >  #include "xfs_icache.h"
> > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > }
> >  
> > +   if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > +   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > +   iomap->flags |= IOMAP_F_DIRTY;
> 
> This is the very definition of an inode that is "fdatasync dirty".
> 
> Hmmm, shouldn't this also be set for read faults, too?

No, read faults don't need to set IOMAP_F_DIRTY since the user cannot write any
data to the page which he'd then like to be persistent. The only reason why
I thought it could be useful for a while was that it would be nice to make
MAP_SYNC mapping provide the guarantee that data you see now is the data
you'll see after a crash but we cannot provide that guarantee for RO
mapping anyway if someone else has the page mapped as well. So I just
decided not to return IOMAP_F_DIRTY for read faults.

But now that I look at XFS implementation again, it misses handling
of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
as well since it mostly duplicates it anyway... Thanks for inquiring!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-24 Thread Dave Chinner
On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig 
> 
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps.

That's "fdatasync dirty", not "fsync dirty".

IOMAP_F_DIRTY needs a far better description of its semantics than
"/* block mapping is not yet on persistent storage */" so we know
exactly what filesystems are supposed to be implementing here. I
suspect that what it really is meant to say is:

/*
 * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
 * written data and requires fdatasync to commit to persistent storage.
 */

[]

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index f179bdf1644d..b43be199fbdf 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -33,6 +33,7 @@
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_space.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_iomap.h"
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
> @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
>   trace_xfs_iomap_found(ip, offset, length, 0, &imap);
>   }
>  
> + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> + iomap->flags |= IOMAP_F_DIRTY;

This is the very definition of an inode that is "fdatasync dirty".

Hmmm, shouldn't this also be set for read faults, too?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-24 Thread Ross Zwisler
On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig 
> 
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps.  In __xfs_filemap_fault() we then detect this case
> and call dax_finish_sync_fault() to make sure all metadata is committed,
> and to insert the page table entry.
> 
> Note that this will also dirty the corresponding radix tree entry, which is
> what we want - fsync(2) will still provide data integrity guarantees for
> applications not using userspace flushing. And applications using
> userspace flushing can avoid calling fsync(2) and thus avoid the
> performance overhead.
> 
> [JK: Added VM_SYNC flag handling]
> 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Jan Kara 

I don't know enough about XFS dirty inode handling to be able to comment on
the changes in xfs_file_iomap_begin(), but the rest looks good.

Reviewed-by: Ross Zwisler 


[PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-24 Thread Jan Kara
From: Christoph Hellwig 

Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
blocks for writing and the inode is pinned, and has dirty fields other
than the timestamps.  In __xfs_filemap_fault() we then detect this case
and call dax_finish_sync_fault() to make sure all metadata is committed,
and to insert the page table entry.

Note that this will also dirty the corresponding radix tree entry, which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using
userspace flushing can avoid calling fsync(2) and thus avoid the
performance overhead.

[JK: Added VM_SYNC flag handling]

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/xfs/xfs_file.c  | 15 ++++++++++++++-
 fs/xfs/xfs_iomap.c |  5 +++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7c6b8def6eed..02093df4b314 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include <linux/mman.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -1040,7 +1041,11 @@ __xfs_filemap_fault(
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
-   ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
+   pfn_t pfn;
+
+   ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops);
+   if (ret & VM_FAULT_NEEDDSYNC)
+   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
} else {
if (write_fault)
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
@@ -1131,6 +1136,13 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
 {
+   /*
+* We don't support synchronous mappings for non-DAX files. At least
+* until someone comes with a sensible use case.
+*/
+   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+   return -EOPNOTSUPP;
+
file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(file_inode(filp)))
@@ -1149,6 +1161,7 @@ const struct file_operations xfs_file_operations = {
.compat_ioctl   = xfs_file_compat_ioctl,
 #endif
.mmap   = xfs_file_mmap,
+   .mmap_supported_flags = MAP_SYNC,
.open   = xfs_file_open,
.release= xfs_file_release,
.fsync  = xfs_file_fsync,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..b43be199fbdf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -33,6 +33,7 @@
 #include "xfs_error.h"
 #include "xfs_trans.h"
 #include "xfs_trans_space.h"
+#include "xfs_inode_item.h"
 #include "xfs_iomap.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
@@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
	trace_xfs_iomap_found(ip, offset, length, 0, &imap);
}
 
+   if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
+   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+   iomap->flags |= IOMAP_F_DIRTY;
+
	xfs_bmbt_to_iomap(ip, iomap, &imap);
 
if (shared)
-- 
2.12.3



Re: [PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-19 Thread Christoph Hellwig
> +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_SYNC)

I'd kill this define.  Also, is there any good reason that we have
to add LEGACY_MAP_MASK instead of assuming it's supported in the
core?

>  #endif
>   .mmap   = xfs_file_mmap,
> + .mmap_supported_flags   = XFS_MAP_SUPPORTED,
>   .open   = xfs_file_open,
>   .release= xfs_file_release,
>   .fsync  = xfs_file_fsync,

I usually either reformat all members to be aligned again, or, if that's
too much churn (in this case it probably is), just use a single space
before the = to minimize the alignment differences.

Otherwise your changes look good to me.


[PATCH 17/17] xfs: support for synchronous DAX faults

2017-10-19 Thread Jan Kara
From: Christoph Hellwig 

Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
blocks for writing and the inode is pinned, and has dirty fields other
than the timestamps.  In __xfs_filemap_fault() we then detect this case
and call dax_finish_sync_fault() to make sure all metadata is committed,
and to insert the page table entry.

Note that this will also dirty the corresponding radix tree entry, which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using
userspace flushing can avoid calling fsync(2) and thus avoid the
performance overhead.

[JK: Added VM_SYNC flag handling]

Signed-off-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 fs/xfs/xfs_file.c  | 17 ++++++++++++++++-
 fs/xfs/xfs_iomap.c |  5 +++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7c6b8def6eed..9b8058774af3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include <linux/mman.h>
 
 static const struct vm_operations_struct xfs_file_vm_ops;
 
@@ -1040,7 +1041,11 @@ __xfs_filemap_fault(
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
-   ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
+   pfn_t pfn;
+
+   ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops);
+   if (ret & VM_FAULT_NEEDDSYNC)
+   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
} else {
if (write_fault)
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
@@ -1131,6 +1136,13 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
 {
+   /*
+* We don't support synchronous mappings for non-DAX files. At least
+* until someone comes with a sensible use case.
+*/
+   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+   return -EOPNOTSUPP;
+
file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(file_inode(filp)))
@@ -1138,6 +1150,8 @@ xfs_file_mmap(
return 0;
 }
 
+#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_SYNC)
+
 const struct file_operations xfs_file_operations = {
.llseek = xfs_file_llseek,
.read_iter  = xfs_file_read_iter,
@@ -1149,6 +1163,7 @@ const struct file_operations xfs_file_operations = {
.compat_ioctl   = xfs_file_compat_ioctl,
 #endif
.mmap   = xfs_file_mmap,
+   .mmap_supported_flags   = XFS_MAP_SUPPORTED,
.open   = xfs_file_open,
.release= xfs_file_release,
.fsync  = xfs_file_fsync,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..b43be199fbdf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -33,6 +33,7 @@
 #include "xfs_error.h"
 #include "xfs_trans.h"
 #include "xfs_trans_space.h"
+#include "xfs_inode_item.h"
 #include "xfs_iomap.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
@@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
	trace_xfs_iomap_found(ip, offset, length, 0, &imap);
}
 
+   if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
+   (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+   iomap->flags |= IOMAP_F_DIRTY;
+
	xfs_bmbt_to_iomap(ip, iomap, &imap);
 
if (shared)
-- 
2.12.3
