Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Tue, Oct 31, 2017 at 02:50:01PM -0700, Dan Williams wrote: > On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara wrote: > > On Fri 27-10-17 12:08:34, Jan Kara wrote: > >> On Fri 27-10-17 08:16:11, Dave Chinner wrote: > >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > >> > > > > index f179bdf1644d..b43be199fbdf 100644 > >> > > > > --- a/fs/xfs/xfs_iomap.c > >> > > > > +++ b/fs/xfs/xfs_iomap.c > >> > > > > @@ -33,6 +33,7 @@ > >> > > > > #include "xfs_error.h" > >> > > > > #include "xfs_trans.h" > >> > > > > #include "xfs_trans_space.h" > >> > > > > +#include "xfs_inode_item.h" > >> > > > > #include "xfs_iomap.h" > >> > > > > #include "xfs_trace.h" > >> > > > > #include "xfs_icache.h" > >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > >> > > > > trace_xfs_iomap_found(ip, offset, length, 0, ); > >> > > > > } > >> > > > > > >> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > >> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > >> > > > > + iomap->flags |= IOMAP_F_DIRTY; > >> > > > > >> > > > This is the very definition of an inode that is "fdatasync dirty". > >> > > > > >> > > > Hmm, shouldn't this also be set for read faults, too? > >> > > > >> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot > >> > > write any > >> > > data to the page which he'd then like to be persistent. The only > >> > > reason why > >> > > I thought it could be useful for a while was that it would be nice to > >> > > make > >> > > MAP_SYNC mapping provide the guarantee that data you see now is the > >> > > data > >> > > you'll see after a crash > >> > > >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return > >> > from a page fault, the app knows that the data and its underlying > >> > extent is on persistent storage? > >> > > >> > > but we cannot provide that guarantee for RO > >> > > mapping anyway if someone else has the page mapped as well. 
So I just > >> > > decided not to return IOMAP_F_DIRTY for read faults. > >> > > >> > If there are multiple MAP_SYNC mappings to the inode, I would have > >> > expected that they all sync all of the data/metadata on every page > >> > fault, regardless of who dirtied the inode. An RO mapping doesn't > >> > >> Well, they all do sync regardless of who dirtied the inode on every *write* > >> fault. > >> > >> > mean the data/metadata on the inode can't change, it just means it > >> > can't change through that mapping. Running fsync() to guarantee the > >> > persistence of that data/metadata doesn't actually changing any > >> > data > >> > > >> > IOWs, if read faults don't guarantee the mapped range has stable > >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken > >> > because it's not giving consistent guarantees to userspace. Yes, it > >> > works fine when only one MAP_SYNC mapping is modifying the inode, > >> > but the moment we have concurrent operations on the inode that > >> > aren't MAP_SYNC or O_SYNC this goes out the window > >> > >> MAP_SYNC as I've implemented it provides guarantees only for data the > >> process has actually written. I agree with that and it was a conscious > >> decision. In my opinion that covers most usecases, provides reasonably > >> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can > >> persist it just using CPU instructions), and reasonable performance. > >> > >> Now you seem to suggest the semantics should be: "Data you have read from > >> or > >> written to a MAP_SYNC mapping can be persisted using CPU instructions." And > >> from implementation POV we can do that rather easily (just rip out the > >> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would > >> be useful enough to justify the slowdown of read faults? I was not able to > >> come up with a good usecase and so I've decided for current semantics. What > >> do other people think? 
> > > > Nobody commented on this for a couple of days so how do we proceed? I would > > prefer to go just with a guarantee for data written and we can always make > > the guarantee stronger (i.e. apply it also for read data) when some user > > comes with a good usecase? > > I think it is easier to strengthen the guarantee than loosen it later > especially since it is not yet clear that we have a use case for the > stronger semantic. At least the initial motivation for MAP_SYNC was > for writers. I agree. It seems like all threads/processes in a given application need to use MAP_SYNC consistently so they can be sure that data that is written (and then possibly read) will be durable on media. I think what you have is a good starting point, and we can adjust later if necessary. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Tue, Oct 31, 2017 at 8:19 AM, Jan Karawrote: > On Fri 27-10-17 12:08:34, Jan Kara wrote: >> On Fri 27-10-17 08:16:11, Dave Chinner wrote: >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c >> > > > > index f179bdf1644d..b43be199fbdf 100644 >> > > > > --- a/fs/xfs/xfs_iomap.c >> > > > > +++ b/fs/xfs/xfs_iomap.c >> > > > > @@ -33,6 +33,7 @@ >> > > > > #include "xfs_error.h" >> > > > > #include "xfs_trans.h" >> > > > > #include "xfs_trans_space.h" >> > > > > +#include "xfs_inode_item.h" >> > > > > #include "xfs_iomap.h" >> > > > > #include "xfs_trace.h" >> > > > > #include "xfs_icache.h" >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( >> > > > > trace_xfs_iomap_found(ip, offset, length, 0, ); >> > > > > } >> > > > > >> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && >> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) >> > > > > + iomap->flags |= IOMAP_F_DIRTY; >> > > > >> > > > This is the very definition of an inode that is "fdatasync dirty". >> > > > >> > > > H, shouldn't this also be set for read faults, too? >> > > >> > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write >> > > any >> > > data to the page which he'd then like to be persistent. The only reason >> > > why >> > > I thought it could be useful for a while was that it would be nice to >> > > make >> > > MAP_SYNC mapping provide the guarantee that data you see now is the data >> > > you'll see after a crash >> > >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return >> > from a page fault, the app knows that the data and it's underlying >> > extent is on persistent storage? >> > >> > > but we cannot provide that guarantee for RO >> > > mapping anyway if someone else has the page mapped as well. So I just >> > > decided not to return IOMAP_F_DIRTY for read faults. 
>> > >> > If there are multiple MAP_SYNC mappings to the inode, I would have >> > expected that they all sync all of the data/metadata on every page >> > fault, regardless of who dirtied the inode. An RO mapping doesn't >> >> Well, they all do sync regardless of who dirtied the inode on every *write* >> fault. >> >> > mean the data/metadata on the inode can't change, it just means it >> > can't change through that mapping. Running fsync() to guarantee the >> > persistence of that data/metadata doesn't actually changing any >> > data >> > >> > IOWs, if read faults don't guarantee the mapped range has stable >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken >> > because it's not giving consistent guarantees to userspace. Yes, it >> > works fine when only one MAP_SYNC mapping is modifying the inode, >> > but the moment we have concurrent operations on the inode that >> > aren't MAP_SYNC or O_SYNC this goes out the window >> >> MAP_SYNC as I've implemented it provides guarantees only for data the >> process has actually written. I agree with that and it was a conscious >> decision. In my opinion that covers most usecases, provides reasonably >> simple semantics (i.e., if you write data through MAP_SYNC mapping, you can >> persist it just using CPU instructions), and reasonable performance. >> >> Now you seem to suggest the semantics should be: "Data you have read from or >> written to a MAP_SYNC mapping can be persisted using CPU instructions." And >> from implementation POV we can do that rather easily (just rip out the >> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would >> be useful enough to justify the slowdown of read faults? I was not able to >> come up with a good usecase and so I've decided for current semantics. What >> do other people think? > > Nobody commented on this for couple of days so how do we proceed? 
I would > prefer to go just with a guarantee for data written and we can always make > the guarantee stronger (i.e. apply it also for read data) when some user > comes with a good usecase? I think it is easier to strengthen the guarantee than loosen it later especially since it is not yet clear that we have a use case for the stronger semantic. At least the initial motivation for MAP_SYNC was for writers.
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Fri 27-10-17 12:08:34, Jan Kara wrote: > On Fri 27-10-17 08:16:11, Dave Chinner wrote: > > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > > > index f179bdf1644d..b43be199fbdf 100644 > > > > > --- a/fs/xfs/xfs_iomap.c > > > > > +++ b/fs/xfs/xfs_iomap.c > > > > > @@ -33,6 +33,7 @@ > > > > > #include "xfs_error.h" > > > > > #include "xfs_trans.h" > > > > > #include "xfs_trans_space.h" > > > > > +#include "xfs_inode_item.h" > > > > > #include "xfs_iomap.h" > > > > > #include "xfs_trace.h" > > > > > #include "xfs_icache.h" > > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > > > trace_xfs_iomap_found(ip, offset, length, 0, ); > > > > > } > > > > > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > > > > > H, shouldn't this also be set for read faults, too? > > > > > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write > > > any > > > data to the page which he'd then like to be persistent. The only reason > > > why > > > I thought it could be useful for a while was that it would be nice to make > > > MAP_SYNC mapping provide the guarantee that data you see now is the data > > > you'll see after a crash > > > > Isn't that the entire point of MAP_SYNC? i.e. That when we return > > from a page fault, the app knows that the data and it's underlying > > extent is on persistent storage? > > > > > but we cannot provide that guarantee for RO > > > mapping anyway if someone else has the page mapped as well. So I just > > > decided not to return IOMAP_F_DIRTY for read faults. 
> > > > If there are multiple MAP_SYNC mappings to the inode, I would have > > expected that they all sync all of the data/metadata on every page > > fault, regardless of who dirtied the inode. An RO mapping doesn't > > Well, they all do sync regardless of who dirtied the inode on every *write* > fault. > > > mean the data/metadata on the inode can't change, it just means it > > can't change through that mapping. Running fsync() to guarantee the > > persistence of that data/metadata doesn't actually changing any > > data > > > > IOWs, if read faults don't guarantee the mapped range has stable > > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken > > because it's not giving consistent guarantees to userspace. Yes, it > > works fine when only one MAP_SYNC mapping is modifying the inode, > > but the moment we have concurrent operations on the inode that > > aren't MAP_SYNC or O_SYNC this goes out the window > > MAP_SYNC as I've implemented it provides guarantees only for data the > process has actually written. I agree with that and it was a conscious > decision. In my opinion that covers most usecases, provides reasonably > simple semantics (i.e., if you write data through MAP_SYNC mapping, you can > persist it just using CPU instructions), and reasonable performance. > > Now you seem to suggest the semantics should be: "Data you have read from or > written to a MAP_SYNC mapping can be persisted using CPU instructions." And > from implementation POV we can do that rather easily (just rip out the > IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would > be useful enough to justify the slowdown of read faults? I was not able to > come up with a good usecase and so I've decided for current semantics. What > do other people think? Nobody commented on this for couple of days so how do we proceed? I would prefer to go just with a guarantee for data written and we can always make the guarantee stronger (i.e. 
apply it also for read data) when some user comes with a good usecase? Honza -- Jan Kara, SUSE Labs, CR
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Fri 27-10-17 08:16:11, Dave Chinner wrote: > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > > index f179bdf1644d..b43be199fbdf 100644 > > > > --- a/fs/xfs/xfs_iomap.c > > > > +++ b/fs/xfs/xfs_iomap.c > > > > @@ -33,6 +33,7 @@ > > > > #include "xfs_error.h" > > > > #include "xfs_trans.h" > > > > #include "xfs_trans_space.h" > > > > +#include "xfs_inode_item.h" > > > > #include "xfs_iomap.h" > > > > #include "xfs_trace.h" > > > > #include "xfs_icache.h" > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > > trace_xfs_iomap_found(ip, offset, length, 0, ); > > > > } > > > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > > > H, shouldn't this also be set for read faults, too? > > > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > > data to the page which he'd then like to be persistent. The only reason why > > I thought it could be useful for a while was that it would be nice to make > > MAP_SYNC mapping provide the guarantee that data you see now is the data > > you'll see after a crash > > Isn't that the entire point of MAP_SYNC? i.e. That when we return > from a page fault, the app knows that the data and it's underlying > extent is on persistent storage? > > > but we cannot provide that guarantee for RO > > mapping anyway if someone else has the page mapped as well. So I just > > decided not to return IOMAP_F_DIRTY for read faults. > > If there are multiple MAP_SYNC mappings to the inode, I would have > expected that they all sync all of the data/metadata on every page > fault, regardless of who dirtied the inode. An RO mapping doesn't Well, they all do sync regardless of who dirtied the inode on every *write* fault. 
> mean the data/metadata on the inode can't change, it just means it > can't change through that mapping. Running fsync() to guarantee the > persistence of that data/metadata doesn't actually change any > data > > IOWs, if read faults don't guarantee the mapped range has stable > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken > because it's not giving consistent guarantees to userspace. Yes, it > works fine when only one MAP_SYNC mapping is modifying the inode, > but the moment we have concurrent operations on the inode that > aren't MAP_SYNC or O_SYNC this goes out the window. MAP_SYNC as I've implemented it provides guarantees only for data the process has actually written. I agree with that and it was a conscious decision. In my opinion that covers most usecases, provides reasonably simple semantics (i.e., if you write data through MAP_SYNC mapping, you can persist it just using CPU instructions), and reasonable performance. Now you seem to suggest the semantics should be: "Data you have read from or written to a MAP_SYNC mapping can be persisted using CPU instructions." And from implementation POV we can do that rather easily (just rip out the IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would be useful enough to justify the slowdown of read faults? I was not able to come up with a good usecase and so I've decided for current semantics. What do other people think? And now that I've spelled out exact semantics I don't think your comparison that you can fsync() data you didn't write quite matches - with MAP_SYNC you will have to at least read the data to be able to persist it and you don't have that requirement for fsync() either... Honza -- Jan Kara, SUSE Labs, CR
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Fri 27-10-17 08:43:01, Christoph Hellwig wrote: > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > > But now that I look at XFS implementation again, it misses handling > > of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right). > > I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite() > > as well since it mostly duplicates it anyway... Thanks for inquiring! > > My first patches moved xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault, > but that didn't work. Wish I'd remember why, though. Maybe due to the additional check on IS_DAX(inode) in __xfs_filemap_fault() which could do the wrong thing if per-inode DAX flag is switched? Because otherwise __xfs_filemap_fault(vmf, PE_SIZE_PTE, true) does exactly the same thing as xfs_filemap_pfn_mkwrite() did. If we do care about per-inode DAX flag switching, I can just fixup xfs_filemap_pfn_mkwrite() but my understanding was that we ditched the idea at least until someone comes up with a reliable way to implement that... Honza -- Jan Kara, SUSE Labs, CR
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > On Wed 25-10-17 09:23:22, Dave Chinner wrote: > > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > > > From: Christoph Hellwig> > > > > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > > > blocks for writing and the inode is pinned, and has dirty fields other > > > than the timestamps. > > > > That's "fdatasync dirty", not "fsync dirty". > > Correct. > > > IOMAP_F_DIRTY needs a far better description of it's semantics than > > "/* block mapping is not yet on persistent storage */" so we know > > exactly what filesystems are supposed to be implementing here. I > > suspect that what it really is meant to say is: > > > > /* > > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to > > * written data and requires fdatasync to commit to persistent storage. > > */ > > I'll update the comment. Thanks! > > > [] > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index f179bdf1644d..b43be199fbdf 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -33,6 +33,7 @@ > > > #include "xfs_error.h" > > > #include "xfs_trans.h" > > > #include "xfs_trans_space.h" > > > +#include "xfs_inode_item.h" > > > #include "xfs_iomap.h" > > > #include "xfs_trace.h" > > > #include "xfs_icache.h" > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > trace_xfs_iomap_found(ip, offset, length, 0, ); > > > } > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > H, shouldn't this also be set for read faults, too? > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > data to the page which he'd then like to be persistent. 
The only reason why > I thought it could be useful for a while was that it would be nice to make > MAP_SYNC mapping provide the guarantee that data you see now is the data > you'll see after a crash. Isn't that the entire point of MAP_SYNC? i.e. That when we return from a page fault, the app knows that the data and its underlying extent is on persistent storage? > but we cannot provide that guarantee for RO > mapping anyway if someone else has the page mapped as well. So I just > decided not to return IOMAP_F_DIRTY for read faults. If there are multiple MAP_SYNC mappings to the inode, I would have expected that they all sync all of the data/metadata on every page fault, regardless of who dirtied the inode. An RO mapping doesn't mean the data/metadata on the inode can't change, it just means it can't change through that mapping. Running fsync() to guarantee the persistence of that data/metadata doesn't actually change any data. IOWs, if read faults don't guarantee the mapped range has stable extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken because it's not giving consistent guarantees to userspace. Yes, it works fine when only one MAP_SYNC mapping is modifying the inode, but the moment we have concurrent operations on the inode that aren't MAP_SYNC or O_SYNC this goes out the window. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Wed 25-10-17 09:23:22, Dave Chinner wrote: > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > > From: Christoph Hellwig> > > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > > blocks for writing and the inode is pinned, and has dirty fields other > > than the timestamps. > > That's "fdatasync dirty", not "fsync dirty". Correct. > IOMAP_F_DIRTY needs a far better description of it's semantics than > "/* block mapping is not yet on persistent storage */" so we know > exactly what filesystems are supposed to be implementing here. I > suspect that what it really is meant to say is: > > /* > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to > * written data and requires fdatasync to commit to persistent storage. > */ I'll update the comment. Thanks! > [] > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index f179bdf1644d..b43be199fbdf 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -33,6 +33,7 @@ > > #include "xfs_error.h" > > #include "xfs_trans.h" > > #include "xfs_trans_space.h" > > +#include "xfs_inode_item.h" > > #include "xfs_iomap.h" > > #include "xfs_trace.h" > > #include "xfs_icache.h" > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > trace_xfs_iomap_found(ip, offset, length, 0, ); > > } > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > + iomap->flags |= IOMAP_F_DIRTY; > > This is the very definition of an inode that is "fdatasync dirty". > > H, shouldn't this also be set for read faults, too? No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any data to the page which he'd then like to be persistent. 
The only reason why I thought it could be useful for a while was that it would be nice to make MAP_SYNC mapping provide the guarantee that data you see now is the data you'll see after a crash, but we cannot provide that guarantee for RO mapping anyway if someone else has the page mapped as well. So I just decided not to return IOMAP_F_DIRTY for read faults. But now that I look at XFS implementation again, it misses handling of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right). I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite() as well since it mostly duplicates it anyway... Thanks for inquiring! Honza -- Jan Kara, SUSE Labs, CR
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig>
>
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps.

That's "fdatasync dirty", not "fsync dirty".

IOMAP_F_DIRTY needs a far better description of its semantics than "/* block mapping is not yet on persistent storage */" so we know exactly what filesystems are supposed to be implementing here. I suspect that what it really is meant to say is:

/*
 * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
 * written data and requires fdatasync to commit to persistent storage.
 */

[]

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index f179bdf1644d..b43be199fbdf 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -33,6 +33,7 @@
>  #include "xfs_error.h"
>  #include "xfs_trans.h"
>  #include "xfs_trans_space.h"
> +#include "xfs_inode_item.h"
>  #include "xfs_iomap.h"
>  #include "xfs_trace.h"
>  #include "xfs_icache.h"
> @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> 		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> 	}
>
> +	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> +	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> +		iomap->flags |= IOMAP_F_DIRTY;

This is the very definition of an inode that is "fdatasync dirty".

Hmm, shouldn't this also be set for read faults, too?

Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > From: Christoph Hellwig> > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > blocks for writing and the inode is pinned, and has dirty fields other > than the timestamps. In __xfs_filemap_fault() we then detect this case > and call dax_finish_sync_fault() to make sure all metadata is committed, > and to insert the page table entry. > > Note that this will also dirty corresponding radix tree entry which is > what we want - fsync(2) will still provide data integrity guarantees for > applications not using userspace flushing. And applications using > userspace flushing can avoid calling fsync(2) and thus avoid the > performance overhead. > > [JK: Added VM_SYNC flag handling] > > Signed-off-by: Christoph Hellwig > Signed-off-by: Jan Kara I don't know enough about XFS dirty inode handling to be able to comment on the changes in xfs_file_iomap_begin(), but the rest looks good. Reviewed-by: Ross Zwisler
[PATCH 17/17] xfs: support for synchronous DAX faults
From: Christoph Hellwig

Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare blocks for writing and the inode is pinned, and has dirty fields other than the timestamps. In __xfs_filemap_fault() we then detect this case and call dax_finish_sync_fault() to make sure all metadata is committed, and to insert the page table entry.

Note that this will also dirty corresponding radix tree entry which is what we want - fsync(2) will still provide data integrity guarantees for applications not using userspace flushing. And applications using userspace flushing can avoid calling fsync(2) and thus avoid the performance overhead.

[JK: Added VM_SYNC flag handling]

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara
---
 fs/xfs/xfs_file.c  | 15 ++++++++++++++-
 fs/xfs/xfs_iomap.c |  5 +++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7c6b8def6eed..02093df4b314 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include

 static const struct vm_operations_struct xfs_file_vm_ops;

@@ -1040,7 +1041,11 @@ __xfs_filemap_fault(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	if (IS_DAX(inode)) {
-		ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
+		pfn_t pfn;
+
+		ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops);
+		if (ret & VM_FAULT_NEEDDSYNC)
+			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 	} else {
 		if (write_fault)
 			ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
@@ -1131,6 +1136,13 @@ xfs_file_mmap(
 	struct file	*filp,
 	struct vm_area_struct *vma)
 {
+	/*
+	 * We don't support synchronous mappings for non-DAX files. At least
+	 * until someone comes with a sensible use case.
+	 */
+	if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+		return -EOPNOTSUPP;
+
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
@@ -1149,6 +1161,7 @@ const struct file_operations xfs_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.mmap		= xfs_file_mmap,
+	.mmap_supported_flags = MAP_SYNC,
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..b43be199fbdf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -33,6 +33,7 @@
 #include "xfs_error.h"
 #include "xfs_trans.h"
 #include "xfs_trans_space.h"
+#include "xfs_inode_item.h"
 #include "xfs_iomap.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
@@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
 		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
 	}

+	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
+	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+		iomap->flags |= IOMAP_F_DIRTY;
+
 	xfs_bmbt_to_iomap(ip, iomap, &imap);

 	if (shared)
-- 
2.12.3
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
> +#define XFS_MAP_SUPPORTED	(LEGACY_MAP_MASK | MAP_SYNC)

I'd kill this define. Also is there any good reason that we have to add LEGACY_MAP_MASK instead of assuming it's supported in the core?

> #endif
> 	.mmap		= xfs_file_mmap,
> +	.mmap_supported_flags = XFS_MAP_SUPPORTED,
> 	.open		= xfs_file_open,
> 	.release	= xfs_file_release,
> 	.fsync		= xfs_file_fsync,

I usually either reformat all members to be aligned again, or if that's too much churn (in this case it probably is) just use a single space before the = to minimize the alignment differences. Otherwise your changes look good to me.
[PATCH 17/17] xfs: support for synchronous DAX faults
From: Christoph Hellwig

Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare blocks for writing and the inode is pinned, and has dirty fields other than the timestamps. In __xfs_filemap_fault() we then detect this case and call dax_finish_sync_fault() to make sure all metadata is committed, and to insert the page table entry.

Note that this will also dirty corresponding radix tree entry which is what we want - fsync(2) will still provide data integrity guarantees for applications not using userspace flushing. And applications using userspace flushing can avoid calling fsync(2) and thus avoid the performance overhead.

[JK: Added VM_SYNC flag handling]

Signed-off-by: Christoph Hellwig
Signed-off-by: Jan Kara
---
 fs/xfs/xfs_file.c  | 17 ++++++++++++++++-
 fs/xfs/xfs_iomap.c |  5 +++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7c6b8def6eed..9b8058774af3 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,6 +44,7 @@
 #include
 #include
 #include
+#include

 static const struct vm_operations_struct xfs_file_vm_ops;

@@ -1040,7 +1041,11 @@ __xfs_filemap_fault(
 	xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 	if (IS_DAX(inode)) {
-		ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
+		pfn_t pfn;
+
+		ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops);
+		if (ret & VM_FAULT_NEEDDSYNC)
+			ret = dax_finish_sync_fault(vmf, pe_size, pfn);
 	} else {
 		if (write_fault)
 			ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
@@ -1131,6 +1136,13 @@ xfs_file_mmap(
 	struct file	*filp,
 	struct vm_area_struct *vma)
 {
+	/*
+	 * We don't support synchronous mappings for non-DAX files. At least
+	 * until someone comes with a sensible use case.
+	 */
+	if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+		return -EOPNOTSUPP;
+
 	file_accessed(filp);
 	vma->vm_ops = &xfs_file_vm_ops;
 	if (IS_DAX(file_inode(filp)))
@@ -1138,6 +1150,8 @@ xfs_file_mmap(
 	return 0;
 }

+#define XFS_MAP_SUPPORTED	(LEGACY_MAP_MASK | MAP_SYNC)
+
 const struct file_operations xfs_file_operations = {
 	.llseek		= xfs_file_llseek,
 	.read_iter	= xfs_file_read_iter,
@@ -1149,6 +1163,7 @@ const struct file_operations xfs_file_operations = {
 	.compat_ioctl	= xfs_file_compat_ioctl,
 #endif
 	.mmap		= xfs_file_mmap,
+	.mmap_supported_flags = XFS_MAP_SUPPORTED,
 	.open		= xfs_file_open,
 	.release	= xfs_file_release,
 	.fsync		= xfs_file_fsync,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..b43be199fbdf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -33,6 +33,7 @@
 #include "xfs_error.h"
 #include "xfs_trans.h"
 #include "xfs_trans_space.h"
+#include "xfs_inode_item.h"
 #include "xfs_iomap.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
@@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
 		trace_xfs_iomap_found(ip, offset, length, 0, &imap);
 	}

+	if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
+	    (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+		iomap->flags |= IOMAP_F_DIRTY;
+
 	xfs_bmbt_to_iomap(ip, iomap, &imap);

 	if (shared)
-- 
2.12.3