Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-24 Thread Jan Kara
On Thu 24-08-17 05:31:26, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> > dirty corresponding radix tree entry which is what we want - fsync(2)
> > will still provide data integrity guarantees for applications not using
> > userspace flushing. And applications using userspace flushing can avoid
> > calling fsync(2) and thus avoid the performance overhead.
> 
> Why is this only wiered up for the huge_fault handler and not the
> regular?

We do handle both. Just ext4 naming is a bit confusing and ext4_dax_fault()
uses ext4_dax_huge_fault() for handling.

Honza
-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-24 Thread Christoph Hellwig
On Thu, Aug 24, 2017 at 05:31:26AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> > dirty corresponding radix tree entry which is what we want - fsync(2)
> > will still provide data integrity guarantees for applications not using
> > userspace flushing. And applications using userspace flushing can avoid
> > calling fsync(2) and thus avoid the performance overhead.
> 
> Why is this only wiered up for the huge_fault handler and not the
> regular?

Ah, turns out ext4 implements ->fault in terms of ->huge_fault.

We'll really need to sort out this mess of fault handlers before
doing too much surgery here..
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-24 Thread Christoph Hellwig
On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> dirty corresponding radix tree entry which is what we want - fsync(2)
> will still provide data integrity guarantees for applications not using
> userspace flushing. And applications using userspace flushing can avoid
> calling fsync(2) and thus avoid the performance overhead.

Why is this only wiered up for the huge_fault handler and not the
regular?
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-24 Thread Jan Kara
On Wed 23-08-17 11:37:14, Christoph Hellwig wrote:
> > +   pfn_t pfn;
> >  
> > if (write) {
> > sb_start_pagefault(sb);
> > @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
> > down_read(_I(inode)->i_mmap_sem);
> > handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
> >EXT4_DATA_TRANS_BLOCKS(sb));
> > +   if (IS_ERR(handle)) {
> > +   up_read(_I(inode)->i_mmap_sem);
> > +   sb_end_pagefault(sb);
> > +   return VM_FAULT_SIGBUS;
> > +   }
> > } else {
> > down_read(_I(inode)->i_mmap_sem);
> > }
> > -   if (!IS_ERR(handle))
> > -   result = dax_iomap_fault(vmf, pe_size, _iomap_ops, NULL);
> > -   else
> > -   result = VM_FAULT_SIGBUS;
> > +   result = dax_iomap_fault(vmf, pe_size, _iomap_ops, );
> 
> Maybe split the error handling refactor into a simple prep patch to
> make this one more readable?

OK, will do.

> > +   /* Write fault but PFN mapped only RO? */
> > +   if (result & VM_FAULT_NEEDDSYNC) {
> > +   int err;
> > +   loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> > +   size_t len = 0;
> > +
> > +   if (pe_size == PE_SIZE_PTE)
> > +   len = PAGE_SIZE;
> > +#ifdef CONFIG_FS_DAX_PMD
> > +   else if (pe_size == PE_SIZE_PMD)
> > +   len = HPAGE_PMD_SIZE;
> > +#endif
> > +   else
> > +   WARN_ON_ONCE(1);
> > +   err = vfs_fsync_range(vmf->vma->vm_file, start,
> > + start + len - 1, 1);
> > +   if (err)
> > +   result = VM_FAULT_SIGBUS;
> > +   else
> > +   result = dax_insert_pfn_mkwrite(vmf, pe_size,
> > +   pfn);
> > +   }
> 
> I think this needs to become a helper exported from the DAX code,
> way too much magic inside the file system as-is.

Good point, there isn't anything fs specific in there.

Honza

-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-23 Thread Christoph Hellwig
> + pfn_t pfn;
>  
>   if (write) {
>   sb_start_pagefault(sb);
> @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>   down_read(_I(inode)->i_mmap_sem);
>   handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
>  EXT4_DATA_TRANS_BLOCKS(sb));
> + if (IS_ERR(handle)) {
> + up_read(_I(inode)->i_mmap_sem);
> + sb_end_pagefault(sb);
> + return VM_FAULT_SIGBUS;
> + }
>   } else {
>   down_read(_I(inode)->i_mmap_sem);
>   }
> - if (!IS_ERR(handle))
> - result = dax_iomap_fault(vmf, pe_size, _iomap_ops, NULL);
> - else
> - result = VM_FAULT_SIGBUS;
> + result = dax_iomap_fault(vmf, pe_size, _iomap_ops, );

Maybe split the error handling refactor into a simple prep patch to
make this one more readable?

> + /* Write fault but PFN mapped only RO? */
> + if (result & VM_FAULT_NEEDDSYNC) {
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = HPAGE_PMD_SIZE;
> +#endif
> + else
> + WARN_ON_ONCE(1);
> + err = vfs_fsync_range(vmf->vma->vm_file, start,
> +   start + len - 1, 1);
> + if (err)
> + result = VM_FAULT_SIGBUS;
> + else
> + result = dax_insert_pfn_mkwrite(vmf, pe_size,
> + pfn);
> + }

I think this needs to become a helper exported from the DAX code,
way too much magic inside the file system as-is.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-22 Thread Jan Kara
On Mon 21-08-17 13:19:48, Ross Zwisler wrote:
> On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> 
> Need to fix up the above line a little -
> s/dax_pfn_mkwrite/dax_insert_pfn_mkwrite/, and we insert the PTE as well as
> make it writeable.

Fixed up, thanks.

> > if (write) {
> > -   if (!IS_ERR(handle))
> > -   ext4_journal_stop(handle);
> > +   ext4_journal_stop(handle);
> > +   /* Write fault but PFN mapped only RO? */
> 
> The above comment is out of date.

Fixed.

> > +   if (result & VM_FAULT_NEEDDSYNC) {
> > +   int err;
> > +   loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> > +   size_t len = 0;
> > +
> > +   if (pe_size == PE_SIZE_PTE)
> > +   len = PAGE_SIZE;
> > +#ifdef CONFIG_FS_DAX_PMD
> > +   else if (pe_size == PE_SIZE_PMD)
> > +   len = HPAGE_PMD_SIZE;
> 
> In fs/dax.c we always use PMD_SIZE.  It looks like HPAGE_PMD_SIZE and PMD_SIZE
> are always the same (from include/linux/huge_mm.h, the only defintion of
> HPAGE_PMD_SIZE):
> 
> #define HPAGE_PMD_SHIFT PMD_SHIFT
> #define HPAGE_PMD_SIZE((1UL) << HPAGE_PMD_SHIFT)
> 
> and AFAICT PMD_SIZE is defined to be 1< well.  I don't understand why we have both?
> 
> In any case, neither HPAGE_PMD_SIZE nor PMD_SIZE are used anywhere else in the
> ext4 code, so can we use PMD_SIZE here for consistency?  If they ever did
> manage to be different, I think we'd want PMD_SIZE anyway.

Yeah, I've changed that to PMD_SIZE.

> With those nits and an updated changelog:
> 
> Reviewed-by: Ross Zwisler 

Thanks!

Honza

-- 
Jan Kara 
SUSE Labs, CR
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-21 Thread Ross Zwisler
On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also

Need to fix up the above line a little -
s/dax_pfn_mkwrite/dax_insert_pfn_mkwrite/, and we insert the PTE as well as
make it writeable.

> dirty corresponding radix tree entry which is what we want - fsync(2)
> will still provide data integrity guarantees for applications not using
> userspace flushing. And applications using userspace flushing can avoid
> calling fsync(2) and thus avoid the performance overhead.
> 
> Signed-off-by: Jan Kara 
> ---
>  fs/ext4/file.c   | 36 ++--
>  fs/ext4/inode.c  |  4 
>  fs/jbd2/journal.c| 17 +
>  include/linux/jbd2.h |  1 +
>  4 files changed, 52 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 850037e140d7..3765c4ed1368 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -280,6 +280,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>   struct inode *inode = file_inode(vmf->vma->vm_file);
>   struct super_block *sb = inode->i_sb;
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
> + pfn_t pfn;
>  
>   if (write) {
>   sb_start_pagefault(sb);
> @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>   down_read(_I(inode)->i_mmap_sem);
>   handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
>  EXT4_DATA_TRANS_BLOCKS(sb));
> + if (IS_ERR(handle)) {
> + up_read(_I(inode)->i_mmap_sem);
> + sb_end_pagefault(sb);
> + return VM_FAULT_SIGBUS;
> + }
>   } else {
>   down_read(_I(inode)->i_mmap_sem);
>   }
> - if (!IS_ERR(handle))
> - result = dax_iomap_fault(vmf, pe_size, _iomap_ops, NULL);
> - else
> - result = VM_FAULT_SIGBUS;
> + result = dax_iomap_fault(vmf, pe_size, _iomap_ops, );
>   if (write) {
> - if (!IS_ERR(handle))
> - ext4_journal_stop(handle);
> + ext4_journal_stop(handle);
> + /* Write fault but PFN mapped only RO? */

The above comment is out of date.

> + if (result & VM_FAULT_NEEDDSYNC) {
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = HPAGE_PMD_SIZE;

In fs/dax.c we always use PMD_SIZE.  It looks like HPAGE_PMD_SIZE and PMD_SIZE
are always the same (from include/linux/huge_mm.h, the only defintion of
HPAGE_PMD_SIZE):

#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE  ((1UL) << HPAGE_PMD_SHIFT)

and AFAICT PMD_SIZE is defined to be 1<
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-17 Thread Jan Kara
We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
synchronous write fault when inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case,
call vfs_fsync_range() to make sure all metadata is committed, and call
dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
dirty corresponding radix tree entry which is what we want - fsync(2)
will still provide data integrity guarantees for applications not using
userspace flushing. And applications using userspace flushing can avoid
calling fsync(2) and thus avoid the performance overhead.

Signed-off-by: Jan Kara 
---
 fs/ext4/file.c   | 36 ++--
 fs/ext4/inode.c  |  4 
 fs/jbd2/journal.c| 17 +
 include/linux/jbd2.h |  1 +
 4 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 850037e140d7..3765c4ed1368 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -280,6 +280,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
struct inode *inode = file_inode(vmf->vma->vm_file);
struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+   pfn_t pfn;
 
if (write) {
sb_start_pagefault(sb);
@@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
down_read(_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
   EXT4_DATA_TRANS_BLOCKS(sb));
+   if (IS_ERR(handle)) {
+   up_read(_I(inode)->i_mmap_sem);
+   sb_end_pagefault(sb);
+   return VM_FAULT_SIGBUS;
+   }
} else {
down_read(_I(inode)->i_mmap_sem);
}
-   if (!IS_ERR(handle))
-   result = dax_iomap_fault(vmf, pe_size, _iomap_ops, NULL);
-   else
-   result = VM_FAULT_SIGBUS;
+   result = dax_iomap_fault(vmf, pe_size, _iomap_ops, );
if (write) {
-   if (!IS_ERR(handle))
-   ext4_journal_stop(handle);
+   ext4_journal_stop(handle);
+   /* Write fault but PFN mapped only RO? */
+   if (result & VM_FAULT_NEEDDSYNC) {
+   int err;
+   loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
+   size_t len = 0;
+
+   if (pe_size == PE_SIZE_PTE)
+   len = PAGE_SIZE;
+#ifdef CONFIG_FS_DAX_PMD
+   else if (pe_size == PE_SIZE_PMD)
+   len = HPAGE_PMD_SIZE;
+#endif
+   else
+   WARN_ON_ONCE(1);
+   err = vfs_fsync_range(vmf->vma->vm_file, start,
+ start + len - 1, 1);
+   if (err)
+   result = VM_FAULT_SIGBUS;
+   else
+   result = dax_insert_pfn_mkwrite(vmf, pe_size,
+   pfn);
+   }
up_read(_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c600f02673f..7a7529c3f0c8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3429,6 +3429,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t 
offset, loff_t length,
}
 
iomap->flags = 0;
+   if ((flags & IOMAP_WRITE) &&
+   !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
+   EXT4_I(inode)->i_datasync_tid))
+   iomap->flags |= IOMAP_F_NEEDDSYNC;
bdev = inode->i_sb->s_bdev;
iomap->bdev = bdev;
if (blk_queue_dax(bdev->bd_queue))
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..fa8cde498b4b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,23 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
return err;
 }
 
+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+   int ret = 1;
+
+   read_lock(>j_state_lock);
+   if (journal->j_running_transaction &&
+   journal->j_running_transaction->t_tid == tid)
+   ret = 0;
+   if (journal->j_committing_transaction &&
+   journal->j_committing_transaction->t_tid == tid)
+   ret = 0;
+   read_unlock(>j_state_lock);
+   return ret;
+}
+EXPORT_SYMBOL(jbd2_transaction_committed);
+
 /*
  * When this function returns the transaction corresponding to tid
  * will be completed.  If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index