On Thu 20-04-17 21:44:37, Ross Zwisler wrote:
> Users of DAX can suffer data corruption from stale mmap reads via the
> following sequence:
> 
> - open an mmap over a 2MiB hole
> 
> - read from a 2MiB hole, faulting in a 2MiB zero page
> 
> - write to the hole with write(3p).  The write succeeds but we incorrectly
>   leave the 2MiB zero page mapping intact.
> 
> - via the mmap, read the data that was just written.  Since the zero page
>   mapping is still intact we read back zeroes instead of the new data.
> 
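
For anyone wanting to poke at this from userspace, the sequence above could
be sketched roughly as below. This is only an illustration under stated
assumptions, not the test case behind the patch: the helper name
check_mmap_after_write() and the path passed to it are hypothetical, the
file would have to live on a DAX-mounted filesystem for the bug to be
reachable, and whether the read fault really installs a 2MiB zero page
depends on mapping alignment and on the filesystem.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define HOLE_SIZE (2UL * 1024 * 1024)	/* 2MiB hole */

/*
 * Hypothetical helper (not part of the patch).
 * Returns 0 if the mapping sees the data stored by pwrite(),
 * 1 if the mapping does not match what was written (stale mapping),
 * -1 if setting up the file or the mapping failed.
 */
int check_mmap_after_write(const char *path)
{
	static const char msg[] = "new data";
	volatile char sink;
	char *map;
	int fd, ret = -1;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Extend the file without allocating blocks: one 2MiB hole. */
	if (ftruncate(fd, HOLE_SIZE) < 0)
		goto out_close;

	map = mmap(NULL, HOLE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Read fault on the hole; on DAX this may map a zero page. */
	sink = map[0];
	(void)sink;

	/* Allocate blocks and store data through the write path. */
	if (pwrite(fd, msg, sizeof(msg), 0) != (ssize_t)sizeof(msg))
		goto out_unmap;

	/* Read back through the mapping that was faulted in earlier. */
	ret = memcmp(map, msg, sizeof(msg)) ? 1 : 0;

out_unmap:
	munmap(map, HOLE_SIZE);
out_close:
	close(fd);
	return ret;
}
```

On a page cache backed file the helper should return 0, since shared
mappings and write() are coherent there. A return value of 1 on a DAX
mount would correspond to the stale read described above.
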
> We fix this by unconditionally calling invalidate_inode_pages2_range() in
> dax_iomap_actor() for new block allocations, and by enhancing
> __dax_invalidate_mapping_entry() so that it properly unmaps the DAX entry
> being removed from the radix tree.
> 
> This is based on an initial patch from Jan Kara.
> 
> Signed-off-by: Ross Zwisler <ross.zwis...@linux.intel.com>
> Fixes: c6dcf52c23d2 ("mm: Invalidate DAX radix tree entries only if appropriate")
> Reported-by: Jan Kara <j...@suse.cz>
> Cc: <sta...@vger.kernel.org>    [4.10+]
> ---
>  fs/dax.c | 26 +++++++++++++++++++-------
>  1 file changed, 19 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 166504c..3f445d5 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -468,23 +468,35 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
>                                         pgoff_t index, bool trunc)
>  {
>       int ret = 0;
> -     void *entry;
> +     void *entry, **slot;
>       struct radix_tree_root *page_tree = &mapping->page_tree;
>  
>       spin_lock_irq(&mapping->tree_lock);
> -     entry = get_unlocked_mapping_entry(mapping, index, NULL);
> +     entry = get_unlocked_mapping_entry(mapping, index, &slot);
>       if (!entry || !radix_tree_exceptional_entry(entry))
>               goto out;
>       if (!trunc &&
>           (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
>            radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
>               goto out;
> +
> +     /*
> +      * Make sure 'entry' remains valid while we drop mapping->tree_lock to
> +      * do the unmap_mapping_range() call.
> +      */
> +     entry = lock_slot(mapping, slot);

This also stops page faults from mapping the entry again. Maybe worth
mentioning in the comment as well.

> +     spin_unlock_irq(&mapping->tree_lock);
> +
> +     unmap_mapping_range(mapping, (loff_t)index << PAGE_SHIFT,
> +                     (loff_t)PAGE_SIZE << dax_radix_order(entry), 0);

Ouch, unmapping entry-by-entry may get quite expensive if you are unmapping
large ranges - each unmap means an rmap walk... Since this is a data
corruption class of bug, let's fix it this way for now but I think we'll
need to improve this later.

E.g. what if we called unmap_mapping_range() for the whole invalidated
range after removing the radix tree entries?

Hum, but now thinking more about it I have a hard time figuring out why
write vs fault cannot actually still race:

CPU1 - write(2)                         CPU2 - read fault

                                        dax_iomap_pte_fault()
                                          ->iomap_begin() - sees hole
dax_iomap_rw()
  iomap_apply()
    ->iomap_begin() - allocates blocks
    dax_iomap_actor()
      invalidate_inode_pages2_range()
        - there's nothing to invalidate
                                          grab_mapping_entry()
                                          - we add zero page in the radix
                                            tree & map it to page tables

Similarly, read vs write fault may end up racing in a wrong way and try to
replace an already existing exceptional entry with a hole page?

                                                                Honza
> +
> +     spin_lock_irq(&mapping->tree_lock);
>       radix_tree_delete(page_tree, index);
>       mapping->nrexceptional--;
>       ret = 1;
>  out:
> -     put_unlocked_mapping_entry(mapping, index, entry);
>       spin_unlock_irq(&mapping->tree_lock);
> +     dax_wake_mapping_entry_waiter(mapping, index, entry, true);
>       return ret;
>  }
>  /*
> @@ -999,11 +1011,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>               return -EIO;
>  
>       /*
> -      * Write can allocate block for an area which has a hole page mapped
> -      * into page tables. We have to tear down these mappings so that data
> -      * written by write(2) is visible in mmap.
> +      * Write can allocate block for an area which has a hole page or zero
> +      * PMD entry in the radix tree.  We have to tear down these mappings so
> +      * that data written by write(2) is visible in mmap.
>        */
> -     if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
> +     if (iomap->flags & IOMAP_F_NEW) {
>               invalidate_inode_pages2_range(inode->i_mapping,
>                                             pos >> PAGE_SHIFT,
>                                             (end - 1) >> PAGE_SHIFT);
> -- 
> 2.9.3
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR
