On Thu, Apr 20, 2017 at 09:52:00AM +0800, Qu Wenruo wrote:
> 
> 
> At 04/20/2017 09:25 AM, Liu Bo wrote:
> > On Fri, Apr 07, 2017 at 10:43:15AM +0800, Qu Wenruo wrote:
> > > [BUG]
> > > Cycle mount btrfs can cause fiemap to return different result.
> > > Like:
> > >   # mount /dev/vdb5 /mnt/btrfs
> > >   # dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > >     0: [0..127]:        25088..25215       128   0x1
> > >   # umount /mnt/btrfs
> > >   # mount /dev/vdb5 /mnt/btrfs
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > >     0: [0..31]:         25088..25119        32   0x0
> > >     1: [32..63]:        25120..25151        32   0x0
> > >     2: [64..95]:        25152..25183        32   0x0
> > >     3: [96..127]:       25184..25215        32   0x1
> > > But after above fiemap, we get correct merged result if we call fiemap
> > > again.
> > >   # xfs_io -c "fiemap -v" /mnt/btrfs/file
> > >   /mnt/test/file:
> > >   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
> > >     0: [0..127]:        25088..25215       128   0x1
> > > 
> > > [REASON]
> > > Btrfs will try to merge extent map when inserting new extent map.
> > > 
> > > btrfs_fiemap(start=0 len=(u64)-1)
> > > |- extent_fiemap(start=0 len=(u64)-1)
> > >     |- get_extent_skip_holes(start=0 len=64k)
> > >     |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> > >     |     |- btrfs_get_extent(start=0 len=64k)
> > >     |        |  Found on-disk (ino, EXTENT_DATA, 0)
> > >     |        |- add_extent_mapping()
> > >     |        |- Return (em->start=0, len=16k)
> > >     |
> > >     |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
> > >     |
> > >     |- get_extent_skip_holes(start=0 len=64k)
> > >     |  |- btrfs_get_extent_fiemap(start=0 len=64k)
> > >     |     |- btrfs_get_extent(start=16k len=48k)
> > >     |        |  Found on-disk (ino, EXTENT_DATA, 16k)
> > >     |        |- add_extent_mapping()
> > >     |        |  |- try_merge_map()
> > >     |        |     Merge with previous em start=0 len=16k
> > >     |        |     resulting em start=0 len=32k
> > >     |        |- Return (em->start=0, len=32K)    << Merged result
> > >     |- Stripe off the unrelated range (0~16K) of return em
> > >     |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
> > >        ^^^ Causing split fiemap extent.
> > > 
> > > And since in add_extent_mapping(), em is already merged, in next
> > > fiemap() call, we will get merged result.
> > > 
> > > [FIX]
> > > Here we introduce a new structure, fiemap_cache, which records previous
> > > fiemap extent.
> > > 
> > > And will always try to merge current fiemap_cache result before calling
> > > fiemap_fill_next_extent().
> > > Only when we failed to merge current fiemap extent with cached one, we
> > > will call fiemap_fill_next_extent() to submit cached one.
> > > 
> > > So by this method, we can merge all fiemap extents.
> > > 
> > 
> > If I understand it correctly, what it's missing currently is 'merge', a
> > straightfoward idea is to make use of the 'merge' ability of 
> > btrfs_get_extent. >
> > Since btrfs_get_extent_fiemap is a wrapper of btrfs_get_extent, does it make
> > sense if we make btrfs_get_extent_fiemap iterate all extent maps within the
> > range passing to it or just one more extent map to check if btrfs_get_extent
> > could return a merged extent map before returning?
> 
> So your idea to to do the merging inside btrfs_get_extent(), instead of
> merging it in extent_fiemap()?
>

No, merge ems inside the wrapper, ie. btrfs_get_extent_fiemap().

Thanks,

-liubo
> In theory, they have the same effect for fiemap.
> And that's my original idea.
> 
> But the problem is, btrfs_get_extent() is called almost everywhere, more
> frequently than extent_fiemap(), the extra time used to merging extent map
> may cause performance problem.
> 
> For the worst case, if a file is made up by 262144 4K extent (takes up 1G),
> btrfs_get_extent() call on the file will iterate all these extents,
> which will iterate more than 500 16K tree blocks.
> 
> If only fiemap takes such longer time, it's still acceptable. But if
> btrfs_get_extent() takes such longer time, I think it will be a problem
> then.
> 
> Thanks,
> Qu
> 
> 
> > 
> > If no extent map could be merged, which is the worst case, the first
> > btrfs_get_extent_fiemap will read file metadata into memory from disk and 
> > the
> > following btrfs_get_extent_fiemap will read the rest of file metadata from
> > the in-memory extent map tree directly.
> > 
> > Thanks,
> > 
> > -liubo
> > 
> > > It can also be done in fs/ioctl.c, however the problem is if
> > > fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
> > > extent.
> > > So I choose to merge it in btrfs.
> > > 
> > > Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> > > ---
> > > v2:
> > >    Since fiemap_extent_info has a limit for number of fiemap_extent, it's 
> > > possible
> > >    that fiemap_fill_next_extent() return 1 halfway. Remove the WARN_ON() 
> > > which can
> > >    cause kernel warning if we fiemap is called on large compressed file.
> > > v3:
> > >    Rename finish_fiemap_extent() to check_fiemap_extent(), as in v3 we 
> > > ensured
> > >    submit_fiemap_extent() to submit fiemap cache, so it just acts as a
> > >    sanity check.
> > >    Remove BTRFS_MAX_EXTENT_SIZE limit in submit_fiemap_extent(), as
> > >    extent map can be larger than BTRFS_MAX_EXTENT_SIZE.
> > >    Don't do backward jump, suggested by David.
> > >    Better sanity check and recoverable fix.
> > > 
> > > To David:
> > >    What about adding a btrfs_debug_warn(), which will only call 
> > > WARN_ON(1) if
> > >    BTRFS_CONFIG_DEBUG is specified for recoverable bug?
> > > 
> > >    And modify ASSERT() to always WARN_ON() and exit error code?
> > > ---
> > >   fs/btrfs/extent_io.c | 124 
> > > ++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 122 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > index 28e8192..c4cb65d 100644
> > > --- a/fs/btrfs/extent_io.c
> > > +++ b/fs/btrfs/extent_io.c
> > > @@ -4353,6 +4353,123 @@ static struct extent_map 
> > > *get_extent_skip_holes(struct inode *inode,
> > >           return NULL;
> > >   }
> > > +/*
> > > + * To cache previous fiemap extent
> > > + *
> > > + * Will be used for merging fiemap extent
> > > + */
> > > +struct fiemap_cache {
> > > + u64 offset;
> > > + u64 phys;
> > > + u64 len;
> > > + u32 flags;
> > > + bool cached;
> > > +};
> > > +
> > > +/*
> > > + * Helper to submit fiemap extent.
> > > + *
> > > + * Will try to merge current fiemap extent specified by @offset, @phys,
> > > + * @len and @flags with cached one.
> > > + * And only when we fails to merge, cached one will be submitted as
> > > + * fiemap extent.
> > > + *
> > > + * Return value is the same as fiemap_fill_next_extent().
> > > + */
> > > +static int submit_fiemap_extent(struct fiemap_extent_info *fieinfo,
> > > +                         struct fiemap_cache *cache,
> > > +                         u64 offset, u64 phys, u64 len, u32 flags)
> > > +{
> > > + int ret = 0;
> > > +
> > > + if (!cache->cached)
> > > +         goto assign;
> > > +
> > > + /*
> > > +  * Sanity check, extent_fiemap() should have ensured that new
> > > +  * fiemap extent won't overlap with cahced one.
> > > +  * Not recoverable.
> > > +  *
> > > +  * NOTE: Physical address can overlap, due to compression
> > > +  */
> > > + if (cache->offset + cache->len > offset) {
> > > +         WARN_ON(1);
> > > +         return -EINVAL;
> > > + }
> > > +
> > > + /*
> > > +  * Only merges fiemap extents if
> > > +  * 1) Their logical addresses are continuous
> > > +  *
> > > +  * 2) Their physical addresses are continuous
> > > +  *    So truly compressed (physical size smaller than logical size)
> > > +  *    extents won't get merged with each other
> > > +  *
> > > +  * 3) Share same flags except FIEMAP_EXTENT_LAST
> > > +  *    So regular extent won't get merged with prealloc extent
> > > +  */
> > > + if (cache->offset + cache->len  == offset &&
> > > +     cache->phys + cache->len == phys  &&
> > > +     (cache->flags & ~FIEMAP_EXTENT_LAST) ==
> > > +                 (flags & ~FIEMAP_EXTENT_LAST)) {
> > > +         cache->len += len;
> > > +         cache->flags |= flags;
> > > +         goto try_submit_last;
> > > + }
> > > +
> > > + /* Not mergeable, need to submit cached one */
> > > + ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
> > > +                               cache->len, cache->flags);
> > > + cache->cached = false;
> > > + if (ret)
> > > +         return ret;
> > > +assign:
> > > + cache->cached = true;
> > > + cache->offset = offset;
> > > + cache->phys = phys;
> > > + cache->len = len;
> > > + cache->flags = flags;
> > > +try_submit_last:
> > > + if (cache->flags & FIEMAP_EXTENT_LAST) {
> > > +         ret = fiemap_fill_next_extent(fieinfo, cache->offset,
> > > +                         cache->phys, cache->len, cache->flags);
> > > +         cache->cached = false;
> > > + }
> > > + return ret;
> > > +}
> > > +
> > > +/*
> > > + * Sanity check for fiemap cache
> > > + *
> > > + * All fiemap cache should be submitted by submit_fiemap_extent()
> > > + * Iteration should be terminated either by last fiemap extent or
> > > + * fieinfo->fi_extents_max.
> > > + * So no cached fiemap should exist.
> > > + */
> > > +static int check_fiemap_cache(struct btrfs_fs_info *fs_info,
> > > +                        struct fiemap_extent_info *fieinfo,
> > > +                        struct fiemap_cache *cache)
> > > +{
> > > + int ret;
> > > +
> > > + if (!cache->cached)
> > > +         return 0;
> > > +
> > > + /* Small and recoverbale problem, only to info developer */
> > > +#ifdef CONFIG_BTRFS_DEBUG
> > > + WARN_ON(1);
> > > +#endif
> > > + btrfs_warn(fs_info,
> > > +            "unhandled fiemap cache detected: offset=%llu phys=%llu 
> > > len=%llu flags=0x%x",
> > > +            cache->offset, cache->phys, cache->len, cache->flags);
> > > + ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
> > > +                               cache->len, cache->flags);
> > > + cache->cached = false;
> > > + if (ret > 0)
> > > +         ret = 0;
> > > + return ret;
> > > +}
> > > +
> > >   int extent_fiemap(struct inode *inode, struct fiemap_extent_info 
> > > *fieinfo,
> > >                   __u64 start, __u64 len, get_extent_t *get_extent)
> > >   {
> > > @@ -4370,6 +4487,7 @@ int extent_fiemap(struct inode *inode, struct 
> > > fiemap_extent_info *fieinfo,
> > >           struct extent_state *cached_state = NULL;
> > >           struct btrfs_path *path;
> > >           struct btrfs_root *root = BTRFS_I(inode)->root;
> > > + struct fiemap_cache cache = { 0 };
> > >           int end = 0;
> > >           u64 em_start = 0;
> > >           u64 em_len = 0;
> > > @@ -4549,8 +4667,8 @@ int extent_fiemap(struct inode *inode, struct 
> > > fiemap_extent_info *fieinfo,
> > >                           flags |= FIEMAP_EXTENT_LAST;
> > >                           end = 1;
> > >                   }
> > > -         ret = fiemap_fill_next_extent(fieinfo, em_start, disko,
> > > -                                       em_len, flags);
> > > +         ret = submit_fiemap_extent(fieinfo, &cache, em_start, disko,
> > > +                                    em_len, flags);
> > >                   if (ret) {
> > >                           if (ret == 1)
> > >                                   ret = 0;
> > > @@ -4558,6 +4676,8 @@ int extent_fiemap(struct inode *inode, struct 
> > > fiemap_extent_info *fieinfo,
> > >                   }
> > >           }
> > >   out_free:
> > > + if (!ret)
> > > +         ret = check_fiemap_cache(root->fs_info, fieinfo, &cache);
> > >           free_extent_map(em);
> > >   out:
> > >           btrfs_free_path(path);
> > > -- 
> > > 2.9.3
> > > 
> > > 
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to