On Thu, Sep 18, 2014 at 01:36:47PM +0800, Qu Wenruo wrote:
> 
> -------- Original Message --------
> Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping()
> to insert best fitted extent map
> From: Liu Bo <bo.li....@oracle.com>
> To: Qu Wenruo <quwen...@cn.fujitsu.com>
> Date: 2014年09月18日 12:21
> >On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote:
> >>The following commit enhanced the merge_extent_mapping() to reduce
> >>fragment in extent map tree, but it can't handle case which existing
> >>lies before map_start:
> >>51f39 btrfs: Use right extent length when inserting overlap extent map.
> >>
> >>[BUG]
> >>When existing extent map's start is before map_start,
> >>the em->len will be minus, which will corrupt the extent map and fail to
> >>insert the new extent map.
> >>This will happen when someone get a large extent map, but when it is
> >>going to insert it into extent map tree, some one has already commit
> >>some write and split the huge extent into small parts.
> >>
> >>[REPRODUCER]
> >>It is very easy to tiger using filebench with randomrw personality.
> >>It is about 100% to reproduce when using 8G preallocated file in 60s
> >>randonrw test.
> >>
> >>[FIX]
> >>This patch can now handle any existing extent position.
> >>Since it does not directly use existing->start, now it will find the
> >>previous and next extent around map_start.
> >>So the old existing->start < map_start bug will never happen again.
> >>
> >>[ENHANCE]
> >>This patch will insert the best fitted extent map into extent map tree,
> >>other than the oldest [map_start, map_start + sectorsize) or the
> >>relatively newer but not perfect [map_start, existing->start).
> >>
> >>The patch will first search existing extent that does not intersects with
> >>the desired map range [map_start, map_start + len).
> >>The existing extent will be either before or behind map_start, and based
> >>on the existing extent, we can find out the previous and next extent
> >>around map_start.
> >>
> >>So the best fitted extent would be [prev->end, next->start).
> >>For prev or next is not found, em->start would be prev->end and em->end
> >>wold be next->start.
> >>
> >>With this patch, the fragment in extent map tree should be reduced much
> >>more than the 51f39 commit and reduce an unneeded extent map tree search.
> >>
> >>Reported-by: Tsutomu Itoh <t-i...@jp.fujitsu.com>
> >>Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> >>---
> >>  fs/btrfs/inode.c | 79 
> >> ++++++++++++++++++++++++++++++++++++++++----------------
> >>  1 file changed, 57 insertions(+), 22 deletions(-)
> >>
> >>diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> >>index 016c403..8039021 100644
> >>--- a/fs/btrfs/inode.c
> >>+++ b/fs/btrfs/inode.c
> >>@@ -6191,21 +6191,60 @@ out_fail_inode:
> >>    goto out_fail;
> >>  }
> >>+/* Find next extent map of a given extent map, caller needs to ensure 
> >>locks */
> >>+static struct extent_map *next_extent_map(struct extent_map *em)
> >>+{
> >>+   struct rb_node *next;
> >>+
> >>+   next = rb_next(&em->rb_node);
> >>+   if (!next)
> >>+           return NULL;
> >>+   return container_of(next, struct extent_map, rb_node);
> >>+}
> >>+
> >>+static struct extent_map *prev_extent_map(struct extent_map *em)
> >>+{
> >>+   struct rb_node *prev;
> >>+
> >>+   prev = rb_prev(&em->rb_node);
> >>+   if (!prev)
> >>+           return NULL;
> >>+   return container_of(prev, struct extent_map, rb_node);
> >>+}
> >>+
> >>  /* helper for btfs_get_extent.  Given an existing extent in the tree,
> >>+ * the existing extent is the nearest extent to map_start,
> >>   * and an extent that you want to insert, deal with overlap and insert
> >>- * the new extent into the tree.
> >>+ * the best fitted new extent into the tree.
> >>   */
> >>  static int merge_extent_mapping(struct extent_map_tree *em_tree,
> >>                            struct extent_map *existing,
> >>                            struct extent_map *em,
> >>                            u64 map_start)
> >>  {
> >>+   struct extent_map *prev;
> >>+   struct extent_map *next;
> >>+   u64 start;
> >>+   u64 end;
> >>    u64 start_diff;
> >>    BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
> >>-   start_diff = map_start - em->start;
> >>-   em->start = map_start;
> >>-   em->len = existing->start - em->start;
> >>+
> >>+   if (existing->start > map_start) {
> >>+           next = existing;
> >>+           prev = prev_extent_map(next);
> >>+   } else {
> >>+           prev = existing;
> >>+           next = next_extent_map(prev);
> >>+   }
> >>+
> >>+   start = prev ? extent_map_end(prev) : em->start;
> >>+   start = max_t(u64, start, em->start);
> >>+   end = next ? next->start : extent_map_end(em);
> >>+   end = min_t(u64, end, extent_map_end(em));
> >>+   start_diff = start - em->start;
> >>+   em->start = start;
> >>+   em->len = end - start;
> >>    if (em->block_start < EXTENT_MAP_LAST_BYTE &&
> >>        !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
> >>            em->block_start += start_diff;
> >>@@ -6482,25 +6521,21 @@ insert:
> >>            ret = 0;
> >>-           existing = lookup_extent_mapping(em_tree, start, len);
> >>-           if (existing && (existing->start > start ||
> >>-               existing->start + existing->len <= start)) {
> >>+           existing = search_extent_mapping(em_tree, start, len);
> >>+           /*
> >>+            * existing will always be non-NULL, since there must be
> >>+            * extent causing the -EEXIST.
> >>+            */
> >>+           if (start >= extent_map_end(existing) ||
> >>+               start + len <= existing->start) {
> >This will introduce something wrong, the 'else' part is 'em = existing;',
> >and the condition is actually
> >(start < extent_map_end(existing) && start + len > existing->start),
> >which means the existing overlaps with [start, start+len).
> Nope, the else part is doing the right thing.
> 
> Before the patch, going to the 'em = existing;' routine's condition
> is like the following:
> 1) existing returned by lookup_extent_mapping is not NULL
> 2) (existing->start > start || existing->start + existing->len
> <=start) is not met
> 
> 1) implies the following condition: (in extent_map.c,
> __lookup_extent_mapping())
> !!(end > existing->start && start < extent_map_end(existing)), which
> is equal to the following:
> start + len > existing->start(1) && start < extent_map_end(existing) (2)
> 
> 2) is actually the following
> start >= existing->start (3) && start < extent_map_end(existing) (4)
> 
> And the hidden condition len > 0(5)
> combining 1) and 2), you will find the real condition to go to 'em =
> existing' routine is what the patch does.
> Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too.
> So the patch is OK. 'em = existing' condition is not broken.

Okay, but think it twice, they're not same,

original:   (start >= existing->start && start < extent_map_end(existing))
this patch: (start < extent_map_end(existing) && start + len > existing->start)

(start + len > existing->start) doesn't equal to start >= existing->start,
here is a case of (start+len > existing->start) but (start <= existing->start).

       |--------| -->(existing)
    |--------|    -->[start, start+len)

And calling search_extent_mapping() doesn't make sure that
(start >= existing->start) is true, either.

> 
> >
> >And one of overlapping cases is (existing->start > start), ie. em->start > 
> >start, this is
> >against our rule of btrfs_get_extent,
> Nope again, this overlapping in fact is quite normal in multithread
> random read/write.
> The files's [0~16) is a preallocated one,
> Thread A:
>     write [4K, 8K) into the file, but not committed yet.
>     extent map tree contains [0,16K) only
> Thread B:
> btrfs_get_extent()
>     the map_start is 8K, len is 4K as an example
>     grab a large em, take [0,16K), since [4K,8K) write is not committed.
>     comes to insert: btrfs_release_path(path);
> 
> Thread A:
>     [4K, 8K) is not committed
>     the extent map is now [0, 4K) [4K, 8K) [8K, 16K).
> 
> Thread B:
>     goes to insert: add_extent_mapping()
>     the [0,16K) is overlapping, and the returned existing one is [8K, 16K).
>     which contains the [map_start, map_start +  len).

So this's an example of existing->start == start (both are 8K), not
existing->start > start.

See __extent_writepage_io(),

{
                ...
                em = epd->get_extent(inode, page, pg_offset, cur,               
 
                                     end - cur + 1, 1);                         
 
                if (IS_ERR_OR_NULL(em)) {                                       
 
                        SetPageError(page);                                     
 
                        ret = PTR_ERR_OR_ZERO(em);                              
 
                        break;                                                  
 
                }                                                               
 
                                                                                
 
                extent_offset = cur - em->start;                                
 
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur)

                ...
}

thanks,
-liubo

> 
> >struct extent_map *btrfs_get_extent(...)
> >{
> >     [...]
> >     insert:
> >             btrfs_release_path(path);
> >             if (em->start > start || extent_map_end(em) <= start) {
> >                     btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] 
> > passed
> >     [%llu %llu]",
> >                             em->start, em->len, start, len);
> >                     err = -EIO;
> >                 goto out;
> >         }
> >         [...]
> >}
> >
> >thanks,
> >-liubo
> >
> >>+                   /*
> >>+                    * The existing extent map is the one nearest to
> >>+                    * the [start, start + len) range which overlaps
> >>+                    */
> >>+                   err = merge_extent_mapping(em_tree, existing,
> >>+                                              em, start);
> >>                    free_extent_map(existing);
> >>-                   existing = NULL;
> >>-           }
> >>-           if (!existing) {
> >>-                   existing = lookup_extent_mapping(em_tree, em->start,
> >>-                                                    em->len);
> >>-                   if (existing) {
> >>-                           err = merge_extent_mapping(em_tree, existing,
> >>-                                                      em, start);
> >>-                           free_extent_map(existing);
> >>-                           if (err) {
> >>-                                   free_extent_map(em);
> >>-                                   em = NULL;
> >>-                           }
> >>-                   } else {
> >>-                           err = -EIO;
> >>+                   if (err) {
> >>                            free_extent_map(em);
> >>                            em = NULL;
> >>                    }
> >>-- 
> >>2.1.0
> >>
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> >>the body of a message to majord...@vger.kernel.org
> >>More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to