-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map
From: Liu Bo <bo.li....@oracle.com>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: 2014年09月18日 15:33
On Thu, Sep 18, 2014 at 01:36:47PM +0800, Qu Wenruo wrote:
-------- Original Message --------
Subject: Re: [PATCH] btrfs: Fix and enhance merge_extent_mapping()
to insert best fitted extent map
From: Liu Bo <bo.li....@oracle.com>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: 2014年09月18日 12:21
On Wed, Sep 17, 2014 at 11:53:35AM +0800, Qu Wenruo wrote:
The following commit enhanced the merge_extent_mapping() to reduce
fragment in extent map tree, but it can't handle case which existing
lies before map_start:
51f39 btrfs: Use right extent length when inserting overlap extent map.

[BUG]
When existing extent map's start is before map_start,
the em->len will be minus, which will corrupt the extent map and fail to
insert the new extent map.
This will happen when someone get a large extent map, but when it is
going to insert it into extent map tree, some one has already commit
some write and split the huge extent into small parts.

[REPRODUCER]
It is very easy to tiger using filebench with randomrw personality.
It is about 100% to reproduce when using 8G preallocated file in 60s
randonrw test.

[FIX]
This patch can now handle any existing extent position.
Since it does not directly use existing->start, now it will find the
previous and next extent around map_start.
So the old existing->start < map_start bug will never happen again.

[ENHANCE]
This patch will insert the best fitted extent map into extent map tree,
other than the oldest [map_start, map_start + sectorsize) or the
relatively newer but not perfect [map_start, existing->start).

The patch will first search existing extent that does not intersects with
the desired map range [map_start, map_start + len).
The existing extent will be either before or behind map_start, and based
on the existing extent, we can find out the previous and next extent
around map_start.

So the best fitted extent would be [prev->end, next->start).
For prev or next is not found, em->start would be prev->end and em->end
wold be next->start.

With this patch, the fragment in extent map tree should be reduced much
more than the 51f39 commit and reduce an unneeded extent map tree search.

Reported-by: Tsutomu Itoh <t-i...@jp.fujitsu.com>
Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
  fs/btrfs/inode.c | 79 ++++++++++++++++++++++++++++++++++++++++----------------
  1 file changed, 57 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 016c403..8039021 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6191,21 +6191,60 @@ out_fail_inode:
        goto out_fail;
  }
+/* Find next extent map of a given extent map, caller needs to ensure locks */
+static struct extent_map *next_extent_map(struct extent_map *em)
+{
+       struct rb_node *next;
+
+       next = rb_next(&em->rb_node);
+       if (!next)
+               return NULL;
+       return container_of(next, struct extent_map, rb_node);
+}
+
+static struct extent_map *prev_extent_map(struct extent_map *em)
+{
+       struct rb_node *prev;
+
+       prev = rb_prev(&em->rb_node);
+       if (!prev)
+               return NULL;
+       return container_of(prev, struct extent_map, rb_node);
+}
+
  /* helper for btfs_get_extent.  Given an existing extent in the tree,
+ * the existing extent is the nearest extent to map_start,
   * and an extent that you want to insert, deal with overlap and insert
- * the new extent into the tree.
+ * the best fitted new extent into the tree.
   */
  static int merge_extent_mapping(struct extent_map_tree *em_tree,
                                struct extent_map *existing,
                                struct extent_map *em,
                                u64 map_start)
  {
+       struct extent_map *prev;
+       struct extent_map *next;
+       u64 start;
+       u64 end;
        u64 start_diff;
        BUG_ON(map_start < em->start || map_start >= extent_map_end(em));
-       start_diff = map_start - em->start;
-       em->start = map_start;
-       em->len = existing->start - em->start;
+
+       if (existing->start > map_start) {
+               next = existing;
+               prev = prev_extent_map(next);
+       } else {
+               prev = existing;
+               next = next_extent_map(prev);
+       }
+
+       start = prev ? extent_map_end(prev) : em->start;
+       start = max_t(u64, start, em->start);
+       end = next ? next->start : extent_map_end(em);
+       end = min_t(u64, end, extent_map_end(em));
+       start_diff = start - em->start;
+       em->start = start;
+       em->len = end - start;
        if (em->block_start < EXTENT_MAP_LAST_BYTE &&
            !test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
                em->block_start += start_diff;
@@ -6482,25 +6521,21 @@ insert:
                ret = 0;
-               existing = lookup_extent_mapping(em_tree, start, len);
-               if (existing && (existing->start > start ||
-                   existing->start + existing->len <= start)) {
+               existing = search_extent_mapping(em_tree, start, len);
+               /*
+                * existing will always be non-NULL, since there must be
+                * extent causing the -EEXIST.
+                */
+               if (start >= extent_map_end(existing) ||
+                   start + len <= existing->start) {
This will introduce something wrong, the 'else' part is 'em = existing;',
and the condition is actually
(start < extent_map_end(existing) && start + len > existing->start),
which means the existing overlaps with [start, start+len).
Nope, the else part is doing the right thing.

Before the patch, going to the 'em = existing;' routine's condition
is like the following:
1) existing returned by lookup_extent_mapping is not NULL
2) (existing->start > start || existing->start + existing->len
<=start) is not met

1) implies the following condition: (in extent_map.c,
__lookup_extent_mapping())
!!(end > existing->start && start < extent_map_end(existing)), which
is equal to the following:
start + len > existing->start(1) && start < extent_map_end(existing) (2)

2) is actually the following
start >= existing->start (3) && start < extent_map_end(existing) (4)

And the hidden condition len > 0(5)
combining 1) and 2), you will find the real condition to go to 'em =
existing' routine is what the patch does.
Due to (5), (1) and (3) is the same condition, and (2) (4) is the same too.
So the patch is OK. 'em = existing' condition is not broken.
Okay, but think it twice, they're not same,

original:   (start >= existing->start && start < extent_map_end(existing))
this patch: (start < extent_map_end(existing) && start + len > existing->start)

(start + len > existing->start) doesn't equal to start >= existing->start,
here is a case of (start+len > existing->start) but (start <= existing->start).

        |--------| -->(existing)
     |--------|    -->[start, start+len)

And calling search_extent_mapping() doesn't make sure that
(start >= existing->start) is true, either.
All right, that case is right and I'm wrong.
I'll change the if() to use start >= existing->start.

BTW, Any other problem?

Thanks,
Qu

And one of overlapping cases is (existing->start > start), ie. em->start > 
start, this is
against our rule of btrfs_get_extent,
Nope again, this overlapping in fact is quite normal in multithread
random read/write.
The files's [0~16) is a preallocated one,
Thread A:
     write [4K, 8K) into the file, but not committed yet.
     extent map tree contains [0,16K) only
Thread B:
btrfs_get_extent()
     the map_start is 8K, len is 4K as an example
     grab a large em, take [0,16K), since [4K,8K) write is not committed.
     comes to insert: btrfs_release_path(path);

Thread A:
     [4K, 8K) is not committed
     the extent map is now [0, 4K) [4K, 8K) [8K, 16K).

Thread B:
     goes to insert: add_extent_mapping()
     the [0,16K) is overlapping, and the returned existing one is [8K, 16K).
     which contains the [map_start, map_start +  len).
So this's an example of existing->start == start (both are 8K), not
existing->start > start.

See __extent_writepage_io(),

{
                ...
                 em = epd->get_extent(inode, page, pg_offset, cur,
                                      end - cur + 1, 1);
                 if (IS_ERR_OR_NULL(em)) {
                         SetPageError(page);
                         ret = PTR_ERR_OR_ZERO(em);
                         break;
                 }
extent_offset = cur - em->start;
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^ it needs to be (em->start <= cur)

                ...
}

thanks,
-liubo

struct extent_map *btrfs_get_extent(...)
{
        [...]
        insert:
                btrfs_release_path(path);
                if (em->start > start || extent_map_end(em) <= start) {
                        btrfs_err(root->fs_info, "bad extent! em: [%llu %llu] 
passed
        [%llu %llu]",
                                em->start, em->len, start, len);
                        err = -EIO;
                 goto out;
         }
         [...]
}

thanks,
-liubo

+                       /*
+                        * The existing extent map is the one nearest to
+                        * the [start, start + len) range which overlaps
+                        */
+                       err = merge_extent_mapping(em_tree, existing,
+                                                  em, start);
                        free_extent_map(existing);
-                       existing = NULL;
-               }
-               if (!existing) {
-                       existing = lookup_extent_mapping(em_tree, em->start,
-                                                        em->len);
-                       if (existing) {
-                               err = merge_extent_mapping(em_tree, existing,
-                                                          em, start);
-                               free_extent_map(existing);
-                               if (err) {
-                                       free_extent_map(em);
-                                       em = NULL;
-                               }
-                       } else {
-                               err = -EIO;
+                       if (err) {
                                free_extent_map(em);
                                em = NULL;
                        }
--
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to