On Fri, Jan 05, 2018 at 12:51:09PM -0700, Liu Bo wrote:
> This fixes a corner case that is caused by a race of dio write vs dio
> read/write.
> 
> Here is how the race could happen.
> 
> Suppose that no extent map has been loaded into memory yet.
> There is a file extent [0, 32K), two jobs are running concurrently
> against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
> read from [0, 4K) or [4K, 8K).
> 
> t1 goes ahead of t2 and splits em [0, 32K) to em [0K, 8K) and [8K 32K).
> 
> ------------------------------------------------------
>              t1                                t2
>       btrfs_get_blocks_direct()         btrfs_get_blocks_direct()
>        -> btrfs_get_extent()              -> btrfs_get_extent()
>            -> lookup_extent_mapping()
>            -> add_extent_mapping()            -> lookup_extent_mapping()
>               # load [0, 32K)
>        -> btrfs_new_extent_direct()
>            -> btrfs_drop_extent_cache()
>               # split [0, 32K) and
>             # drop [8K, 32K)
>            -> add_extent_mapping()
>               # add [8K, 32K)
>                                               -> add_extent_mapping()
>                                                  # handle -EEXIST when adding
>                                                  # [0, 32K)
> ------------------------------------------------------
> About how t2(dio read/write) runs into -EEXIST:
> 
> a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),
> 
> b) search_extent_mapping() then returns [0, 8k) as the existing em,
>    even though start == existing->start, em is [0, 32k) so that
>    extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,
> 
> c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
>    (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,
> 
> d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
>    which is confusing applications.
> 
> Here I conclude all the possible situations,
> 1) start < existing->start
> 
>             +-----------+em+-----------+
> +--prev---+ |     +-------------+      |
> |         | |     |             |      |
> +---------+ +     +---+existing++      ++
>                 +
>                 |
>                 +
>              start
> 
> 2) start == existing->start
> 
>       +------------em------------+
>       |     +-------------+      |
>       |     |             |      |
>       +     +----existing-+      +
>             |
>             |
>             +
>          start
> 
> 3) start > existing->start && start < (existing->start + existing->len)
> 
>       +------------em------------+
>       |     +-------------+      |
>       |     |             |      |
>       +     +----existing-+      +
>                |
>                |
>                +
>              start
> 
> 4) start >= (existing->start + existing->len)
> 
> +-----------+em+-----------+
> |     +-------------+      | +--next---+
> |     |             |      | |         |
> +     +---+existing++      + +---------+
>                       +
>                       |
>                       +
>                    start
> 
> As we can see, it turns out that if start is within existing em (front
> inclusive), then the existing em should be returned as is, otherwise,
> we try our best to merge candidate em with sibling ems to form a
> larger em (in order to reduce the total number of em).
> 
> Reported-by: David Vallender <david.vallen...@landmark.co.uk>
> Signed-off-by: Liu Bo <bo.li....@oracle.com>

Reviewed-by: Josef Bacik <jba...@fb.com>

Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to