Hi,

You may notice my @wdc.com address is bounced. As my internship job at
wdc is finished, the address is already expired.
Please contact me on this @elisp.net address.

2017-10-11 2:42 GMT+09:00 Hans van Kranenburg <hans.van.kranenb...@mendix.com>:
> Sorry for the mail spam, it's an interesting code puzzle... :)
>
> On 10/10/2017 07:22 PM, Hans van Kranenburg wrote:
>> On 10/10/2017 07:07 PM, Hans van Kranenburg wrote:
>>> On 10/10/2017 01:31 PM, David Sterba wrote:
>>>> Hi,
>>>>
>>>> On Fri, Sep 29, 2017 at 04:20:51PM +0900, Naohiro Aota wrote:
>>>>> Balancing a fresh METADATA=dup btrfs file system (with size < 50G)
>>>>> generates a 128MB sized block group. While we set max_stripe_size =
>>>>> max_chunk_size = 256MB, we get this half sized block group:
>>>>>
>>>>> $ btrfs ins dump-t -t CHUNK_TREE btrfs.img|grep length
>>>>>                 length 8388608 owner 2 stripe_len 65536 type DATA
>>>>>                 length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
>>>>>                 length 134217728 owner 2 stripe_len 65536 type 
>>>>> METADATA|DUP
>>>>>
>>>>> Before commit 86db25785a6e ("Btrfs: fix max chunk size on raid5/6"), we
>>>>> used "stripe_size * ndevs > max_chunk_size * ncopies" to check the max
>>>>> chunk size. Since stripe_size = 256MB * dev_stripes (= 2) = 512MB, ndevs
>>>>> = 1, max_chunk_size = 256MB, and ncopies = 2, we allowed 256MB
>>>>> METADATA|DUP block group.
>>>>>
>>>>> But now, we use "stripe_size * data_stripes > max_chunk_size". Since
>>>>> data_stripes = 1 on DUP, it disallows the block group to have > 128MB.
>>>>> What missing here is "dev_stripes". Proper logical space used by the block
>>>>> group is "stripe_size * data_stripes / dev_stripes". Tweak the equations 
>>>>> to
>>>>> use the right value.
>>>>
>>>> I started looking into it and still don't fully understand it. Change
>>>> deep in the allocator can easily break some blockgroup combinations, so
>>>> I'm rather conservative here.
>>>
>>> I think that the added usage of data_stripes in 86db25785a6e is the
>>> problematic change. data_stripes is something that was introduced as
>>> part of RAID56 in 53b381b3a and clearly only has a meaning that's
>>> properly thought of for RAID56. The RAID56 commit already adds "this
>>> will have to be fixed for RAID1 and RAID10 over more drives", only the
>>> author doesn't catch the DUP case, which already breaks at that point.
>>>

Thank you for explaining in detail :) Yes, the allocator was already
broken by using the data_stripes.
(and it will also "break" existing file systems, in terms of making a
DUP|META block group smaller)
I believe this patch fix the things.

>>> At the beginning it says:
>>>
>>> int data_stripes; /* number of stripes that count for block group size */
>>>
>>> For the example:
>>>
>>> This is DUP:
>>>
>>> .sub_stripes    = 1,
>>> .dev_stripes    = 2,
>>> .devs_max       = 1,
>>> .devs_min       = 1,
>>> .tolerated_failures = 0,
>>> .devs_increment = 1,
>>> .ncopies        = 2,
>>>
>>> In the code:
>>>
>>> max_stripe_size = SZ_256M
>>> max_chunk_size = max_stripe_size    -> SZ_256M
>>>
>>> Then we have find_free_dev_extent:
>>>     max_stripe_size * dev_stripes   -> SZ_256M * 2 -> 512M
>>>
>>> So we like to find 512M on a disk, to stuff 2 stripes of 256M inside for
>>> the DUP. (remember: The two parts of DUP *never* end up on a different
>>> disk, even if you have multiple)
>>>
>
> Another better fix would be to make a change here...
>
>>> If we find one:
>>> stripe_size = devices_info[ndevs-1].max_avail   -> 512M, yay
>
> ...because this is not yay. The 512M is max_avail, which needs to holds
> *2* stripes, not 1. So stripe_size is suddenly changed to twice the
> stripe size for DUP.
>
> So, an additional division again by dev_stripes would translate the
> max_avail on device ndevs-1 to the stripe size to use.

This catch the point. Notice stripe_size is divided by dev_stripes
after the lines in the patch.
It means that at this region (from "stripe_size =
devices_info[ndevs-1].max_avail;" to "stripe_size =
div_u64(stripe_size, dev_stripes);"),
"stripe_size" stands for "stripe size written in one device". IOW,
stripe_size (here, size of region in one device) = dev_stripes *
stripe_size (final, one chunk).
These two "stripe_size" is mostly the same value, because we set
dev_stripes > 1 only with DUP.

We are using this stripe_size in one device manner to calculate logical size.
This is correct, if we actually stripe write to stripes in one device.
But, it actually is
copied stripes at least with current implementation. So the
if-condition "if (stripe_size * data_stripes > max_chunk_size)"
is doing wrong calculation.

So the solution would be:

1. as in my patch, consider dev_stripes in the if-condition
or,
2. as Hans suggests, move the division by dev_stripes above the if-condition,
   and properly re-calculate stripe_size if logical size > max_chunk_size

Now, I think taking the 2nd solution would make the code more clear,
because it eliminate two different meanings of "stripe_size"

Again, keep in mind that "dev_stripes = 1" on all BG types but DUP.
So, multiply/divide with "dev_stripes" won't affect any BG but DUP BG
(at least on current RAID feature set).

>
>>>
>>> num_stripes = ndevs * dev_stripes -> 1 * 2  -> 2
>>>
>>> data_stripes = num_stripes / ncopies = 2 / 2 = 1
>>
>> Oh, wow, this is not true of course, because the "number of stripes that
>> count for block group size" should still be 1 for DUP...
>>
>>> BOOM! There's the problem. The data_stripes only thinks about data that
>>> is horizontally spread over disks, and not vertically spread...
>>
>> Hm... no Hans, you're wrong.
>>
>>> What I would propose is changing...
>>>    the data_stripes = <blah> and afterwards trying to correct it with
>>> some ifs
>>> ...to...
>>>    a switch/case thing where the explicit logic is put to get the right
>>> value for each specific raid type.
>>>
>>> In case of DUP this simply means data_stripes = 2, because there is no
>>> need for fancy calculations about spreading DUP data over X devices.
>>> It's always 2 copies on 1 device.
>>
>> Eh, nope. 1.
>>
>> So, then I end up at the "stripe_size * data_stripes > max_chunk_size"
>> again.
>>
>> So, yes, Naohiro is right, and DUP is the only case in which this logic
>> breaks. DUP is the only one in which it this change makes a difference,
>> because it's the only one which has dev_stripes set to something else
>> than 1.
>
> If stripe_size doesn't incorrectly get set to dev_stripes * stripe_size
> max_avail above, the if works.
>
> The comments below still stand, because doing all those calculations
> with stripes and number of devices for something as predictable as DUP
> is only confusing. :D
>
>>
>> \o/
>>
>>> ...
>>>
>>> My general feeling when looking at the code, is that this single part of
>>> code is responsible for too many different cases, or, more possible
>>> cases than a developer can reason about at once "in his head" when
>>> working on it.
>>>
>>> 7 raid options * 3 different types (data, metadata, system) = 21
>>> already... Some parts of the algorithm make only sense for a subset of
>>> the combinations, but they're still part of the computation, which
>>> sometimes "by accident" results in the correct outcome. :)
>>>
>>> If it can't be done in a way that's easier to understand when reading
>>> the code, it should have unit tests with a list of known input/output to
>>> detect unwanted changes.

I agree. The code is too complicated for current RAID feature set,
though it may support future
expansion naturally.

>>>
>>
>>
> --
> Hans van Kranenburg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to