On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>>
>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>>>
>>>> Hi Qu,
>>>>
>>>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>>>
>>>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>>>>>
>>>>>>
>>>>>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" <rgold...@suse.de>
>>>>>> wrote:
>>>>
>>>> <snipped>
>>>>
>>>>>>
>>>>>>
>>>>>> Quota support was indeed active -- and it warned me that the qgroup
>>>>>> data was inconsistent.
>>>>>>
>>>>>> Disabling quotas had an immediate impact on balance throughput -- it's
>>>>>> *much* faster now!
>>>>>> From a quick glance at iostat I would guess it's at least a factor of
>>>>>> 100 faster.
>>>>>>
>>>>>>
>>>>>> Should quota support generally be disabled during balances? Or did I
>>>>>> somehow push my fs into a weird state where it triggered a slow path?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>    j
>>>>>
>>>>> Would you please provide the kernel version?
>>>>>
>>>>> v4.9 introduced a bad fix for qgroup balance, which not only fails to
>>>>> completely fix the qgroup byte leak, but also hugely slows down the
>>>>> balance process:
>>>>>
>>>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>>>> Author: Qu Wenruo <quwen...@cn.fujitsu.com>
>>>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>>>
>>>>>     btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>>>
>>>>> Sorry for that.
>>>>>
>>>>> And in v4.10, a better method was applied to fix the byte leaking
>>>>> problem; it should be a little faster than the previous one.
>>>>>
>>>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>>>> Author: Qu Wenruo <quwen...@cn.fujitsu.com>
>>>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>>>
>>>>>     btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>>>
>>>>>
>>>>> However, balancing with qgroups is still slower than balancing without
>>>>> them; the root fix requires reworking the current backref iteration.
>>>>>
>>>>
>>>> This patch has made btrfs balance performance worse. The balance
>>>> task has become more CPU intensive than before and takes longer
>>>> to complete, besides hogging resources. While correctness is important,
>>>> we need to figure out how this can be made more efficient.
>>>>
>>> The cause is already known.
>>>
>>> It's find_parent_node() that takes most of the time, finding all
>>> referencers of an extent.
>>>
>>> It's also the cause of the FIEMAP softlockup (fixed in a recent release
>>> by quitting early).
>>>
>>> The biggest problem is that the current find_parent_node() uses a list
>>> to iterate, which is quite slow, especially since it's done in a loop.
>>> In the real world find_parent_node() is about O(n^3).
>>> We can either improve find_parent_node() by using an rb_tree, or
>>> introduce some cache for find_parent_node().
>>
>>
>> Even if anyone is able to reduce that function's complexity from
>> O(n^3) down to, let's say, O(n^2) or O(n log n) for example, the current
>> implementation of qgroups will always be a problem. The real problem
>> is that this more recent rework of qgroups does all this accounting
>> inside the critical section of a transaction - blocking any other
>> tasks that want to start a new transaction or attempt to join the
>> current transaction. Not to mention that on systems with small amounts
>> of memory (2GB or 4GB from what I've seen in user reports) we also
>> OOM due to the allocation of a struct btrfs_qgroup_extent_record per
>> delayed data reference head; those records are used for that accounting
>> phase in the critical section of a transaction commit.
>>
>> Let's face it and be realistic: even if someone manages to make
>> find_parent_node() much, much better, like O(n) for example, it will
>> always be a problem for the reasons mentioned before. Many extents
>> touched per transaction and many subvolumes/snapshots will always
>> expose that root problem - doing the accounting in the transaction
>> commit critical section.
>
>
> You must accept the fact that we must call find_parent_node() at least twice
> to get the correct owner modification for each touched extent,
> or the qgroup numbers will never be correct.
>
> One call for old_roots, searching the commit root, and one for new_roots,
> searching the current root.
>
> You can call find_parent_node() as many times as you like, but that's just
> wasting CPU time.
>
> Only the final find_parent_node() will determine new_roots for that extent,
> and there is no better time to do it than at commit_transaction().

You're missing my point.

My point is not about how many times find_parent_nodes() gets called,
or whether it's needed at all. My point is about doing expensive
things inside the critical section of a transaction commit, which
leads not only to low performance but to a system becoming
unresponsive, with very high latency - and this is not theory or
speculation: there are upstream reports about this, as well as several
in SUSE's bugzilla, all caused when qgroups are enabled on 4.2+
kernels (when the last major qgroups changes landed).

Judging from that code, and from your replies to this and other threads,
it seems you haven't understood the consequences of doing all that
accounting inside the critical section of a transaction commit.

Forget find_parent_nodes() for a moment; yes, it has problems, but that's
not what I'm trying to make you understand. Just look at
btrfs_qgroup_account_extents(), called within the critical section -
it iterates over all the elements of a red-black tree, and each element
corresponds to some data extent allocated in the current transaction.
If we have thousands, tens of thousands, or more, then even if whatever
the loop calls had an awesome complexity of O(1) or O(log N) it would
still be bad, precisely because it blocks future transactions from
starting and tasks from joining the current transaction. CPU time and
memory consumption (used for those struct btrfs_qgroup_extent_record)
are also concerns, but to a smaller extent imo.
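
To make that concrete, here's a simplified sketch of the shape of that
loop - not the verbatim kernel code, and the helper names/signatures
vary a bit between versions, but the structure is what matters: one
full backref walk per touched extent, all executed while every other
task is held off the transaction:

    /*
     * Sketch of btrfs_qgroup_account_extents(), which runs inside the
     * transaction commit critical section. Every data extent touched
     * in this transaction has a btrfs_qgroup_extent_record queued in a
     * red-black tree, and each record costs a full backref walk here.
     */
    while ((node = rb_first(&delayed_refs->dirty_extent_root))) {
            record = rb_entry(node, struct btrfs_qgroup_extent_record,
                              node);
            /*
             * Backref walk to compute new_roots from the current root
             * (old_roots was computed earlier from the commit root).
             * This is where the find_parent_nodes() cost is paid, once
             * per extent record, with other tasks blocked.
             */
            btrfs_find_all_roots(trans, fs_info, record->bytenr,
                                 (u64)-1, &new_roots);
            /* Diff old_roots vs. new_roots, update qgroup counters. */
            btrfs_qgroup_account_extent(trans, fs_info, record->bytenr,
                                        record->num_bytes,
                                        record->old_roots, new_roots);
            rb_erase(node, &delayed_refs->dirty_extent_root);
            kfree(record);
    }

With N records queued, that's N backref walks - each itself expensive -
serialized inside the commit, which is why the stalls scale with the
number of extents touched per transaction.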

>
> Or you can waste more time calling find_parent_node() every time you
> touch an extent, saving one find_parent_node() call in commit_transaction()
> at the cost of more find_parent_node() calls elsewhere.
> Is that what you want?
>
> I can move the find_parent_node() call for old_roots out of
> commit_transaction(), but that will only cut the time spent in
> commit_transaction() by 50%.
>
> Compared to the O(n^3) find_parent_node(), that's not even the determining
> factor.
>
> Thanks,
> Qu
>
>
>>
>>>
>>>
>>> IIRC the SUSE guys (maybe Jeff?) are working on it with the first
>>> method, but I haven't heard anything about it recently.
>>>
>>> Thanks,
>>> Qu



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."