Re: Very slow balance / btrfs-transaction

2017-07-01 Thread Sidney San Martín
February 3, 2017 11:26 PM, "Goldwyn Rodrigues"  wrote:
> On 02/03/2017 04:13 PM, j...@capsec.org wrote:
> > Hi, 
> > 
> > 
> > I'm currently running a balance (without any filters) on a 4 drives raid1 
> > filesystem. The array contains 3 3TB drives and one 6TB drive; I'm running 
> > the rebalance because the 6TB drive recently replaced a 2TB drive. 
> > 
> > 
> > I know that balance is not supposed to be a fast operation, but this one is 
> > now running for ~6 days and it managed to balance ~18% (754 out of about 
> > 4250 
> > chunks balanced (755 considered),  82% left) -- so I expect it to take 
> > another ~4 weeks. 
> > 
> > That seems excessively slow for ~8TiB of data.
> > 
> > 
> > Is this expected behavior? In case it's not: Is there anything I can do to 
> > help debug it?
> 
> Do you have quotas enabled?
> 
> -- 
> Goldwyn

Just dropping in — I don't normally follow the list, but I found this thread 
while troubleshooting balance issues (kernel 4.11, converting raid1 to 
raid10). Disabling quotas had an immense impact on performance; notes to that 
effect would be helpful in *lots* of places. With quotas on, each block group 
took 30 minutes to over an hour to convert, and the system was only usable 
for a few seconds per iteration:

Jun 28 00:42:41 overkill kernel: BTRFS info (device sdc2): relocating block group 7141922439168 flags data|raid1
Jun 28 01:32:13 overkill kernel: BTRFS info (device sdc2): relocating block group 7140848697344 flags data|raid1
Jun 28 02:48:59 overkill kernel: BTRFS info (device sdc2): relocating block group 7139774955520 flags data|raid1
Jun 28 03:50:12 overkill kernel: BTRFS info (device sdc2): relocating block group 7138701213696 flags data|raid1
Jun 28 05:20:58 overkill kernel: BTRFS info (device sdc2): relocating block group 7137627471872 flags data|raid1
Jun 28 06:49:00 overkill kernel: BTRFS info (device sdc2): relocating block group 7136553730048 flags data|raid1
Jun 28 07:23:58 overkill kernel: BTRFS info (device sdc2): relocating block group 7135479988224 flags data|raid1
Jun 28 08:03:39 overkill kernel: BTRFS info (device sdc2): relocating block group 7134406246400 flags data|raid1
Jun 28 08:40:11 overkill kernel: BTRFS info (device sdc2): relocating block group 7133332504576 flags data|raid1
Jun 28 09:44:46 overkill kernel: BTRFS info (device sdc2): relocating block group 7132258762752 flags data|raid1
Jun 28 10:24:17 overkill kernel: BTRFS info (device sdc2): relocating block group 7131185020928 flags data|raid1
Jun 28 11:35:39 overkill kernel: BTRFS info (device sdc2): relocating block group 7130111279104 flags data|raid1
Jun 28 12:53:56 overkill kernel: BTRFS info (device sdc2): relocating block group 7129037537280 flags data|raid1
Jun 28 13:37:00 overkill kernel: BTRFS info (device sdc2): relocating block group 7127963795456 flags data|raid1
Jun 28 14:32:19 overkill kernel: BTRFS info (device sdc2): relocating block group 7126890053632 flags data|raid1
Jun 28 15:45:19 overkill kernel: BTRFS info (device sdc2): relocating block group 7125816311808 flags data|raid1
Jun 28 16:30:01 overkill kernel: BTRFS info (device sdc2): relocating block group 7124742569984 flags data|raid1
Jun 28 17:26:57 overkill kernel: BTRFS info (device sdc2): relocating block group 7123668828160 flags data|raid1
Jun 28 18:15:01 overkill kernel: BTRFS info (device sdc2): relocating block group 7122595086336 flags data|raid1
Jun 28 18:48:05 overkill kernel: BTRFS info (device sdc2): relocating block group 7121521344512 flags data|raid1
Jun 28 19:25:59 overkill kernel: BTRFS info (device sdc2): relocating block group 7120447602688 flags data|raid1
Jun 28 19:55:46 overkill kernel: BTRFS info (device sdc2): relocating block group 7119373860864 flags data|raid1
Jun 28 20:30:41 overkill kernel: BTRFS info (device sdc2): relocating block group 7118300119040 flags data|raid1
Jun 28 21:28:43 overkill kernel: BTRFS info (device sdc2): relocating block group 7117226377216 flags data|raid1
Jun 28 22:55:34 overkill kernel: BTRFS info (device sdc2): relocating block group 7114005151744 flags data|raid1
Jun 28 23:19:06 overkill kernel: BTRFS info (device sdc2): relocating block group 7110783926272 flags data|raid1

With quotas off, it takes ~20 seconds to convert each block group and the 
system is completely usable:

Jul 01 09:56:42 overkill kernel: BTRFS info (device sde): relocating block group 7085014122496 flags data|raid1
Jul 01 09:56:59 overkill kernel: BTRFS info (device sde): relocating block group 7083940380672 flags data|raid1
Jul 01 09:57:18 overkill kernel: BTRFS info (device sde): relocating block group 7082866638848 flags data|raid1
Jul 01 09:57:39 overkill kernel: BTRFS info (device sde): relocating block group 7081792897024 flags data|raid1
Jul 01 09:58:01 overkill kernel: BTRFS info (device

Re: Very slow balance / btrfs-transaction

2017-02-08 Thread Qu Wenruo



At 02/08/2017 09:56 PM, Filipe Manana wrote:

On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo  wrote:



At 02/07/2017 11:55 PM, Filipe Manana wrote:


On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo 
wrote:




At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:




Hi Qu,

On 02/05/2017 07:45 PM, Qu Wenruo wrote:





At 02/04/2017 09:47 AM, Jorg Bornschein wrote:



February 4, 2017 1:07 AM, "Goldwyn Rodrigues" 
wrote:









Quota support was indeed active -- and it warned me that the qgroup
data was inconsistent.

Disabling quotas had an immediate impact on balance throughput -- it's
*much* faster now!
From a quick glance at iostat I would guess it's at least a factor 100
faster.


Should quota support generally be disabled during balances? Or did I
somehow push my fs into a weird state where it triggered a slow-path?



Thanks!

   j




Would you please provide the kernel version?

v4.9 introduced a bad fix for qgroup balance, which doesn't completely
fix qgroup byte leaking, and also hugely slows down the balance
process:

commit 62b99540a1d91e46422f0e04de50fc723812c421
Author: Qu Wenruo 
Date:   Mon Aug 15 10:36:51 2016 +0800

btrfs: relocation: Fix leaking qgroups numbers on data extents

Sorry for that.

And in v4.10, a better method is applied to fix the byte leaking
problem, and it should be a little faster than the previous one.

commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
Author: Qu Wenruo 
Date:   Tue Oct 18 09:31:29 2016 +0800

btrfs: qgroup: Fix qgroup data leaking by using subtree tracing


However, using balance with qgroup is still slower than balance without
qgroup; the root fix requires reworking the current backref iteration.



This patch has made the btrfs balance performance worse. The balance
task has become more CPU intensive compared to earlier and takes longer
to complete, besides hogging resources. While correctness is important,
we need to figure out how this can be made more efficient.


The cause is already known.

It's find_parent_node() which takes most of the time to find all
referencers of an extent.

And it's also the cause of the FIEMAP softlockup (fixed in a recent
release by quitting early).

The biggest problem is that the current find_parent_node() uses a list to
iterate, which is quite slow, especially as it's done in a loop.
In the real world find_parent_node() is about O(n^3).
We can either improve find_parent_node() by using an rb_tree, or
introduce some cache for find_parent_node().



Even if anyone is able to reduce that function's complexity from
O(n^3) down to, let's say, O(n^2) or O(n log n), the current
implementation of qgroups will always be a problem. The real problem
is that this more recent rework of qgroups does all this accounting
inside the critical section of a transaction - blocking any other
tasks that want to start a new transaction or attempt to join the
current transaction. Not to mention that on systems with small amounts
of memory (2GB or 4GB, from what I've seen in user reports) we also
OOM due to this allocation of a struct btrfs_qgroup_extent_record per
delayed data reference head, which are used for that accounting phase
in the critical section of a transaction commit.

Let's face it and be realistic, even if someone manages to make
find_parent_node() much much better, like O(n) for example, it will
always be a problem due to the reasons mentioned before. Many extents
touched per transaction and many subvolumes/snapshots, will always
expose that root problem - doing the accounting in the transaction
commit critical section.



You must accept the fact that we must call find_parent_node() at least twice
to get the correct owner modification for each touched extent.
Or the qgroup numbers will never be correct.

One for old_roots by searching the commit root, and one for new_roots by
searching the current root.

You can call find_parent_node() as many times as you like, but that's just
wasting your CPU time.

Only the final find_parent_node() will determine new_roots for that extent,
and there is no better timing than commit_transaction().


You're missing my point.

My point is not about needing to call find_parent_nodes() nor how many
times to call it, or whether it's needed or not. My point is about
doing expensive things inside the critical section of a transaction
commit, which leads not only to low performance but to the system
becoming unresponsive, with very high latency - and this is not
theory or speculation; there are upstream reports about this as well
as several in SUSE's bugzilla, all caused when qgroups are enabled on
4.2+ kernels (when the last major qgroups changes landed).

Judging from that code and from your reply to this and other threads
it seems you didn't understand the consequences of doing all that
accounting stuff inside the critical section of a transaction commit.


NO, I know what you're talking about.
Or I won't send the patch to 

Re: Very slow balance / btrfs-transaction

2017-02-08 Thread Filipe Manana
On Wed, Feb 8, 2017 at 12:39 AM, Qu Wenruo  wrote:
>
>
> At 02/07/2017 11:55 PM, Filipe Manana wrote:
>>
>> On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo 
>> wrote:
>>>
>>>
>>>
>>> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:



 Hi Qu,

 On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>
>
>
>
> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>>
>>
>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" 
>> wrote:



 

>>
>>
>> Quota support was indeed active -- and it warned me that the qgroup
>> data was inconsistent.
>>
>> Disabling quotas had an immediate impact on balance throughput -- it's
>> *much* faster now!
>> From a quick glance at iostat I would guess it's at least a factor 100
>> faster.
>>
>>
>> Should quota support generally be disabled during balances? Or did I
>> somehow push my fs into a weird state where it triggered a slow-path?
>>
>>
>>
>> Thanks!
>>
>>j
>
>
>
> Would you please provide the kernel version?
>
> v4.9 introduced a bad fix for qgroup balance, which doesn't completely
> fix qgroup byte leaking, and also hugely slows down the balance
> process:
>
> commit 62b99540a1d91e46422f0e04de50fc723812c421
> Author: Qu Wenruo 
> Date:   Mon Aug 15 10:36:51 2016 +0800
>
> btrfs: relocation: Fix leaking qgroups numbers on data extents
>
> Sorry for that.
>
> And in v4.10, a better method is applied to fix the byte leaking
> problem, and it should be a little faster than the previous one.
>
> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
> Author: Qu Wenruo 
> Date:   Tue Oct 18 09:31:29 2016 +0800
>
> btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>
>
> However, using balance with qgroup is still slower than balance without
> qgroup; the root fix requires reworking the current backref iteration.
>

 This patch has made the btrfs balance performance worse. The balance
 task has become more CPU intensive compared to earlier and takes longer
 to complete, besides hogging resources. While correctness is important,
 we need to figure out how this can be made more efficient.

>>> The cause is already known.
>>>
>>> It's find_parent_node() which takes most of the time to find all
>>> referencers
>>> of an extent.
>>>
>>> And it's also the cause of the FIEMAP softlockup (fixed in a recent
>>> release by quitting early).
>>>
>>> The biggest problem is that the current find_parent_node() uses a list
>>> to iterate, which is quite slow, especially as it's done in a loop.
>>> In the real world find_parent_node() is about O(n^3).
>>> We can either improve find_parent_node() by using an rb_tree, or
>>> introduce some cache for find_parent_node().
>>
>>
>> Even if anyone is able to reduce that function's complexity from
>> O(n^3) down to, let's say, O(n^2) or O(n log n), the current
>> implementation of qgroups will always be a problem. The real problem
>> is that this more recent rework of qgroups does all this accounting
>> inside the critical section of a transaction - blocking any other
>> tasks that want to start a new transaction or attempt to join the
>> current transaction. Not to mention that on systems with small amounts
>> of memory (2GB or 4GB, from what I've seen in user reports) we also
>> OOM due to this allocation of a struct btrfs_qgroup_extent_record per
>> delayed data reference head, which are used for that accounting phase
>> in the critical section of a transaction commit.
>>
>> Let's face it and be realistic, even if someone manages to make
>> find_parent_node() much much better, like O(n) for example, it will
>> always be a problem due to the reasons mentioned before. Many extents
>> touched per transaction and many subvolumes/snapshots, will always
>> expose that root problem - doing the accounting in the transaction
>> commit critical section.
>
>
> You must accept the fact that we must call find_parent_node() at least twice
> to get the correct owner modification for each touched extent.
> Or the qgroup numbers will never be correct.
>
> One for old_roots by searching the commit root, and one for new_roots by
> searching the current root.
>
> You can call find_parent_node() as many times as you like, but that's just
> wasting your CPU time.
>
> Only the final find_parent_node() will determine new_roots for that extent,
> and there is no better timing than commit_transaction().

You're missing my point.

My point is not about needing to call find_parent_nodes() nor how many
times to call it, or whether it's needed or not. My point is about
doing expensive things inside the critical section of a transaction
commit, which leads not only to low performance but to the system
becoming unresponsive, with very high latency - and this is not
theory or speculation; there are upstream reports about this as well
as several in SUSE's bugzilla, all caused when qgroups are enabled on
4.2+ kernels (when the last major qgroups changes landed).

Judging from that code and from your reply to this and other threads
it seems you didn't understand the consequences of doing all that
accounting stuff inside the critical section of a transaction commit.

Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Qu Wenruo



At 02/07/2017 11:55 PM, Filipe Manana wrote:

On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo  wrote:



At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:



Hi Qu,

On 02/05/2017 07:45 PM, Qu Wenruo wrote:




At 02/04/2017 09:47 AM, Jorg Bornschein wrote:


February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:








Quota support was indeed active -- and it warned me that the qgroup
data was inconsistent.

Disabling quotas had an immediate impact on balance throughput -- it's
*much* faster now!
From a quick glance at iostat I would guess it's at least a factor 100
faster.


Should quota support generally be disabled during balances? Or did I
somehow push my fs into a weird state where it triggered a slow-path?



Thanks!

   j



Would you please provide the kernel version?

v4.9 introduced a bad fix for qgroup balance, which doesn't completely
fix qgroup byte leaking, and also hugely slows down the balance process:

commit 62b99540a1d91e46422f0e04de50fc723812c421
Author: Qu Wenruo 
Date:   Mon Aug 15 10:36:51 2016 +0800

btrfs: relocation: Fix leaking qgroups numbers on data extents

Sorry for that.

And in v4.10, a better method is applied to fix the byte leaking
problem, and it should be a little faster than the previous one.

commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
Author: Qu Wenruo 
Date:   Tue Oct 18 09:31:29 2016 +0800

btrfs: qgroup: Fix qgroup data leaking by using subtree tracing


However, using balance with qgroup is still slower than balance without
qgroup; the root fix requires reworking the current backref iteration.



This patch has made the btrfs balance performance worse. The balance
task has become more CPU intensive compared to earlier and takes longer
to complete, besides hogging resources. While correctness is important,
we need to figure out how this can be made more efficient.


The cause is already known.

It's find_parent_node() which takes most of the time to find all referencers
of an extent.

And it's also the cause of the FIEMAP softlockup (fixed in a recent release
by quitting early).

The biggest problem is that the current find_parent_node() uses a list to
iterate, which is quite slow, especially as it's done in a loop.
In the real world find_parent_node() is about O(n^3).
We can either improve find_parent_node() by using an rb_tree, or introduce
some cache for find_parent_node().


Even if anyone is able to reduce that function's complexity from
O(n^3) down to, let's say, O(n^2) or O(n log n), the current
implementation of qgroups will always be a problem. The real problem
is that this more recent rework of qgroups does all this accounting
inside the critical section of a transaction - blocking any other
tasks that want to start a new transaction or attempt to join the
current transaction. Not to mention that on systems with small amounts
of memory (2GB or 4GB, from what I've seen in user reports) we also
OOM due to this allocation of a struct btrfs_qgroup_extent_record per
delayed data reference head, which are used for that accounting phase
in the critical section of a transaction commit.

Let's face it and be realistic, even if someone manages to make
find_parent_node() much much better, like O(n) for example, it will
always be a problem due to the reasons mentioned before. Many extents
touched per transaction and many subvolumes/snapshots, will always
expose that root problem - doing the accounting in the transaction
commit critical section.


You must accept the fact that we must call find_parent_node() at least 
twice to get the correct owner modification for each touched extent.

Or the qgroup numbers will never be correct.

One for old_roots by searching the commit root, and one for new_roots by 
searching the current root.
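
A minimal user-space sketch of that two-pass idea (a simplified model with
made-up root IDs, not the kernel's qgroup code): compare the set of roots
referencing an extent before the transaction with the set after it, and
charge or release only the difference.

/* Simplified model: old_roots comes from the commit root, new_roots
 * from the current root; only the difference changes qgroup numbers.
 * All values here are hypothetical. */
#include <stdio.h>

static int in_set(const int *set, int n, int v)
{
	for (int i = 0; i < n; i++)
		if (set[i] == v)
			return 1;
	return 0;
}

int main(void)
{
	int old_roots[] = { 256, 257 };      /* referencing roots, before */
	int new_roots[] = { 257, 258, 259 }; /* referencing roots, after  */
	int extent_bytes = 16384;

	for (int i = 0; i < 3; i++)          /* newly added references */
		if (!in_set(old_roots, 2, new_roots[i]))
			printf("qgroup %d: +%d bytes\n", new_roots[i], extent_bytes);
	for (int i = 0; i < 2; i++)          /* dropped references */
		if (!in_set(new_roots, 3, old_roots[i]))
			printf("qgroup %d: -%d bytes\n", old_roots[i], extent_bytes);
	return 0;
}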


You can call find_parent_node() as many times as you like, but that's 
just wasting your CPU time.


Only the final find_parent_node() will determine new_roots for that 
extent, and there is no better timing than commit_transaction().


Or you can waste more time calling find_parent_node() every time you 
touch an extent, saving one find_parent_node() in commit_transaction() 
at the cost of more find_parent_node() calls elsewhere.

Is that what you want?

I can move the find_parent_node() for old_roots out of commit_transaction(),
but that will only cut about 50% of the time spent in commit_transaction().

Compared to the O(n^3) find_parent_node(), that's not even the determining factor.

Thanks,
Qu






IIRC the SUSE guys (maybe Jeff?) are working on it with the first method, but I
haven't heard anything about it recently.

Thanks,
Qu











Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Austin S. Hemmelgarn

On 2017-02-07 14:47, Kai Krakow wrote:

On Mon, 6 Feb 2017 08:19:37 -0500,
"Austin S. Hemmelgarn"  wrote:


MDRAID uses stripe selection based on latency and other measurements
(like head position). It would be nice if btrfs implemented similar
functionality. This would also be helpful for selecting a disk if
there're more disks than stripesets (for example, I have 3 disks in
my btrfs array). This could write new blocks to the most idle disk
always. I think this wasn't covered by the above mentioned patch.
Currently, selection is based only on the disk with most free
space.

You're confusing read selection and write selection.  MDADM and
DM-RAID both use a load-balancing read selection algorithm that takes
latency and other factors into account.  However, they use a
round-robin write selection algorithm that only cares about the
position of the block in the virtual device modulo the number of
physical devices.


Thanks for clearing that point.


As an example, say you have a 3 disk RAID10 array set up using MDADM
(this is functionally the same as a 3-disk raid1 mode BTRFS
filesystem). Every third block starting from block 0 will be on disks
1 and 2, every third block starting from block 1 will be on disks 3
and 1, and every third block starting from block 2 will be on disks 2
and 3.  No latency measurements are taken, literally nothing is
factored in except the block's position in the virtual device.


I didn't know MDADM can use RAID10 on an odd number of disks...
Nice. I'll keep that in mind. :-)
It's one of those neat features that I stumbled across by accident a 
while back that not many people know about.  It's kind of ironic when 
you think about it too, since the MD RAID10 profile with only 2 replicas 
is actually a more accurate comparison for the BTRFS raid1 profile than 
the MD RAID1 profile.  FWIW, it can (somewhat paradoxically) sometimes 
get better read and write performance than MD RAID0 across the same 
number of disks.




Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Kai Krakow
On Mon, 6 Feb 2017 08:19:37 -0500,
"Austin S. Hemmelgarn"  wrote:

> > MDRAID uses stripe selection based on latency and other measurements
> > (like head position). It would be nice if btrfs implemented similar
> > functionality. This would also be helpful for selecting a disk if
> > there're more disks than stripesets (for example, I have 3 disks in
> > my btrfs array). This could write new blocks to the most idle disk
> > always. I think this wasn't covered by the above mentioned patch.
> > Currently, selection is based only on the disk with most free
> > space.  
> You're confusing read selection and write selection.  MDADM and
> DM-RAID both use a load-balancing read selection algorithm that takes
> latency and other factors into account.  However, they use a
> round-robin write selection algorithm that only cares about the
> position of the block in the virtual device modulo the number of
> physical devices.

Thanks for clearing that point.

> As an example, say you have a 3 disk RAID10 array set up using MDADM 
> (this is functionally the same as a 3-disk raid1 mode BTRFS
> filesystem). Every third block starting from block 0 will be on disks
> 1 and 2, every third block starting from block 1 will be on disks 3
> and 1, and every third block starting from block 2 will be on disks 2
> and 3.  No latency measurements are taken, literally nothing is
> factored in except the block's position in the virtual device.

I didn't know MDADM can use RAID10 on an odd number of disks...
Nice. I'll keep that in mind. :-)


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Filipe Manana
On Tue, Feb 7, 2017 at 12:22 AM, Qu Wenruo  wrote:
>
>
> At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:
>>
>>
>> Hi Qu,
>>
>> On 02/05/2017 07:45 PM, Qu Wenruo wrote:
>>>
>>>
>>>
>>> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:

 February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:
>>
>>
>> 
>>


 Quota support was indeed active -- and it warned me that the qgroup
 data was inconsistent.

 Disabling quotas had an immediate impact on balance throughput -- it's
 *much* faster now!
 From a quick glance at iostat I would guess it's at least a factor 100
 faster.


 Should quota support generally be disabled during balances? Or did I
 somehow push my fs into a weird state where it triggered a slow-path?



 Thanks!

j
>>>
>>>
>>> Would you please provide the kernel version?
>>>
>>> v4.9 introduced a bad fix for qgroup balance, which doesn't completely
>>> fix qgroup byte leaking, and also hugely slows down the balance process:
>>>
>>> commit 62b99540a1d91e46422f0e04de50fc723812c421
>>> Author: Qu Wenruo 
>>> Date:   Mon Aug 15 10:36:51 2016 +0800
>>>
>>> btrfs: relocation: Fix leaking qgroups numbers on data extents
>>>
>>> Sorry for that.
>>>
>>> And in v4.10, a better method is applied to fix the byte leaking
>>> problem, and it should be a little faster than the previous one.
>>>
>>> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
>>> Author: Qu Wenruo 
>>> Date:   Tue Oct 18 09:31:29 2016 +0800
>>>
>>> btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
>>>
>>>
>>> However, using balance with qgroup is still slower than balance without
>>> qgroup; the root fix requires reworking the current backref iteration.
>>>
>>
>> This patch has made the btrfs balance performance worse. The balance
>> task has become more CPU intensive compared to earlier and takes longer
>> to complete, besides hogging resources. While correctness is important,
>> we need to figure out how this can be made more efficient.
>>
> The cause is already known.
>
> It's find_parent_node() which takes most of the time to find all
> referencers of an extent.
>
> And it's also the cause of the FIEMAP softlockup (fixed in a recent
> release by quitting early).
>
> The biggest problem is that the current find_parent_node() uses a list to
> iterate, which is quite slow, especially as it's done in a loop.
> In the real world find_parent_node() is about O(n^3).
> We can either improve find_parent_node() by using an rb_tree, or introduce
> some cache for find_parent_node().

Even if anyone is able to reduce that function's complexity from
O(n^3) down to, let's say, O(n^2) or O(n log n), the current
implementation of qgroups will always be a problem. The real problem
is that this more recent rework of qgroups does all this accounting
inside the critical section of a transaction - blocking any other
tasks that want to start a new transaction or attempt to join the
current transaction. Not to mention that on systems with small amounts
of memory (2GB or 4GB, from what I've seen in user reports) we also
OOM due to this allocation of a struct btrfs_qgroup_extent_record per
delayed data reference head, which are used for that accounting phase
in the critical section of a transaction commit.

Let's face it and be realistic, even if someone manages to make
find_parent_node() much much better, like O(n) for example, it will
always be a problem due to the reasons mentioned before. Many extents
touched per transaction and many subvolumes/snapshots, will always
expose that root problem - doing the accounting in the transaction
commit critical section.
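
To make the structural point concrete, a generic pthreads sketch (invented
names and timings, not btrfs code): any work done while holding the
"transaction" lock stalls every task that wants to join the transaction.

/* Sketch: the "commit" holds the lock while doing expensive accounting;
 * all writers block until it finishes. Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t trans_lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg)
{
	long id = (long)arg;

	pthread_mutex_lock(&trans_lock);   /* "join the transaction" */
	printf("writer %ld finally ran\n", id);
	pthread_mutex_unlock(&trans_lock);
	return NULL;
}

int main(void)
{
	pthread_t w[4];
	long i;

	pthread_mutex_lock(&trans_lock);   /* commit critical section begins */
	for (i = 0; i < 4; i++)
		pthread_create(&w[i], NULL, writer, (void *)i);

	sleep(3);  /* stand-in for per-extent backref walks and accounting */

	pthread_mutex_unlock(&trans_lock); /* only now can the writers proceed */
	for (i = 0; i < 4; i++)
		pthread_join(w[i], NULL);
	return 0;
}

The longer the accounting runs, the longer every writer above is stalled,
which is exactly the unresponsiveness pattern in the reports.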

>
>
> IIRC the SUSE guys (maybe Jeff?) are working on it with the first method, but I
> haven't heard anything about it recently.
>
> Thanks,
> Qu
>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: Very slow balance / btrfs-transaction

2017-02-06 Thread Qu Wenruo



At 02/07/2017 12:09 AM, Goldwyn Rodrigues wrote:


Hi Qu,

On 02/05/2017 07:45 PM, Qu Wenruo wrote:



At 02/04/2017 09:47 AM, Jorg Bornschein wrote:

February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:







Quota support was indeed active -- and it warned me that the qgroup
data was inconsistent.

Disabling quotas had an immediate impact on balance throughput -- it's
*much* faster now!
From a quick glance at iostat I would guess it's at least a factor 100
faster.


Should quota support generally be disabled during balances? Or did I
somehow push my fs into a weird state where it triggered a slow-path?



Thanks!

   j


Would you please provide the kernel version?

v4.9 introduced a bad fix for qgroup balance, which doesn't completely
fix qgroup byte leaking, and also hugely slows down the balance process:

commit 62b99540a1d91e46422f0e04de50fc723812c421
Author: Qu Wenruo 
Date:   Mon Aug 15 10:36:51 2016 +0800

btrfs: relocation: Fix leaking qgroups numbers on data extents

Sorry for that.

And in v4.10, a better method is applied to fix the byte leaking
problem, and it should be a little faster than the previous one.

commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
Author: Qu Wenruo 
Date:   Tue Oct 18 09:31:29 2016 +0800

btrfs: qgroup: Fix qgroup data leaking by using subtree tracing


However, using balance with qgroup is still slower than balance without
qgroup; the root fix requires reworking the current backref iteration.



This patch has made the btrfs balance performance worse. The balance
task has become more CPU intensive compared to earlier and takes longer
to complete, besides hogging resources. While correctness is important,
we need to figure out how this can be made more efficient.


The cause is already known.

It's find_parent_node() which takes most of the time to find all 
referencers of an extent.


And it's also the cause of the FIEMAP softlockup (fixed in a recent 
release by quitting early).


The biggest problem is that the current find_parent_node() uses a list to 
iterate, which is quite slow, especially as it's done in a loop.

In the real world find_parent_node() is about O(n^3).
We can either improve find_parent_node() by using an rb_tree, or introduce 
some cache for find_parent_node().
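
For a rough feel of the asymptotic difference, a user-space C sketch (not
kernel code; N and the data are arbitrary): a linear scan per lookup, as
with a plain list, goes quadratic over N references, while a sorted
structure keeps each lookup near O(log n).

/* Compare per-lookup cost: linear scan (list-like) vs. binary search
 * (tree-like) over the same N pseudo-random "references". */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000

static int cmp_u64(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;

	return (x > y) - (x < y);
}

int main(void)
{
	unsigned long long *refs = malloc(N * sizeof(*refs));
	long hits = 0;
	clock_t t;
	int i, j;

	for (i = 0; i < N; i++)
		refs[i] = ((unsigned long long)rand() << 16) ^ rand();

	t = clock();                       /* list-like: O(N) per lookup */
	for (i = 0; i < N; i++)
		for (j = 0; j < N; j++)
			if (refs[j] == refs[i]) { hits++; break; }
	printf("linear: %ld hits, %.2fs\n", hits,
	       (double)(clock() - t) / CLOCKS_PER_SEC);

	qsort(refs, N, sizeof(*refs), cmp_u64);
	t = clock();                       /* tree-like: O(log N) per lookup */
	hits = 0;
	for (i = 0; i < N; i++)
		if (bsearch(&refs[i], refs, N, sizeof(*refs), cmp_u64))
			hits++;
	printf("sorted: %ld hits, %.2fs\n", hits,
	       (double)(clock() - t) / CLOCKS_PER_SEC);

	free(refs);
	return 0;
}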



IIRC the SUSE guys (maybe Jeff?) are working on it with the first method, but 
I haven't heard anything about it recently.


Thanks,
Qu




Re: Very slow balance / btrfs-transaction

2017-02-06 Thread Goldwyn Rodrigues

Hi Qu,

On 02/05/2017 07:45 PM, Qu Wenruo wrote:
> 
> 
> At 02/04/2017 09:47 AM, Jorg Bornschein wrote:
>> February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:



>>
>>
>> Quota support was indeed active -- and it warned me that the qgroup
>> data was inconsistent.
>>
>> Disabling quotas had an immediate impact on balance throughput -- it's
>> *much* faster now!
>> From a quick glance at iostat I would guess it's at least a factor 100
>> faster.
>>
>>
>> Should quota support generally be disabled during balances? Or did I
>> somehow push my fs into a weird state where it triggered a slow-path?
>>
>>
>>
>> Thanks!
>>
>>j
> 
> Would you please provide the kernel version?
> 
> v4.9 introduced a bad fix for qgroup balance, which doesn't completely
> fix qgroup byte leaking, and also hugely slows down the balance process:
> 
> commit 62b99540a1d91e46422f0e04de50fc723812c421
> Author: Qu Wenruo 
> Date:   Mon Aug 15 10:36:51 2016 +0800
> 
> btrfs: relocation: Fix leaking qgroups numbers on data extents
> 
> Sorry for that.
> 
> And in v4.10, a better method is applied to fix the byte leaking
> problem, and it should be a little faster than the previous one.
> 
> commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
> Author: Qu Wenruo 
> Date:   Tue Oct 18 09:31:29 2016 +0800
> 
> btrfs: qgroup: Fix qgroup data leaking by using subtree tracing
> 
> 
> However, using balance with qgroup is still slower than balance without
> qgroup; the root fix requires reworking the current backref iteration.
> 

This patch has made the btrfs balance performance worse. The balance
task has become more CPU intensive compared to earlier and takes longer
to complete, besides hogging resources. While correctness is important,
we need to figure out how this can be made more efficient.

-- 
Goldwyn


Re: Very slow balance / btrfs-transaction

2017-02-06 Thread Austin S. Hemmelgarn

On 2017-02-04 16:10, Kai Krakow wrote:

On Sat, 04 Feb 2017 20:50:03 +,
"Jorg Bornschein"  wrote:


February 4, 2017 1:07 AM, "Goldwyn Rodrigues" 
wrote:


Yes, please check if disabling quotas makes a difference in
execution time of btrfs balance.


Just FYI: With quotas disabled it took ~20h to finish the balance
instead of the projected >30 days. Therefore, in my case, there was a
speedup factor of ~35.


and thanks for the quick reply! (and for btrfs in general!)


BTW: I'm wondering how much sense it makes to activate the underlying
bcache for my raid1 fs again. I guess btrfs chooses randomly (or
based on predicted disk latency?) which copy of a given extent to
load?


As far as I know, it uses PID modulo only currently, no round-robin,
no random value. There are no performance optimizations going into btrfs
yet because there're still a lot of ongoing feature implementations.

I think there were patches to include a rotator value in the stripe
selection. They don't apply to the current kernel. I tried it once and
didn't see any subjective difference for normal desktop workloads. But
that's probably because I use RAID1 for metadata only.
I had tested similar patches myself using raid1 for everything, and saw 
near zero improvement unless I explicitly tried to create a worst-case 
performance situation.  The reality is that the current algorithm is 
actually remarkably close to being optimal for most use cases while 
using an insanely small amount of processing power and memory compared 
to an optimal algorithm (and a truly optimal algorithm is in fact 
functionally impossible in almost all cases because it would require 
predicting the future).


MDRAID uses stripe selection based on latency and other measurements
(like head position). It would be nice if btrfs implemented similar
functionality. This would also be helpful for selecting a disk if
there're more disks than stripesets (for example, I have 3 disks in my
btrfs array). This could write new blocks to the most idle disk always.
I think this wasn't covered by the above mentioned patch. Currently,
selection is based only on the disk with most free space.
You're confusing read selection and write selection.  MDADM and DM-RAID 
both use a load-balancing read selection algorithm that takes latency 
and other factors into account.  However, they use a round-robin write 
selection algorithm that only cares about the position of the block in 
the virtual device modulo the number of physical devices.


As an example, say you have a 3 disk RAID10 array set up using MDADM 
(this is functionally the same as a 3-disk raid1 mode BTRFS filesystem). 
 Every third block starting from block 0 will be on disks 1 and 2, 
every third block starting from block 1 will be on disks 3 and 1, and 
every third block starting from block 2 will be on disks 2 and 3.  No 
latency measurements are taken, literally nothing is factored in except 
the block's position in the virtual device.
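
That placement is pure arithmetic; a small sketch of it, assuming MD's
raid10 "near=2" layout generalized to n devices (disk numbers are 1-based
to match the description above):

/* Print which two disks hold each block in a 2-copy "near" layout. */
#include <stdio.h>

int main(void)
{
	int n = 3;                          /* physical devices in the array */

	for (long block = 0; block < 6; block++) {
		int first  = (int)((2 * block) % n) + 1;      /* first copy  */
		int second = (int)((2 * block + 1) % n) + 1;  /* second copy */
		printf("block %ld -> disks %d and %d\n", block, first, second);
	}
	return 0;
}

Running it reproduces the three-block cycle described above, with nothing
factored in besides the block's position.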


Now, that said, BTRFS does behave differently under the same 
circumstances, but this is because the striping is different for BTRFS. 
It happens at the chunk level instead of the block level.  If we look at 
an example using the same 3 devices as the MDADM example, and then for 
simplicity assume that you end up allocating alternating data and 
metadata chunks, things might look a bit like this:

* System chunk: Device 1 and 2
* Metadata chunk 0: Device 3 and 1
* Data chunk 0: Device 2 and 3
* Metadata chunk 1: Device 1 and 2
* Data chunk 1: Device 1 and 2
Overall, there is technically a pattern, but it's got a very long 
repetition period.  This is still however a near optimal allocation 
pattern given the constraints.  It also gives (just like the MDADM and 
DM-RAID method) 100% deterministic behavior, the only difference is it 
depends on a slightly different factor.  Changing this to select the 
most idle disk as you suggest would remove that determinism, increase 
the likelihood of sub-optimal layouts in terms of space usage, increase 
the number of cases where you could get ENOSPC, and provide near zero 
net performance benefit except under heavy load.  IOW, it would provide 
a pretty negative net benefit.


What actually needs to happen to improve write performance is that BTRFS 
needs to quit serializing writes when writing chunks across multiple 
devices.  In the case of a raid1 setup, it writes first to one device, 
then the other, alternating back and forth as it updates each extent. 
This combined with the COW behavior causing write amplification is what 
makes write performance so horrible for BTRFS compared to MDADM or 
DM-RAID.  It's not that we have bad device selection for writes, it's 
that we don't even try to do any kind of practical parallelization 
despite it being an embarrassingly parallel task (and yes, that 
seriously is what something that's trivial to parallelize is called in 
scientific papers...).
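
A sketch of that difference with plain pthreads (not btrfs internals; a
sleep stands in for one device's write latency): issuing both mirror writes
before waiting costs roughly one device latency instead of two.

/* Serialized vs. parallel "mirror writes". Compile with -lpthread. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static void *dev_write(void *arg)
{
	(void)arg;
	usleep(200 * 1000);  /* stand-in for one device's write latency */
	return NULL;
}

static double ms_since(struct timespec t0)
{
	struct timespec t1;

	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
	struct timespec t0;
	pthread_t a, b;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	dev_write(NULL);                   /* first mirror, then the other */
	dev_write(NULL);
	printf("serialized: %.0f ms\n", ms_since(t0));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&a, NULL, dev_write, NULL);  /* submit both */
	pthread_create(&b, NULL, dev_write, NULL);
	pthread_join(a, NULL);                      /* then wait once */
	pthread_join(b, NULL);
	printf("parallel:   %.0f ms\n", ms_since(t0));
	return 0;
}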


Re: Very slow balance / btrfs-transaction

2017-02-06 Thread Qu Wenruo



At 02/06/2017 05:14 PM, Jorg Bornschein wrote:

February 6, 2017 1:45 AM, "Qu Wenruo" 


Would you please provide the kernel version?

v4.9 introduced a bad fix for qgroup balance, which doesn't completely fix 
qgroup byte leaking,
and also hugely slows down the balance process:



I'm a bit behind the times: 4.8.13-1-ARCH



   j



Unfortunately, v4.8 also has that bad commit :(.

So if you have spare time, you could try v4.10,
although for Arch Linux it will take some time before v4.10 moves from 
[testing] to [core].


Thanks,
Qu




Re: Very slow balance / btrfs-transaction

2017-02-06 Thread Jorg Bornschein
February 6, 2017 1:45 AM, "Qu Wenruo"  

> Would you please provide the kernel version?
> 
> v4.9 introduced a bad fix for qgroup balance, which doesn't completely fix 
> qgroup byte leaking,
> and also hugely slows down the balance process:
>

I'm a bit behind the times: 4.8.13-1-ARCH



   j


Re: Very slow balance / btrfs-transaction

2017-02-05 Thread Qu Wenruo



At 02/04/2017 09:47 AM, Jorg Bornschein wrote:

February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:


On 02/03/2017 06:30 PM, Jorg Bornschein wrote:


February 3, 2017 11:26 PM, "Goldwyn Rodrigues"  wrote:

Hi,

I'm currently running a balance (without any filters) on a 4 drives raid1 
filesystem. The array
contains 3 3TB drives and one 6TB drive; I'm running the rebalance because the 
6TB drive recently
replaced a 2TB drive.

I know that balance is not supposed to be a fast operation, but this one is now 
running for ~6 days
and it managed to balance ~18% (754 out of about 4250 chunks balanced (755 
considered), 82% left)
-- so I expect it to take another ~4 weeks.

That seems excessively slow for ~8TiB of data.

Is this expected behavior? In case it's not: Is there anything I can do to help 
debug it?

Do you have quotas enabled?


I might have activated it when playing with "snapper" -- I remember using some 
quota command
without knowing what it does.

How can I check if it's active? Shall I just disable it with "btrfs quota disable"?


To check your quota limits:
# btrfs qgroup show 

To disable
# btrfs quota disable 

Yes, please check if disabling quotas makes a difference in execution
time of btrfs balance.



Quota support was indeed active -- and it warned me that the qgroup data was 
inconsistent.

Disabling quotas had an immediate impact on balance throughput -- it's *much* 
faster now!
From a quick glance at iostat I would guess it's at least a factor 100 faster.


Should quota support generally be disabled during balances? Or did I somehow 
push my fs into a weird state where it triggered a slow-path?



Thanks!

   j


Would you please provide the kernel version?

v4.9 introduced a bad fix for qgroup balance, which doesn't completely 
fix qgroup byte leaking, and also hugely slows down the balance process:


commit 62b99540a1d91e46422f0e04de50fc723812c421
Author: Qu Wenruo 
Date:   Mon Aug 15 10:36:51 2016 +0800

btrfs: relocation: Fix leaking qgroups numbers on data extents

Sorry for that.

And in v4.10, a better method is applied to fix the byte leaking 
problem, and it should be a little faster than the previous one.


commit 824d8dff8846533c9f1f9b1eabb0c03959e989ca
Author: Qu Wenruo 
Date:   Tue Oct 18 09:31:29 2016 +0800

btrfs: qgroup: Fix qgroup data leaking by using subtree tracing


However, using balance with qgroup is still slower than balance without 
qgroup; the root fix requires reworking the current backref iteration.


Thanks,
Qu









Re: Very slow balance / btrfs-transaction

2017-02-04 Thread Kai Krakow
On Sat, 04 Feb 2017 20:50:03 +,
"Jorg Bornschein"  wrote:

> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" 
> wrote:
> 
> > Yes, please check if disabling quotas makes a difference in
> > execution time of btrfs balance.  
> 
> Just FYI: With quotas disabled it took ~20h to finish the balance
> instead of the projected >30 days. Therefore, in my case, there was a
> speedup factor of ~35.
> 
> 
> and thanks for the quick reply! (and for btrfs in general!)
> 
> 
> BTW: I'm wondering how much sense it makes to activate the underlying
> bcache for my raid1 fs again. I guess btrfs chooses randomly (or
> based on predicted disk latency?) which copy of a given extent to
> load?

As far as I know, it uses PID modulo only currently, no round-robin,
no random value. There are no performance optimizations going into btrfs
yet because there're still a lot of ongoing feature implementations.
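
In rough terms, the raid1 read path in these kernels boils down to
something like this (a paraphrased sketch, not the verbatim kernel source):

/* Paraphrase of btrfs's RAID1 mirror choice: the reading task's PID,
 * modulo the number of copies, picks the device; latency, queue depth
 * and head position play no part. */
#include <stdio.h>

static int pick_mirror(int pid, int num_copies)
{
	return pid % num_copies;
}

int main(void)
{
	/* With two copies, even PIDs always hit copy 0, odd PIDs copy 1. */
	for (int pid = 4200; pid < 4204; pid++)
		printf("pid %d reads mirror %d\n", pid, pick_mirror(pid, 2));
	return 0;
}

So a given process sticks to one copy for its lifetime, which also bears on
the cache-effectiveness question below.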

I think there were patches to include a rotator value in the stripe
selection. They don't apply to the current kernel. I tried it once and
didn't see any subjective difference for normal desktop workloads. But
that's probably because I use RAID1 for metadata only.

MDRAID uses stripe selection based on latency and other measurements
(like head position). It would be nice if btrfs implemented similar
functionality. This would also be helpful for selecting a disk if
there're more disks than stripesets (for example, I have 3 disks in my
btrfs array). This could write new blocks to the most idle disk always.
I think this wasn't covered by the above mentioned patch. Currently,
selection is based only on the disk with most free space.

> I guess that would mean the effective cache size would only be
> half of the actual cache-set size (+- additional overhead)? Or does
> btrfs try a deterministically determined copy of each extent first? 

I'm currently using a 500GB bcache; it helps a lot during system start -
and probably also while using the system. I think that bcache
mostly caches metadata access, which should improve a lot of btrfs
performance issues. The downside of the RAID1 profile is that probably
every second access is a cache miss unless it has already been cached.
Thus, it's only half as effective as it could be.

I'm using write-back bcache caching, and RAID0 for data (I do daily
backups with borgbackup, I can easily recover broken files). So
writing with bcache is not such an issue for me. The cache is big
enough that double metadata writes are no problem.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Very slow balance / btrfs-transaction

2017-02-04 Thread Jorg Bornschein
February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:

> Yes, please check if disabling quotas makes a difference in execution
> time of btrfs balance.

Just FYI: With quotas disabled it took ~20h to finish the balance instead of 
the projected >30 days. Therefore, in my case, there was a speedup factor of 
~35.


and thanks for the quick reply! (and for btrfs in general!)


BTW: I'm wondering how much sense it makes to activate the underlying bcache 
for my raid1 fs again. I guess btrfs chooses randomly (or based on predicted 
disk latency?) which copy of a given extent to load? I guess that would mean 
the effective cache size would only be half of the actual cache-set size 
(+- additional overhead)? Or does btrfs try a deterministically determined copy 
of each extent first? 



   j


Re: Very slow balance / btrfs-transaction

2017-02-04 Thread Duncan
Lakshmipathi.G posted on Sat, 04 Feb 2017 08:25:04 +0530 as excerpted:

>>Should quota support generally be disabled during balances?
> 
> If this is true and quota impacts balance throughput, at least there should
> be an alert message like "Running balance with quota will affect
> performance" or similar before starting.

The problem isn't that, exactly, tho that's part of it.  The problem with 
quotas is that the feature itself isn't yet mature.  At least until very 
recently, and possibly still, quotas couldn't be depended upon to work 
correctly (various not entirely uncommon corner-cases would trigger 
negative numbers, etc), and even when they do work correctly, they simply 
don't scale well in combination with balance, check, etc -- that 10X 
difference isn't uncommon.

So my recommendation for quotas has been and remains, unless you're 
actively working with the devs on improving them, it's probably better to 
keep them disabled.  Either you actually need quota functionality or you 
don't.  If you do, it's better to use a mature filesystem where quotas 
are a mature feature that works dependably.  If you don't, just leave the 
feature off, as it continues to simply not be worth the troubles and 
scaling issues it triggers.

IOW, btrfs quotas might work and scale well some day, but that day isn't 
today, and it's not going to be tomorrow or next kernel cycle, either.  
It's going to take a while, and you'll be much happier with btrfs in the 
meantime if you don't have them enabled.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Very slow balance / btrfs-transaction

2017-02-03 Thread Lakshmipathi.G
>Should quota support generally be disabled during balances?

If this is true and quota impacts balance throughput, at least there
should be an alert message like "Running balance with quota will affect
performance" or similar before starting.


Cheers,
Lakshmipathi.G


Re: Very slow balance / btrfs-transaction

2017-02-03 Thread Jorg Bornschein
February 4, 2017 1:07 AM, "Goldwyn Rodrigues"  wrote:

> On 02/03/2017 06:30 PM, Jorg Bornschein wrote:
> 
>> February 3, 2017 11:26 PM, "Goldwyn Rodrigues"  wrote:
>> 
>> Hi,
>> 
>> I'm currently running a balance (without any filters) on a 4 drives raid1 
>> filesystem. The array
>> contains 3 3TB drives and one 6TB drive; I'm running the rebalance because 
>> the 6TB drive recently
>> replaced a 2TB drive.
>> 
>> I know that balance is not supposed to be a fast operation, but this one is 
>> now running for ~6 days
>> and it managed to balance ~18% (754 out of about 4250 chunks balanced (755 
>> considered), 82% left)
>> -- so I expect it to take another ~4 weeks.
>> 
>> That seems excessively slow for ~8TiB of data.
>> 
>> Is this expected behavior? In case it's not: Is there anything I can do to 
>> help debug it?
>>> Do you have quotas enabled?
>> 
>> I might have activated it when playing with "snapper" -- I remember using 
>> some quota command
>> without knowing what it does.
>> 
>> How can I check if it's active? Shall I just disable it with "btrfs quota 
>> disable"?
> 
> To check your quota limits:
> # btrfs qgroup show 
> 
> To disable
> # btrfs quota disable 
> 
> Yes, please check if disabling quotas makes a difference in execution
> time of btrfs balance.


Quota support was indeed active -- and it warned me that the qgroup data was 
inconsistent.

Disabling quotas had an immediate impact on balance throughput -- it's *much* 
faster now! 
From a quick glance at iostat I would guess it's at least a factor 100 faster.


Should quota support generally be disabled during balances? Or did I somehow 
push my fs into a weird state where it triggered a slow-path?



Thanks!   
   
   j


Re: Very slow balance / btrfs-transaction

2017-02-03 Thread Goldwyn Rodrigues


On 02/03/2017 06:30 PM, Jorg Bornschein wrote:
> February 3, 2017 11:26 PM, "Goldwyn Rodrigues"  wrote:
> 
>>> Hi,
>>>
>>> I'm currently running a balance (without any filters) on a 4 drives raid1 
>>> filesystem. The array
>>> contains 3 3TB drives and one 6TB drive; I'm running the rebalance because 
>>> the 6TB drive recently
>>> replaced a 2TB drive.
>>>
>>> I know that balance is not supposed to be a fast operation, but this one is 
>>> now running for ~6 days
>>> and it managed to balance ~18% (754 out of about 4250 chunks balanced (755 
>>> considered), 82% left)
>>> -- so I expect it to take another ~4 weeks.
>>>
>>> That seems excessively slow for ~8TiB of data.
>>>
>>> Is this expected behavior? In case it's not: Is there anything I can do to 
>>> help debug it?
>>
>> Do you have quotas enabled?
> 
> 
> I might have activated it when playing with "snapper" -- I remember using 
> some quota command without knowing what it does. 
> 
> How can I check if it's active? Shall I just disable it with "btrfs quota 
> disable"? 
> 

To check your quota limits:
# btrfs qgroup show 

To disable
# btrfs quota disable 

Yes, please check if disabling quotas makes a difference in execution
time of btrfs balance.

-- 
Goldwyn


Re: Very slow balance / btrfs-transaction

2017-02-03 Thread Jorg Bornschein
February 3, 2017 11:26 PM, "Goldwyn Rodrigues"  wrote:

>> Hi,
>> 
>> I'm currently running a balance (without any filters) on a 4 drives raid1 
>> filesystem. The array
>> contains 3 3TB drives and one 6TB drive; I'm running the rebalance because 
>> the 6TB drive recently
>> replaced a 2TB drive.
>> 
>> I know that balance is not supposed to be a fast operation, but this one is 
>> now running for ~6 days
>> and it managed to balance ~18% (754 out of about 4250 chunks balanced (755 
>> considered), 82% left)
>> -- so I expect it to take another ~4 weeks.
>> 
>> That seems excessively slow for ~8TiB of data.
>> 
>> Is this expected behavior? In case it's not: Is there anything I can do to 
>> help debug it?
> 
> Do you have quotas enabled?


I might have activated it when playing with "snapper" -- I remember using some 
quota command without knowing what it does. 

How can I check if it's active? Shall I just disable it with "btrfs quota disable"? 


   j


Re: Very slow balance / btrfs-transaction

2017-02-03 Thread Goldwyn Rodrigues


On 02/03/2017 04:13 PM, j...@capsec.org wrote:
> Hi, 
> 
> 
> I'm currently running a balance (without any filters) on a 4 drives raid1 
> filesystem. The array contains 3 3TB drives and one 6TB drive; I'm running 
> the rebalance because the 6TB drive recently replaced a 2TB drive. 
> 
> 
> I know that balance is not supposed to be a fast operation, but this one is 
> now running for ~6 days and it managed to balance ~18% (754 out of about 4250 
> chunks balanced (755 considered),  82% left) -- so I expect it to take 
> another ~4 weeks. 
> 
> That seems excessively slow for ~8TiB of data.
> 
> 
> Is this expected behavior? In case it's not: Is there anything I can do to 
> help debug it?

Do you have quotas enabled?

-- 
Goldwyn


Very slow balance / btrfs-transaction

2017-02-03 Thread jb
Hi, 


I'm currently running a balance (without any filters) on a 4 drives raid1 
filesystem. The array contains 3 3TB drives and one 6TB drive; I'm running the 
rebalance because the 6TB drive recently replaced a 2TB drive. 


I know that balance is not supposed to be a fast operation, but this one is now 
running for ~6 days and it managed to balance ~18% (754 out of about 4250 
chunks balanced (755 considered),  82% left) -- so I expect it to take another 
~4 weeks. 

That seems excessively slow for ~8TiB of data.


Is this expected behavior? In case it's not: Is there anything I can do to help 
debug it?


The 4 individual devices are bcache devices with no SSD cache partition 
currently attached; the bcache backing devices sit on top of LUKS-encrypted 
devices. Maybe a few words about the history of this fs: It used to be a 
1-drive btrfs on top of a bcache partition with a 30GiB SSD cache (actively 
used for >1 year). During the last month, I gradually added devices (always 
with active bcaches). At some point, after adding the 4th device, I 
deactivated (detached) the bcache caching device, instead activated raid1 for 
data and metadata, and ran a rebalance (which was reasonably fast -- I don't 
remember how fast exactly, but probably <24h). The final steps that led to the 
current situation: I activated "nossd" and replaced the smallest device with 
"btrfs dev replace" (which was also reasonably fast, <12h).

 

Best & thanks, 

   j


--
[joerg@dorsal ~]$ lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda               8:0    0 111.8G  0 disk
├─sda1            8:1    0     1G  0 part  /boot
└─sda2            8:2    0 110.8G  0 part
  └─crypted     254:0    0 110.8G  0 crypt
    ├─ssd-root  254:1    0  72.8G  0 lvm   /
    ├─ssd-swap  254:2    0     8G  0 lvm   [SWAP]
    └─ssd-cache 254:3    0    30G  0 lvm
sdb               8:16   0   2.7T  0 disk
└─sdb1            8:17   0   2.7T  0 part
  └─crypted-sdb 254:7    0   2.7T  0 crypt
    └─bcache2   253:2    0   2.7T  0 disk
sdc               8:32   0   2.7T  0 disk
└─sdc1            8:33   0   2.7T  0 part
  └─crypted-sdc 254:4    0   2.7T  0 crypt
    └─bcache1   253:1    0   2.7T  0 disk
sdd               8:48   0   2.7T  0 disk
└─sdd1            8:49   0   2.7T  0 part
  └─crypted-sdd 254:6    0   2.7T  0 crypt
    └─bcache0   253:0    0   2.7T  0 disk
sde               8:64   0   5.5T  0 disk
└─sde1            8:65   0   5.5T  0 part
  └─crypted-sde 254:5    0   5.5T  0 crypt
    └─bcache3   253:3    0   5.5T  0 disk  /storage
--

[joerg@dorsal ~]$ sudo btrfs fi usage -h /storage/
Overall:
    Device size:          13.64TiB
    Device allocated:      8.35TiB
    Device unallocated:    5.29TiB
    Device missing:          0.00B
    Used:                  8.34TiB
    Free (estimated):      2.65TiB  (min: 2.65TiB)
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB  (used: 15.77MiB)

Data,RAID1: Size:4.17TiB, Used:4.16TiB
   /dev/bcache0    2.38TiB
   /dev/bcache1    2.37TiB
   /dev/bcache2    2.38TiB
   /dev/bcache3    1.20TiB

Metadata,RAID1: Size:9.00GiB, Used:7.49GiB
   /dev/bcache1    8.00GiB
   /dev/bcache2    1.00GiB
   /dev/bcache3    9.00GiB

System,RAID1: Size:32.00MiB, Used:624.00KiB
   /dev/bcache1   32.00MiB
   /dev/bcache3   32.00MiB

Unallocated:
   /dev/bcache0  355.52GiB
   /dev/bcache1  356.49GiB
   /dev/bcache2  355.52GiB
   /dev/bcache3    4.25TiB
  
--
[joerg@dorsal ~]$ ps -xal | grep btrfs
1 0   227 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-worker]
1 0   229 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-worker-hi]
1 0   230 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-delalloc]
1 0   231 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-flush_del]
1 0   232 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-cache]
1 0   233 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-submit]
1 0   234 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-fixup]
1 0   235 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-endio]
1 0   236 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-endio-met]
1 0   237 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-endio-met]
1 0   238 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-endio-rai]
1 0   239 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-endio-rep]
1 0   240 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-rmw]
1 0   241 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-endio-wri]
1 0   242 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-freespace]
1 0   243 2   0 -20  0 0 -  S<   ?  0:00 [btrfs-delayed-m]
1 0   244 2   0 -20  0