btrfs-cleaner / snapshot performance analysis

2018-02-09 Thread Ellis H. Wilson III

Hi all,

I am trying to better understand how the cleaner kthread (btrfs-cleaner) 
impacts foreground performance, specifically during snapshot deletion. 
My experience so far has been that it can be dramatically disruptive to 
foreground I/O.


Looking through the wiki at kernel.org I have not yet stumbled onto any 
analysis that would shed light on this specific problem.  I have found 
numerous complaints about btrfs-cleaner online, especially relating to 
quotas being enabled.  This has proven thus far less than helpful, as 
the response tends to be "use less snapshots," or "disable quotas," both 
of which strike me as intellectually unsatisfying answers, especially 
the former in a filesystem where snapshots are supposed to be 
"first-class citizens."


The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot 
performance analysis I would expect to see given the assertions in the 
latter that "BTRFS...supports efficient snapshots..."  The former is 
sufficiently pre-BTRFS that while it does performance analysis of btree 
clones, it's unclear (to me at least) if the results can be 
forward-propagated in some way to real-world performance expectations 
for BTRFS snapshot creation/deletion/modification.


Has this analysis been performed somewhere else and I'm just missing it? 
 Also, I'll be glad to comment on my specific setup, kernel version, 
etc, and discuss pragmatic work-arounds, but I'd like to better 
understand the high-level performance implications first.


Thanks in advance to anyone who can comment on this.  I am very inclined 
to read anything thrown at me, so if there is documentation I failed to 
read, please just send the link.


Best,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-10 Thread Ellis H. Wilson III
Thank you very much for your response Hans.  Comments in-line, but I did 
want to handle one miscommunication straight-away:


I'm a huge fan of BTRFS.  If I came off like I was complaining, my 
sincere apologies.   To be completely transparent we are using BTRFS in 
a very large project at my company, which I am lead architect on, and 
while I have read the academic papers, perused a subset of the source 
code, and been following its development in the background, I now need 
to deeply understand where there might be performance hiccups.  All of 
our foreground I/O testing with BTRFS in RAID0/RAID1/single across 
different SSDs and HDDs has been stellar, but we haven't dug too far 
into snapshot performance, balancing, and other more background-oriented 
performance.  Hence my interest in finding documentation and analysis I 
can read and grok myself on the implications of snapshot operations on 
foreground I/O if such exists.  More in-line below:


On 02/09/2018 03:36 PM, Hans van Kranenburg wrote:

This has proven thus far less than helpful, as
the response tends to be "use less snapshots," or "disable quotas," both
of which strike me as intellectually unsatisfying answers


Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
didn't even realize I had those, and now the problem is gone."


I meant less than helpful for me, since for my project I need detailed 
and fairly accurate capacity information per sub-volume, and the 
relationship between qgroups and subvolume performance wasn't being 
spelled out in the responses.  Please correct me if I am wrong about 
needing qgroups enabled to see detailed capacity information 
per-subvolume (including snapshots).
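
To be concrete, this is roughly the accounting I have in mind -- a minimal 
sketch, with a hypothetical mount point, of what I understand the qgroup 
route to look like:

$ sudo btrfs quota enable /mnt/btrfs
$ sudo btrfs subvolume list /mnt/btrfs   # map qgroup/subvolume IDs to paths
$ sudo btrfs qgroup show /mnt/btrfs      # per-subvolume referenced/exclusive bytes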



the former in a filesystem where snapshots are supposed to be
"first-class citizens."


Throwing complaints around is also not helpful.


Sorry about this.  It wasn't directed in any way at BTRFS developers, 
but rather some of the suggestions for solution proposed in random 
forums online.  As mentioned I'm a fan of BTRFS, especially as my 
project requires the snapshots to truly be first-class citizens in that 
they are writable and one can roll back to them at will, unlike in ZFS 
and other filesystems.  I was just saying it seemed backwards to suggest 
having fewer snapshots as a solution in a filesystem where the 
architecture appears to treat them as a core part of the design.



The "performance implications" are highly dependent on your specific
setup, kernel version, etc, so it really makes sense to share:

* kernel version
* mount options (from /proc/mounts|grep btrfs)
* is it ssd? hdd? iscsi lun?
* how big is the FS
* how many subvolumes/snapshots? (how many snapshots per subvolume)


I will answer the above, but I would like to reiterate my previous comment: 
I still want to understand the fundamental relationships here, because in 
my project the kernel version is very likely to change (to something more 
recent), along with mount options and underlying device media.  Once 
this project hits the field I will additionally have limited control 
over how large the FS gets (until physical media space is exhausted of 
course) or how many subvolumes/snapshots there are.  If I know that 
above N snapshots per subvolume performance tanks by M%, I can apply 
limits on the use-case in the field, but I am not aware of those kinds 
of performance implications yet.


My present situation is the following:
- Fairly default openSUSE 42.3.
- uname -a: Linux betty 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 
2018 (7db1912) x86_64 x86_64 x86_64 GNU/Linux
- /dev/sda6 / btrfs 
rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot 0 0
(I have about 10 other btrfs subvolumes, but this is the only one being 
snapshotted)
- At the time of my noticing the slow-down, I had about 24 snapshots, 10 
of which were in the process of being deleted

- Usage output:
~> sudo btrfs filesystem usage /
Overall:
Device size:  40.00GiB
Device allocated: 11.54GiB
Device unallocated:   28.46GiB
Device missing:  0.00B
Used:  7.57GiB
Free (estimated): 32.28GiB  (min: 32.28GiB)
Data ratio:   1.00
Metadata ratio:   1.00
Global reserve:   28.44MiB  (used: 0.00B)
Data,single: Size:11.01GiB, Used:7.19GiB
   /dev/sda6  11.01GiB
Metadata,single: Size:512.00MiB, Used:395.91MiB
   /dev/sda6 512.00MiB
System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda6  32.00MiB
Unallocated:
   /dev/sda6  28.46GiB
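
As an aside, to check which snapshots were still queued for cleanup by 
btrfs-cleaner I used the following -- a small sketch, assuming the -d flag 
behaves as the man page describes:

$ sudo btrfs subvolume list -d /    # deleted subvolumes not yet cleaned up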


And what's essential to look at is what your computer is doing while you
are throwing a list of subvolumes into the cleaner.

* is it using 100% cpu?
* is it showing 100% disk read I/O utilization?
* is it showing 100% disk write I/O utilization? (is it writing lots and
lots of data to disk?)


I noticed the problem when Thunderbird became completely unresponsive

Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Ellis H. Wilson III

Thanks Tomasz,

Comments in-line:

On 02/10/2018 05:05 PM, Tomasz Pala wrote:

You won't have anything close to "accurate" in btrfs - quotas don't
include space wasted by fragmentation, which happens to allocate from tens
to thousands of times (sic!) more space than the files themselves.
Not in some worst-case scenarios, but in real life situations...
I got a 10 MB db-file which was eating 10 GB of space after a week of
regular updates - withOUT snapshotting it. All described here.


The underlying filesystem this is replacing was an in-house developed 
COW filesystem, so we're aware of the difficulties of fragmentation. 
I'm more interested in an approximate space consumed across snapshots 
when considering CoW.  I realize it will be approximate.  Approximate is 
ok for us -- having no accounting at all for snapshot space consumed is not.


Also, I don't see the thread you mentioned.  Perhaps you forgot to 
mention it, or an html link didn't come through properly?



course) or how many subvolumes/snapshots there are.  If I know that
above N snapshots per subvolume performance tanks by M%, I can apply
limits on the use-case in the field, but I am not aware of those kinds
of performance implications yet.


This doesn't work like this. It all depends on the data that are the
subject of snapshots, especially how they are updated -- how exactly,
including write patterns.

I think you expect answers that can't be formulated - with an fs
architecture as advanced as ZFS or btrfs, its behavior can't be reduced
to simple answers like 'keep less than N snapshots'.


I was using an extremely simple heuristic to drive at what I was looking 
to get out of this.  I should have been more explicit that the example 
was not to be taken literally.



This is the exception with an easy answer: btrfs doesn't handle databases
with CoW. Period. It doesn't matter whether they are snapshotted or not;
ANY database files (systemd-journal, PostgreSQL, sqlite, db) are not
handled well at all. They slow down the entire system to the speed of a
cheap SD card.


I will keep this in mind, thank you.  We do have a higher level above 
BTRFS that stages data.  I will consider implementing an algorithm to 
set the nocow attribute on a file once it has seen enough rewrites to 
indicate it will be a bad fit for BTRFS's CoW behavior.
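
As a rough sketch of what I mean -- file paths here are hypothetical, and 
my understanding is that the NOCOW attribute only takes effect on files 
that are empty when it is set, so a hot file would have to be rewritten 
into a fresh nocow copy:

$ touch /mnt/stage/app.db.nocow            # create an empty file first
$ chattr +C /mnt/stage/app.db.nocow        # mark it NOCOW while still empty
$ cp --reflink=never /mnt/stage/app.db /mnt/stage/app.db.nocow
$ mv /mnt/stage/app.db.nocow /mnt/stage/app.db
$ lsattr /mnt/stage/app.db                 # should now show the 'C' attribute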



Actually, if you do not use compression and don't need checksums of data
blocks, you may want to mount all the btrfs with nocow by default.
This way the quotas would be more accurate (no fragmentation _between_
snapshots) and you'll have some decent performance with snapshots.
If that is all you care about.


CoW is still valuable for us as we're shooting to support on the order 
of hundreds of snapshots per subvolume, and without it (if BTRFS COW 
works the same as our old COW FS) that's going to be quite expensive to 
keep snapshots around.  So some hybrid solution is required here.


Best,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Ellis H. Wilson III
Thanks Hans.  Sorry for the top-post, but I'm boiling things down here 
so I don't have a clear line-item to respond to.  The take-aways I see 
here to my original queries are:


1. Nobody has done a thorough analysis of the impact of snapshot 
manipulation WITHOUT qgroups enabled on foreground I/O performance
2. Nobody has done a thorough analysis of the impact of snapshot 
manipulation WITH qgroups enabled on foreground I/O performance
3. I need to look at the code to understand the interplay between 
qgroups, snapshots, and foreground I/O performance as there isn't 
existing architecture documentation to point me to that covers this
4. I should be cautioned that CoW in BTRFS can exhibit pathological (if 
expected) capacity consumption for very random-write-oriented datasets 
with or without snapshots, and nocow (or in my case transparently 
absorbing and coalescing writes at a higher tier) is my friend.
5. I should be cautioned that CoW is broken across snapshots when 
defragmentation is run.


I will update a test system to the most recent kernel and will perform 
tests to answer #1 and #2.  I will plan to share it when I'm done.  If I 
have time to write-up my findings for #3 I will similarly share that.


Thanks to all for your input on this issue.

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/11/2018 01:03 PM, Hans van Kranenburg wrote:

3. I need to look at the code to understand the interplay between
qgroups, snapshots, and foreground I/O performance as there isn't
existing architecture documentation to point me to that covers this


Well, the excellent write-up of Qu this morning shows some explanation
from the design point of view.


Sorry, I may have missed this email.  Or perhaps you are referring to a 
wiki or blog post of some kind I'm not following actively?  Either way, 
if you can forward me the link, I'd greatly appreciate it.



nocow only keeps the cows on a distance as long as you don't start
snapshotting (or cp --reflink) those files... If you take a snapshot,
then you force btrfs to keep the data around that is referenced by the
snapshot. So, that means that every next write will be cowed once again,
moo, so small writes will be redirected to a new location, causing
fragmentation again. The second and third write can go in the same (new)
location of the first new write, but as soon as you snapshot again, this
happens again.


Ah, very interesting.  Thank you for clarifying!

Best,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:

Why not just use `btrfs fi du <subvolume> <snapshots...>` now and then and
update your administration with the results? .. Instead of putting the
burden of keeping track of all administration during every tiny change
all day long?


I will look into that if using the built-in qgroup capacity functionality 
proves to be truly untenable.  Thanks!
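
For reference, a quick sketch of what I understand you to be suggesting 
(mount point and subvolume names are hypothetical):

$ sudo btrfs filesystem du -s /mnt/btrfs/home /mnt/btrfs/.snapshots/*
# -s prints one summary line per argument: total, exclusive, and set-shared bytes

We would run that periodically and fold the results into our own accounting.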



CoW is still valuable for us as we're shooting to support on the order
of hundreds of snapshots per subvolume,


Hundreds will get you into trouble even without qgroups.


I should have been more specific.  We are looking to use up to a few 
dozen snapshots per subvolume, but will have many (tens to hundreds of) 
discrete subvolumes (each with up to a few dozen snapshots) in a BTRFS 
filesystem.  If I have it wrong and the scalability issues in BTRFS do 
not solely apply to subvolumes and their snapshot counts, please let me 
know.


I will note you focused on my tiny desktop filesystem when making some 
of your previous comments -- this is why I didn't want to share specific 
details.  Our filesystem will be RAID0 with six large HDDs (12TB each). 
Reliability concerns do not apply to our situation for technical 
reasons, but if there are capacity scaling issues with BTRFS I should be 
made aware of, I'd be glad to hear them.  I have not seen any such limit 
in the technical documentation, and experiments so far on 6x6TB 
arrays have not shown any performance problems, so I'm inclined to 
believe the only scaling issue exists with reflinks.  Correct me if I'm 
wrong.


Thanks,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/12/2018 11:02 AM, Austin S. Hemmelgarn wrote:
I will look into that if using built-in group capacity functionality 
proves to be truly untenable.  Thanks!
As a general rule, unless you really need to actively prevent a 
subvolume from exceeding its quota, this will generally be more 
reliable and have much less performance impact than using qgroups.


Ok ok :).  I will plan to go this route, but since I'll want to 
benchmark it either way, I'll include qgroups enabled in the benchmark 
and will report back.


With qgroups involved, I really can't say for certain, as I've never 
done much with them myself, but based on my understanding of how it all 
works, I would expect multiple subvolumes with a small number of 
snapshots each to not have as many performance issues as a single 
subvolume with the same total number of snapshots.


Glad to hear that.  That was my expectation as well.

BTRFS in general works fine at that scale, dependent of course on the 
level of concurrent access you need to support.  Each tree update needs 
to lock a bunch of things in the tree itself, and having large numbers 
of clients writing to the same set of files concurrently can cause lock 
contention issues because of this, especially if all of them are calling 
fsync() or fdatasync() regularly.  These issues can be mitigated by 
segregating workloads into their own subvolumes (each subvolume is a 
mostly independent filesystem tree), but it sounds like you're already 
doing that, so I don't think that would be an issue for you.

Hmm...I'll think harder about this.  There is potential for us to 
artificially divide access to files across subvolumes automatically 
because of the way we are using BTRFS as a backing store for our 
parallel file system.  So far even with around 1000 threads across about 
10 machines accessing BTRFS via our parallel filesystem over the wire 
we've not seen issues, but if we do I have some ways out I've not 
explored yet.  Thanks!
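
If we do go down that path, the segregation itself is cheap to set up; 
something like one subvolume per workload bucket (names hypothetical):

$ sudo btrfs subvolume create /mnt/pool/bucket-00
$ sudo btrfs subvolume create /mnt/pool/bucket-01
# each subvolume has its own file tree, so fsync-heavy clients in
# different buckets should contend less on tree locks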


Now, there are some other odd theoretical cases that may cause issues 
when dealing with really big filesystems, but they're either really 
specific edge cases (for example, starting with a really small 
filesystem and gradually scaling it up in size as it gets full) or 
happen at scales far larger than what you're talking about (on the order 
of at least double digit petabyte scale).


Yea, our use case will be in the tens of TB to hundreds of TB for the 
foreseeable future, so I'm glad to hear this is relatively standard. 
That was my read of the situation as well.


Thanks!

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/12/2018 12:09 PM, Hans van Kranenburg wrote:

You are in the To: of it:

https://www.spinics.net/lists/linux-btrfs/msg74737.html


Apparently MS365 decided that my disabling of the junk/clutter filter rules a 
year+ ago wasn't wise and re-enabled them.  I wondered why I wasn't seeing 
my own messages back from the list.  Qu's email, along with all of my 
responses, was in spam.  Go figure, MS marking kernel.org mail as spam...


This is exactly what I was looking for, and indeed it is a fantastic 
write-up I'll need to read over a few times to really have it soak in. 
Thank you very much Qu!


Best,

ellis


Status of FST and mount times

2018-02-14 Thread Ellis H. Wilson III

Hi again -- back with a few more questions:

Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No 
compression.  No quotas enabled.  Many (potentially tens to hundreds) of 
subvolumes, each with tens of snapshots.  No control over size or number 
of files, but directory tree (entries per dir and general tree depth) 
can be controlled in case that's helpful.


1. I've been reading up about the space cache, and it appears there is a 
v2 of it called the free space tree that is much friendlier to large 
filesystems such as the one I am designing for.  It is listed as OK/OK 
on the wiki status page, but there is a note that btrfs-progs treats it 
as read-only (my biggest concern being that btrfs check --repair cannot 
help without a full free space cache rebuild), and the last status update 
on this I can find was circa fall 2016.  Can anybody give me an updated 
status on this feature?  From what I read, v1 and tens-of-TB filesystems 
will not play well together, so I'm inclined to dig into this.


2. There's another thread on-going about mount delays.  I've been 
completely blind to this specific problem until it caught my eye.  Does 
anyone have ballpark estimates for how long very large HDD-based 
filesystems will take to mount?  Yes, I know it will depend on the 
dataset.  I'm looking for O() worst-case approximations for 
enterprise-grade large drives (12/14TB), as I expect it should scale 
with multiple drives so approximating for a single drive should be good 
enough.


3. Do long mount delays relate to space_cache v1 vs v2 (I would guess 
no, unless it needed to be regenerated)?


Note that I'm not sensitive to multi-second mount delays.  I am 
sensitive to multi-minute mount delays, hence why I'm bringing this up.


FWIW: I am currently populating a machine we have with 6TB drives in it 
with real-world home dir data to see if I can replicate the mount issue.


Thanks,

ellis


Re: Status of FST and mount times

2018-02-14 Thread Ellis H. Wilson III

On 02/14/2018 12:08 PM, Nikolay Borisov wrote:

V1 for large filesystems is just awful. Facebook have been experiencing
the pain, hence they implemented v2. You can view the space cache tree as
the complement of the extent tree. The v1 cache is implemented as a
hidden inode, and even though writes (aka flushing of the free space
cache) are metadata, they are essentially treated as data. This could
potentially lead to priority inversions if the cgroups io controller is
involved.

Furthermore, there is at least 1 known deadlock problem in free space
cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
really the way to go.


Fantastic.  Thanks for the backstory.  That is what I will plan to use 
then.  I've been operating with whatever the default is (I presume v1 
based on the man page), but haven't yet populated any of our machines 
sufficiently enough to notice performance degradation due to space cache 
problems.
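
For my own notes, my understanding (to be verified against btrfs(5)) is 
that the switch is a one-time mount-option change, roughly:

$ sudo umount /mnt/btrfs
$ sudo mount -o clear_cache,space_cache=v2 /dev/sdb /mnt/btrfs
# the first mount with space_cache=v2 builds the free space tree and marks
# the filesystem with a read-only-compat feature; clear_cache drops the
# old v1 cache data

The device and mount point are just my test box; correct me if I have the 
mechanics wrong.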



No, the long mount times seem to be due to the fact that in order for a
btrfs filesystem to mount it needs to enumerate its block group items,
and those are stored in the extent tree, which also holds all of the
information pertaining to allocated extents. So mixing those
data structures in the same tree, plus the fact that block groups are
iterated linearly during mount (check btrfs_read_block_groups), means on
spinning rust with shitty seek times this can take a while.

However, this will really depend on the number of extents you have, and
having taken a look at the thread you referred to, there seems to be no
clear-cut reason why mounting is taking so long on that particular
occasion.


Ok; thanks.  To phrase it somewhat more simply, should I expect for 
"normal" datasets (think home directory) that happen to be part of a 
very large BTRFS filesystem (tens of TBs) to take more than 60s to 
mount?  Let's presume there isn't extreme fragmentation or any media 
errors to keep things simple.


Best,

ellis


Re: Status of FST and mount times

2018-02-15 Thread Ellis H. Wilson III

On 02/14/2018 06:24 PM, Duncan wrote:

Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
compression.  No quotas enabled.  Many (potentially tens to hundreds) of
subvolumes, each with tens of snapshots.  No control over size or number
of files, but directory tree (entries per dir and general tree depth)
can be controlled in case that's helpful).


??  How can you control both breadth (entries per dir) AND depth of
directory tree without ultimately limiting your number of files?


I technically misspoke when I said "No control over size or number of 
files."  There is an upper-limit to the metadata (not BTRFS, for our 
filesystem) we can store on an accompanying SSD, which limits the number 
of files that ultimately can live on our BTRFS RAID0'd HDDs.  The 
current design is tuned to perform well up to that maximum, but it's a 
relatively shallow tree, so if there were known performance issues with 
more than N files per directory or beyond a specific depth of 
directories I was calling out that I can change the algorithm now.



Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535
limit on directory hard links before additional ones are out-of-lined
into a secondary node, with the entailing performance implications.


Here I interpret "directory hard links" to mean hard links within a 
single directory -- not real directory hard links as in Macs.  It's moot 
anyhow, as we support hard links at a much higher level in our parallel 
file system and no hard-links will exist whatsoever from BTRFS's 
perspective.



So far, so good.  But then above you mention concern about btrfs-progs
treating the free-space-tree (free-space-cache-v2) as read-only, and the
time cost of having to clear and rebuild it after a btrfs check --repair.

Which is what triggered the mismatch warning I mentioned above.  Either
that raid0 data is of throw-away value appropriate to placement on a
raid0, and btrfs check --repair is of little concern as the benefits are
questionable (no guarantees it'll work and the data is either directly
throw-away value anyway, or there's a backup at hand that /does/ have a
tested guarantee of viability, or it's not worthy of being called a
backup in the first place), or it's not.


I think you may be looking at this a touch too black and white, but 
that's probably because I've not been clear about my use-case.  We do 
have mechanisms at a higher level in our parallel file system to do 
scale-out object-based RAID, so in a way the data is "throw-away" in 
that we can lose it without true data loss.  However, one should not 
underestimate the foreground impact of a reconstruction of 60-80TB of 
data, even with architectures like ours that scale reconstruction well. 
When I lose an HDD I fully expect we will need to rebuild that entire 
BTRFS filesystem, and we can.  But I'd like to limit it to real media 
failure.  In other words, if I can't mount my BTRFS filesystem after 
power-fail, and I can't run btrfs check --repair, then in essence I've 
lost a lot of data I need to rebuild for no "good" reason.


Perhaps more critically, when an entire cluster of these systems 
power-fail, if more than N of these running BTRFS come up and require 
check --repair prior to mount due to some commonly triggered BTRFS bug 
(not saying there is one, I'm just conservative), I'm completely hosed. 
Restoring PB's of data from backup is a non-starter.


In short, I've been playing coy about the details of my project and need 
to continue to do so for at least the next 4-6 months, but if you read 
anything about the company I'm emailing from, you can probably make 
reasonable guesses about what I'm trying to do.



It's also worth mentioning that btrfs raid0 mode, as well as single mode,
hobbles the btrfs data and metadata integrity feature, because while
checksums can and are still generated, stored and checked by default, and
integrity problems can still be detected, because raid0 (and single)
includes no redundancy, there's no second copy (raid1/10) or parity
redundancy (raid5/6) to rebuild the bad data from, so it's simply gone.


I'm ok with that.  We have a concept called "on-demand reconstruction" 
which permits us to rebuild individual objects in our filesystem 
on-demand (one component of which will be a failed file on one of the 
BTRFS filesystems).  So long as I can identify that a file has been 
corrupted I'm fine.



12-14 TB individual drives?

While you /did/ say enterprise grade so this probably doesn't apply to
you, it might apply to others that will read this.

Be careful that you're not trying to use the "archive application"
targeted SMR drives for general purpose use.


We're using traditional PMR drives for now.  That's available at 12/14TB 
capacity points presently.  I agree with your general sense that SMR 
drives are unlikely to play particularly well with BTRFS for all but the 
truly archival use-case.


Best,

ellis

Re: Status of FST and mount times

2018-02-15 Thread Ellis H. Wilson III

On 02/15/2018 06:12 AM, Hans van Kranenburg wrote:

On 02/15/2018 02:42 AM, Qu Wenruo wrote:

Just as said by Nikolay, the biggest problem of slow mount is the size
of extent tree (and HDD seek time)

The easiest way to get a basic idea of how large your extent tree is
using debug tree:

# btrfs-debug-tree -r -t extent <device>

You would get something like:
btrfs-progs v4.15
extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0  <<<
total bytes 10737418240
bytes used 393216
uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0

That level would give you some basic idea of the size of your extent
tree.

For level 0, it could contain about 400 items on average.
For level 1, it could contain up to 197K items.
...
For level n, it could contain up to 400 * 493 ^ (n - 1) items.
( n <= 7 )


Another one to get that data:

https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py

Example, with amount of leaves on level 0 and nodes higher up:

-# ./show_metadata_tree_sizes.py /
ROOT_TREE 336.00KiB 0(20) 1( 1)
EXTENT_TREE   123.52MiB 0(  7876) 1(28) 2( 1)
CHUNK_TREE112.00KiB 0( 6) 1( 1)
DEV_TREE   80.00KiB 0( 4) 1( 1)
FS_TREE  1016.34MiB 0( 64113) 1(   881) 2(52)
CSUM_TREE 777.42MiB 0( 49571) 1(   183) 2( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   336.00KiB 0(20) 1( 1)
DATA_RELOC_TREE16.00KiB 0( 1)


Very helpful information.  Thank you Qu and Hans!

I have about 1.7TB of newly rsync'd homedir data on a single 
enterprise 7200rpm HDD, and the following output from btrfs-debug-tree:


extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
total bytes 6001175126016
bytes used 1832557875200

Hans' (very cool) tool reports:
ROOT_TREE 624.00KiB 0(38) 1( 1)
EXTENT_TREE   327.31MiB 0( 20881) 1(66) 2( 1)
CHUNK_TREE208.00KiB 0(12) 1( 1)
DEV_TREE  144.00KiB 0( 8) 1( 1)
FS_TREE 5.75GiB 0(375589) 1(   952) 2( 2) 3( 1)
CSUM_TREE   1.75GiB 0(114274) 1(   385) 2( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   0.00B
DATA_RELOC_TREE16.00KiB 0( 1)

Mean mount times across 5 tests: 4.319s (stddev=0.079s)
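
For transparency, the timing method is nothing fancier than the following, 
with caches dropped so each mount is a cold-cache run:

$ for i in $(seq 5); do
>   sudo umount /mnt/btrfs
>   sync; echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
>   /usr/bin/time -f '%e' sudo mount /dev/sdb /mnt/btrfs
> done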

Taking 100 snapshots (no changes between snapshots however) of the above 
subvolume doesn't appear to impact mount/umount time.  Snapshot creation 
and deletion both operate at between 0.25s to 0.5s.  I am very impressed 
with snapshot deletion in particular now that qgroups is disabled.


I will do more mount testing with twice and three times that dataset and 
see how mount times scale.


All done on 4.5.5.  I really need to move to a newer kernel.

Best,

ellis


Re: Status of FST and mount times

2018-02-15 Thread Ellis H. Wilson III

On 02/15/2018 01:14 AM, Chris Murphy wrote:

On Wed, Feb 14, 2018 at 9:00 AM, Ellis H. Wilson III  wrote:


Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No compression.
No quotas enabled.  Many (potentially tens to hundreds) of subvolumes, each
with tens of snapshots.


Even if non-catastrophic to lose such a file system, it's big enough
to be tedious and take time to set it up again. I think it's worth
considering one of two things as alternatives:

a. metadata raid1, data single: you lose the striping performance of
raid0, and if it's not randomly filled you'll end up with some disk
contention for reads and writes *but* if you lose a drive you will not
lose the file system. Any missing files on the dead drive will result
in EIO (and I think also a kernel message with path to file), and so
you could just run a script to delete those files and replace them
with backup copies.


This option is on our roadmap for future releases of our parallel file 
system, but unfortunately we do not presently have the time to implement 
the functionality to report from the manager of that btrfs filesystem to 
the pfs manager that said files have gone missing.  We will absolutely 
be revisiting that as an option in early 2019, as replacing just one 
disk instead of N is highly attractive.  Waiting for EIO as you suggest 
in b is a non-starter for us, as we're working at scales sufficiently 
large that we don't want to wait for someone to stumble over a partially 
degraded file.  Pro-active reporting is what's needed, and we'll 
implement that Real Soon Now.



b. Variation on the above would be to put it behind glusterfs
replicated volume. Gluster getting EIO from a brick should cause it to
get a copy from another brick and then fix up the bad one
automatically. Or in your raid0 case, the whole volume is lost, and
glusterfs helps do the full rebuild over 3-7 days while you're still
able to access those 70TB of data normally. Of course, this option
requires having two 70TB storage bricks available.


See my email address, which may help understand why GlusterFS is a 
non-starter.  Nevertheless, the idea is a fine one and we'll have 
something similar going on, but at higher raid levels and across 
typically a dozen or more of such bricks.


Best,

ellis


Re: Status of FST and mount times

2018-02-15 Thread Ellis H. Wilson III

On 02/15/2018 11:51 AM, Austin S. Hemmelgarn wrote:
There are scaling performance issues with directory listings on BTRFS 
for directories with more than a few thousand files, but they're not 
well documented (most people don't hit them because most applications 
are designed around the expectation that directory listings will be slow 
in big directories), and I would not expect them to be much of an issue 
unless you're dealing with tens of thousands of files and particularly 
slow storage.


Understood -- thanks.  The plan is to keep it to around 1k entries per 
directory.  We've done some fairly concrete testing here to find the 
fall-off point for dirent caching in BTRFS, and the sweet-spot between 
having a large number of small directories cached vs. a few massive 
directories cached.  ~1k seems most palatable for our use-case and 
directory tree structure.


I've only ever lost a BTRFS volume to a power failure _once_ in the 
multiple years I've been using it, and that ended up being because the 
power failure trashed the storage device pretty severely (it was 
super-cheap flash storage).  I do know however that there are people who 
have had much worse results than me.


Good to know.  We'll be running power-fail testing over the next couple 
months.  I'm waiting for some hardware to arrive presently.  We'll 
power-cycle fairly large filesystems a few thousand times before we deem 
it safe to ship.  If there are latent bugs in BTRFS still w.r.t. 
power-fail, I can guarantee we'll trip over them...


It's not exactly a 'general sense' or a hunch, issues with BTRFS on SMR 
drives have been pretty well demonstrated in practice, hence Duncan 
making this statement despite the fact that it most likely did not apply 
to you.


Ah, ok, thanks for clarifying.  I appreciate the forewarning regardless.

Best,

ellis


Metadata / Data on Heterogeneous Media

2018-02-15 Thread Ellis H. Wilson III
In discussing the performance of various metadata operations over the 
past few days I've had this idea in the back of my head, and wanted to 
see if anybody had already thought about it before (likely, I would guess).


It appears based on this page:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design
that data and metadata in BTRFS are fairly well isolated from one 
another, particularly in the case of large files.  This appears 
reinforced by a recent comment from Qu ("...btrfs strictly split 
metadata and data usage...").

Yet, while there are plenty of options to RAID0/1/10/etc across 
generally homogeneous media types, there doesn't appear to be any 
functionality (at least that I can find) to segment different BTRFS 
internals to different types of devices.  E.G., place metadata trees and 
extent block groups on SSD, and data trees and extent block groups on 
HDD(s).


Is this something that has already been considered (and if so, 
implemented, which would make me extremely happy)?  Is it feasible but 
simply hasn't been approached yet?  I admit my internal knowledge of BTRFS is 
fleeting, though I'm trying to work on that daily at this time, so 
forgive me if this is unapproachable for obvious architectural reasons.


Best,

ellis


Re: Metadata / Data on Heterogeneous Media

2018-02-15 Thread Ellis H. Wilson III

On 02/15/2018 02:06 PM, Adam Borowski wrote:

On Thu, Feb 15, 2018 at 12:15:49PM -0500, Ellis H. Wilson III wrote:

In discussing the performance of various metadata operations over the past
few days I've had this idea in the back of my head, and wanted to see if
anybody had already thought about it before (likely, I would guess).

It appears based on this page:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design
that data and metadata in BTRFS are fairly well isolated from one another,
particularly in the case of large files.  This appears reinforced by a
recent comment from Qu ("...btrfs strictly
split metadata and data usage...").

Yet, while there are plenty of options to RAID0/1/10/etc across generally
homogeneous media types, there doesn't appear to be any functionality (at
least that I can find) to segment different BTRFS internals to different
types of devices.  E.G., place metadata trees and extent block groups on
SSD, and data trees and extent block groups on HDD(s).

Is this something that has already been considered (and if so, implemented,
which would make me extremely happy)?  Is it feasible it is hasn't been
approached yet?  I admit my internal knowledge of BTRFS is fleeting, though
I'm trying to work on that daily at this time, so forgive me if this is
unapproachable for obvious architectural reasons.


Considered: many times.  It's an obvious improvement, and one that shouldn't
even be that hard to implement.  What remains, it's SMoC then SMoR (Simple
Matter of Coding then Simple Matter of Review), but both of those are in
short supply.


Glad to hear it's been discussed, and I understand the issue of 
resources all too well with the project I'm working on.  Maybe if my 
nights and weekends open up...



After the maximum size of inline extents has been lowered, there's no real
point in putting different types of metadata or not-really-metadata on
different media: thus, existing split of data -vs- metadata block groups is
fine.


That was my thought.  Regarding inlined data, I'm actually quite ok with 
that being on SSD, as that would deliver fast access to tiny objects; 
on an HDD you'd spend the great majority of your time just seeking to 
the data in question rather than transferring it.


Our existing COW filesystem this is replacing actually did exactly this, 
except it would store the first N KB of each and every object on SSD, so 
even for large files you could get the headers out quickly (as many 
indexing apps want to do).


Best,

ellis


Re: Metadata / Data on Heterogeneous Media

2018-02-15 Thread Ellis H. Wilson III

On 02/15/2018 02:11 PM, Hugo Mills wrote:

On Thu, Feb 15, 2018 at 12:15:49PM -0500, Ellis H. Wilson III wrote:

In discussing the performance of various metadata operations over
the past few days I've had this idea in the back of my head, and
wanted to see if anybody had already thought about it before
(likely, I would guess).

It appears based on this page:
https://btrfs.wiki.kernel.org/index.php/Btrfs_design
that data and metadata in BTRFS are fairly well isolated from one
another, particularly in the case of large files.  This appears
reinforced by a recent comment from Qu ("...btrfs strictly
split metadata and data usage...").

Yet, while there are plenty of options to RAID0/1/10/etc across
generally homogeneous media types, there doesn't appear to be any
functionality (at least that I can find) to segment different BTRFS
internals to different types of devices.  E.G., place metadata trees
and extent block groups on SSD, and data trees and extent block
groups on HDD(s).

Is this something that has already been considered (and if so,
implemented, which would make me extremely happy)?  Is it feasible
it is hasn't been approached yet?  I admit my internal knowledge of
BTRFS is fleeting, though I'm trying to work on that daily at this
time, so forgive me if this is unapproachable for obvious
architectural reasons.


Well, it's been discussed, and I wrote up a theoretical framework
which should cover a wide range of use-cases:

https://www.spinics.net/lists/linux-btrfs/msg33916.html

I never got round to implementing it, though -- I ran into issues
over storing the properties/metadata needed to configure it.


Very interesting thread.  Thank you for sharing Hugo.  That nomenclature 
is rather expressive, and the design covers a much broader base than I 
was imagining.


Best,

ellis


Re: Status of FST and mount times

2018-02-16 Thread Ellis H. Wilson III

On 02/15/2018 08:55 PM, Qu Wenruo wrote:

On 2018年02月16日 00:30, Ellis H. Wilson III wrote:

Very helpful information.  Thank you Qu and Hans!

I have about 1.7TB of homedir data newly rsync'd data on a single
enterprise 7200rpm HDD and the following output for btrfs-debug:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
total bytes 6001175126016
bytes used 1832557875200

Hans' (very cool) tool reports:
ROOT_TREE 624.00KiB 0(    38) 1( 1)
EXTENT_TREE   327.31MiB 0( 20881) 1(    66) 2( 1)


Extent tree is not so large, a little unexpected to see such slow mount.

BTW, how many chunks do you have?

It could be checked by:

# btrfs-debug-tree -t chunk <device> | grep CHUNK_ITEM | wc -l


Since yesterday I've doubled the size by copying the homedir dataset in 
again.  Here are new stats:


extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
total bytes 6001175126016
bytes used 3663525969920

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE   1.14MiB 0(72) 1( 1)
EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
CHUNK_TREE384.00KiB 0(23) 1( 1)
DEV_TREE  272.00KiB 0(16) 1( 1)
FS_TREE11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   0.00B
DATA_RELOC_TREE16.00KiB 0( 1)

The old mean mount time was 4.319s.  It now takes 11.537s for the 
doubled dataset.  Again, please realize this is on an old kernel 
(4.5.5), so perhaps newer ones will perform better, but I'd still 
like to understand this delay more.  Should I expect this to scale in 
this way all the way up to my proposed 60-80TB filesystem so long as the 
file size distribution stays roughly similar?  That would definitely be 
in terms of multiple minutes at that point.



Taking 100 snapshots (no changes between snapshots however) of the above
subvolume doesn't appear to impact mount/umount time.


100 unmodified snapshots won't affect mount time.

It needs new extents, which can be created by overwriting extents in
snapshots.
So it won't really cause much difference if all these snapshots are all
unmodified.


Good to know, thanks!


Snapshot creation
and deletion both operate at between 0.25s to 0.5s.


IIRC snapshot deletion is delayed, so the real work doesn't happen when
"btrfs sub del" returns.


I was using btrfs sub del -C for the deletions, so I believe (if that 
command truly waits for the subvolume to be utterly gone) the timing 
captures the entire cost of deleting the snapshot.


Best,

ellis


Re: Status of FST and mount times

2018-02-16 Thread Ellis H. Wilson III

On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to to is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.


Thanks Hans.  I will say multiple minutes (by the looks of things, I'll 
end up near to an hour for 60TB if this non-linear scaling continues) to 
mount a filesystem is undesirable, but I won't offer that criticism 
without thinking constructively for a moment:


Help me out by referencing the tree in question if you don't mind, so I 
can better understand the point of picking all these "apples" (I would 
guess for capacity reporting via df, but maybe there's more).


Typical disclaimer that I haven't yet grokked the various inner-workings 
of BTRFS, so this is quite possibly a terrible or unapproachable idea:


On umount, whatever metadata the mount-time tree walk produced must 
already be in memory (otherwise the walk could have been done lazily 
after a quick mount).  Therefore, could we not 
stash this metadata at or associated with, say, the root of the 
subvolumes?  This way you can always determine on mount quickly if the 
cache is still valid (i.e., no situation like: remount with old btrfs, 
change stuff, umount with old btrfs, remount with new btrfs, pain).  I 
would guess generation would be sufficient to determine if the cached 
metadata is valid for the given root block.


This would scale with number of subvolumes (but not snapshots), and 
would be reasonably quick I think.


Thoughts?

ellis


Re: Status of FST and mount times

2018-02-16 Thread Ellis H. Wilson III

On 02/16/2018 09:42 AM, Ellis H. Wilson III wrote:

On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to to is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch. The speed with which you can climb
around (random read disk access IO speed for btrfs, because your disk
cache is empty when first mounting) determines how quickly you're done.

So, yes.


Thanks Hans.  I will say multiple minutes (by the looks of things, I'll 
end up near to an hour for 60TB if this non-linear scaling continues) to 
mount a filesystem is undesirable, but I won't offer that criticism 
without thinking constructively for a moment:


Help me out by referencing the tree in question if you don't mind, so I 
can better understand the point of picking all these "apples" (I would 
guess for capacity reporting via df, but maybe there's more).


Typical disclaimer that I haven't yet grokked the various inner-workings 
of BTRFS, so this is quite possibly a terrible or unapproachable idea:


On umount, you must already have whatever metadata you were doing the 
tree walk on mount for in-memory (otherwise you would have been able to 
lazily do the treewalk after a quick mount).  Therefore, could we not 
stash this metadata at or associated with, say, the root of the 
subvolumes?  This way you can always determine on mount quickly if the 
cache is still valid (i.e., no situation like: remount with old btrfs, 
change stuff, umount with old btrfs, remount with new btrfs, pain).  I 
would guess generation would be sufficient to determine if the cached 
metadata is valid for the given root block.


This would scale with number of subvolumes (but not snapshots), and 
would be reasonably quick I think.


I see that on 02/13 Qu commented on a similar idea, except he proposed 
perhaps a richer version of my above suggestion (making block groups into 
their own tree).  The concern was that it would be a lot of work since it 
modifies the on-disk format.  That's a reasonable worry.


I will get a new kernel, expand my array to around 36TB, and will 
generate a plot of mount times against extents going up to at least 30TB 
in increments of 0.5TB.  If this proves to reach absurd mount time 
delays (to be specific, anything above around 60s is untenable for our 
use), we may very well be sufficiently motivated to implement the above 
improvement and submit it for consideration.  Accordingly, if anybody 
has additional and/or more specific thoughts on the optimization, I am 
all ears.
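
The sweep itself would be simple -- roughly the following after each 
~0.5TB rsync increment, so the plot can carry both mount time and chunk 
count (paths as in my earlier mails):

$ sudo umount /mnt/btrfs
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null
$ /usr/bin/time -f '%e' sudo mount /dev/sdb /mnt/btrfs
$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l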


Best,

ellis


Re: Status of FST and mount times

2018-02-20 Thread Ellis H. Wilson III

On 02/16/2018 07:59 PM, Qu Wenruo wrote:

On 2018年02月16日 22:12, Ellis H. Wilson III wrote:

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454


OK, this explains everything.

There are too many chunks.
This means at mount you need to search for block group items 3454 times.

Even if each search only needs to iterate 3 tree blocks, multiplied by
3454 it is still a lot of work.
Although some tree blocks like the root node and level 1 nodes can be
cached, we still need to read about 3500 tree blocks.

If the fs is created using a 16K nodesize, this means you need to do
random reads of about 54M in 16K blocks.

No wonder it takes some time.

Normally I would expect each data and metadata chunk to be 1G.

If there is nothing special, it means your filesystem is already larger
than 3T.
If your used space is way smaller (less than 30%) than 3.5T, then this
means your chunk usage is pretty low, and in that case, balance to
reduce number of chunks (block groups) would reduce mount time.


The nodesize is 16K, and the filesystem data is 3.32TiB as reported by 
btrfs fi df.  So, from what I am hearing, this mount time is normal for 
a filesystem this size.  Ignoring a more complex and proper fix like the 
ones we've been discussing, would bumping the nodesize reduce the number 
of chunks, thereby reducing the mount time?


I don't see why balance would come into play here -- my understanding 
was that was for aged filesystems.  The only operations I've done on 
here was:

1. Format filesystem clean
2. Create a subvolume
3. rsync our home directories into that new subvolume
4. Create another subvolume
5. rsync our home directories into that new subvolume

Accordingly, zero (or at least, extremely little) data should have been 
overwritten, so I would expect things to be fairly well allocated 
already.  Please correct me if this is naive thinking.



I was using btrfs sub del -C for the deletions, so I believe (if that
command truly waits for the subvolume to be utterly gone) it captures
the entirety of the snapshot.


No, snapshot deletion is completely delayed in background.

-C only ensures that even if a power loss happens after the command returns,
you won't see the snapshot anywhere; it will still be deleted in the background.


Ah, I had no idea.  Thank you!  Is there any way to "encourage" 
btrfs-cleaner to run at specific times, which I presume is the snapshot 
deletion process you are referring to?  If it can be told to run at a 
given time, can I throttle how fast it works, such that I avoid some of 
the high foreground interruption I've seen in the past?
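
One related thing I did find (my reading of the btrfs-subvolume man page, 
so correct me if wrong): `btrfs subvolume sync` blocks until queued 
deletions have actually been cleaned, which at least makes the cleanup 
window explicit even if it doesn't schedule or throttle the cleaner:

$ sudo btrfs subvolume delete -C /mnt/btrfs/snap-001
$ sudo btrfs subvolume sync /mnt/btrfs    # returns once btrfs-cleaner is done with it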


Thanks,

ellis


Re: Status of FST and mount times

2018-02-21 Thread Ellis H. Wilson III

On 02/20/2018 08:49 PM, Qu Wenruo wrote:

On 2018年02月16日 22:12, Ellis H. Wilson III wrote:

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454



Increasing the node size may reduce the extent tree size, although by at
most one level AFAIK.

But considering that the higher a node is, the more likely it is to be
cached, reducing the tree height wouldn't bring much performance impact
AFAIK.

If one could do a real-world benchmark to refute or prove my assumption,
it would be much better though.


I'm willing to try this if you tell me exactly what you'd like me to do. 
 I've not mucked with nodesize before, so I'd like to avoid changing it 
to something absurd.
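
For reference, my understanding is that nodesize can only be set at mkfs 
time, so the test would be a fresh format and re-rsync of the same 
dataset; something like the following, with 32K chosen arbitrarily between 
the 16K default and the 64K maximum:

$ sudo mkfs.btrfs -f -n 32768 /dev/sdb
$ sudo mount /dev/sdb /mnt/btrfs
# ...rsync the same ~3.3TB homedir dataset back in, then re-measure mount time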



Qu's suggestion is actually independent of all the above reasons, but
does kind of fit in with the fourth as another case of preventative
maintenance.


My suggestion is to use balance to reduce number of block groups, so we
could do less search at mount time.

It's more like reason 2.

But it only works for case where there are a lot of fragments so a lot
of chunks are not fully utilized.
Unfortunately, that's not the case for OP, so my suggestion doesn't make
sense here.


I ran the balance all the same, and the number of chunks has not 
changed.  Before 3454, and after 3454:

 $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

HOWEVER, the time to mount has gone up somewhat significantly, from 
11.537s to 16.553s, which was very unexpected.  Output from previously 
run commands shows the extent tree metadata grew about 25% due to the 
balance.  Everything else stayed roughly the same, and no additional 
data was added to the system (nor snapshots taken, nor additional 
volumes added, etc):


Before balance:
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE   1.14MiB 0(72) 1( 1)
EXTENT_TREE   644.27MiB 0( 41101) 1(   131) 2( 1)
CHUNK_TREE384.00KiB 0(23) 1( 1)
DEV_TREE  272.00KiB 0(16) 1( 1)
FS_TREE11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
CSUM_TREE   3.50GiB 0(228593) 1(   791) 2( 2) 3( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   0.00B
DATA_RELOC_TREE16.00KiB 0( 1)

After balance:
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE   1.16MiB 0(73) 1( 1)
EXTENT_TREE   806.50MiB 0( 51419) 1(   196) 2( 1)
CHUNK_TREE384.00KiB 0(23) 1( 1)
DEV_TREE  272.00KiB 0(16) 1( 1)
FS_TREE11.55GiB 0(754442) 1(  2179) 2( 5) 3( 2)
CSUM_TREE   3.49GiB 0(227920) 1(   804) 2( 2) 3( 1)
QUOTA_TREE0.00B
UUID_TREE  16.00KiB 0( 1)
FREE_SPACE_TREE   0.00B
DATA_RELOC_TREE16.00KiB 0( 1)


BTW, if OP still wants to try something to possibly to reduce mount time
with same the fs, I could try some modification to current block group
iteration code to see if it makes sense.


I'm glad to try anything if it's helpful to improving BTRFS.  Just let 
me know.


Best,

ellis


Re: Status of FST and mount times

2018-02-21 Thread Ellis H. Wilson III

On 02/21/2018 10:03 AM, Hans van Kranenburg wrote:

On 02/21/2018 03:49 PM, Ellis H. Wilson III wrote:

On 02/20/2018 08:49 PM, Qu Wenruo wrote:

My suggestion is to use balance to reduce the number of block groups, so
we could do less searching at mount time.

It's more like reason 2.

But it only works in cases where there is a lot of fragmentation, so
that a lot of chunks are not fully utilized.
Unfortunately, that's not the case for the OP, so my suggestion doesn't
make sense here.


I ran the balance all the same, and the number of chunks has not
changed.  Before 3454, and after 3454:
  $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

HOWEVER, the time to mount has gone up somewhat significantly, from
11.537s to 16.553s, which was very unexpected.  Output from previously
run commands shows the extent tree metadata grew about 25% due to the
balance.  Everything else stayed roughly the same, and no additional
data was added to the system (nor snapshots taken, nor additional
volumes added, etc):

Before balance:
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE          1.14MiB 0(    72) 1(     1)
EXTENT_TREE      644.27MiB 0( 41101) 1(   131) 2(     1)
CHUNK_TREE       384.00KiB 0(    23) 1(     1)
DEV_TREE         272.00KiB 0(    16) 1(     1)
FS_TREE           11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
CSUM_TREE          3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
QUOTA_TREE           0.00B
UUID_TREE         16.00KiB 0(     1)
FREE_SPACE_TREE      0.00B
DATA_RELOC_TREE   16.00KiB 0(     1)

After balance:
$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE          1.16MiB 0(    73) 1(     1)
EXTENT_TREE      806.50MiB 0( 51419) 1(   196) 2(     1)
CHUNK_TREE       384.00KiB 0(    23) 1(     1)
DEV_TREE         272.00KiB 0(    16) 1(     1)
FS_TREE           11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
CSUM_TREE          3.49GiB 0(227920) 1(   804) 2(     2) 3(     1)
QUOTA_TREE           0.00B
UUID_TREE         16.00KiB 0(     1)
FREE_SPACE_TREE      0.00B
DATA_RELOC_TREE   16.00KiB 0(     1)


Heu, interesting.

What's the output of `btrfs fi df /mountpoint` and `grep btrfs
/proc/self/mounts` (does it contain 'ssd'), and which kernel version is
this? (I get a bit lost in the many messages and subthreads in this
thread.)  I also can't find in the thread which exact command "the
balance" refers to.


Short recap:
- I found the mount time for 1.65TB of home dir data to be long, at ~4s
- Doubling this data on the same btrfs fs to 3.3TB increased mount time
to 11s
- Qu et al. suggested a balance might reduce the number of chunks (around
3400 here), since the chunk walk at mount time was the driving factor in
terms of time
- I ran balance
- Mount time went up to 16s, and everything else stayed the same except
the extent tree.


$ sudo btrfs fi df /mnt/btrfs
Data, single: total=3.32TiB, used=3.32TiB
System, DUP: total=8.00MiB, used=384.00KiB
Metadata, DUP: total=16.50GiB, used=15.82GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

$ sudo grep btrfs /proc/self/mounts
/dev/sdb /mnt/btrfs btrfs rw,relatime,space_cache,subvolid=5,subvol=/ 0 0

 $ uname -a
Linux  4.5.5-300.fc24.x86_64 #1 SMP Thu May 19 13:05:32 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux


I plan to rerun this on a newer kernel, but haven't had time to spin up 
another machine with a modern kernel yet, and this machine is also being 
used for other things right now so I can't just upgrade it.



And what does this tell you?

https://github.com/knorrie/python-btrfs/blob/develop/examples/show_free_space_fragmentation.py


$ sudo ./show_free_space_fragmentation.py /mnt/btrfs
No Free Space Tree (space_cache=v2) found!
Falling back to using the extent tree to determine free space extents.
vaddr 6529453391872 length 1073741824 used_pct 27 free space fragments 1 
score 0

Skipped because of usage > 90%: 3397 chunks

So nearly all ~3400 chunks are more than 90% full, which is consistent
with the balance not reducing the chunk count: there was essentially
nothing for it to compact.
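
(Side note, since the script reports there is no free space tree on this
fs: as far as I know it can be enabled once by mounting with
space_cache=v2, after which it persists; a sketch, not something I have
tried on this filesystem yet:

# one-time conversion to the free space tree (space_cache=v2)
mount -o clear_cache,space_cache=v2 /dev/sdb /mnt/btrfs
)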

Best,

ellis
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] btrfs: Speedup btrfs_read_block_groups()

2018-02-22 Thread Ellis H. Wilson III

On 02/21/2018 11:56 PM, Qu Wenruo wrote:

On 2018年02月22日 12:52, Qu Wenruo wrote:

btrfs_read_block_groups() is used to build up the block group cache for
all block groups, so it will iterate all block group items in extent
tree.

For a large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
thousands of times, which is the most time-consuming part of mounting
btrfs.

So this patch will try to speed it up by:

1) Avoid unnecessary readahead
We were using READA_FORWARD to search for block group items.
However, block group items are in fact scattered across quite a lot of
leaves, so doing readahead just wastes IO (which matters especially
for HDDs).

In a real-world case, a filesystem with 3T of used space would have
about 50K extent tree leaves but only 3K block group items, meaning we
need to iterate over 16 leaves on average to meet one block group item.

So readahead doesn't help; it only wastes slow HDD seeks.

2) Use chunk mapping to locate block group items
Since one block group item always has exactly one corresponding chunk
item, we can use the chunk mapping to get the block group item's size.

With the block group item size, we can do a pinpoint tree search
instead of searching with some uncertain value and doing a forward
search.

In some cases, e.g. when the next BLOCK_GROUP_ITEM is in the next leaf
of the current path, this saves an unnecessary tree block read.

Cc: Ellis H. Wilson III 


Hi Ellis,

Would you please try this patch to see if it helps speed up the mount
of your large filesystem?


I will try either tomorrow or over the weekend.  I'm waiting on hardware
that I can build and load a custom kernel on.


Thanks so much for taking a stab at this!

ellis
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] btrfs: Speedup btrfs_read_block_groups()

2018-02-23 Thread Ellis H. Wilson III

On 02/22/2018 06:37 PM, Qu Wenruo wrote:

On 2018年02月23日 00:31, Ellis H. Wilson III wrote:

On 02/21/2018 11:56 PM, Qu Wenruo wrote:

On 2018年02月22日 12:52, Qu Wenruo wrote:

btrfs_read_block_groups() is used to build up the block group cache for
all block groups, so it will iterate all block group items in extent
tree.

For a large filesystem (TB level), it will search for BLOCK_GROUP_ITEM
thousands of times, which is the most time-consuming part of mounting
btrfs.

So this patch will try to speed it up by:

1) Avoid unnecessary readahead
     We were using READA_FORWARD to search for block group items.
     However, block group items are in fact scattered across quite a lot
     of leaves, so doing readahead just wastes IO (which matters
     especially for HDDs).

     In a real-world case, a filesystem with 3T of used space would have
     about 50K extent tree leaves but only 3K block group items, meaning
     we need to iterate over 16 leaves on average to meet one block group
     item.

     So readahead doesn't help; it only wastes slow HDD seeks.

2) Use chunk mapping to locate block group items
     Since one block group item always has exactly one corresponding
     chunk item, we can use the chunk mapping to get the block group
     item's size.

     With the block group item size, we can do a pinpoint tree search
     instead of searching with some uncertain value and doing a forward
     search.

     In some cases, e.g. when the next BLOCK_GROUP_ITEM is in the next
     leaf of the current path, this saves an unnecessary tree block read.

Cc: Ellis H. Wilson III 


Hi Ellis,

Would you please try this patch to see if it helps speed up the mount
of your large filesystem?


I will try either tomorrow or over the weekend.  I'm waiting on hardware
that I can build and load a custom kernel on.


If you're using Arch Linux, I could build the package for you.

(Unfortunately I'm not that familiar with other distributions.)

Thanks,
Qu


No sweat.  I'm not running Arch anywhere, so I was glad to handle this myself.

Short story: It doesn't appear to have any notable impact on mount time.

Long story:
#Built a modern kernel:
git clone https://github.com/kdave/btrfs-devel
cd'd into btrfs-devel
copied my current kernel config in /boot to .config
make olddefconfig
make -j16
make modules_install
make install
grub2-mkconfig -o /boot/grub/grub.cfg
reboot

#Reran tests with vanilla 4.16.0-rc1+ kernel
As root, of the form: time mount /dev/sdb /mnt/btrfs
5 iteration average: 16.869s
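
(The per-run timing was of this rough form; a sketch, since the exact
loop isn't shown here, and the cache-dropping step is an assumption on
my part:

#Time five cold mounts, then average the elapsed times
for i in 1 2 3 4 5; do
    umount /mnt/btrfs 2>/dev/null
    echo 3 > /proc/sys/vm/drop_caches
    /usr/bin/time -f '%e s' mount /dev/sdb /mnt/btrfs
done
)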

#Applied your patch, rebuild, switched kernel module
wget -O - 'https://patchwork.kernel.org/patch/10234619/mbox' | git am -
make -j16
make modules_install
rmmod btrfs
modprobe btrfs

#Reran tests with patched 4.16.0-rc1+ kernel
As root, of the form: time mount /dev/sdb /mnt/btrfs
5 iteration average: 16.642s

So, there's a slight improvement against vanilla 4.16.0-rc1+, but it's
still slightly slower than my original runs on 4.5.5, which got me
16.553s.  In any event, most of this is statistically insignificant
since the standard deviation is about two tenths of a second.


So, my conclusion here is that this problem needs to be handled at an
architectural level to be truly solved (read: have mounts that take a few
seconds at worst), which requires either:
a) On-disk format changes, like you (Qu) suggested some time back, for a
tree of block groups, or
b) Lazy block group walking post-mount, plus algorithms that can cope with
making sub-optimal choices.  One would likely want to block certain
operations (balance, defrag, etc.) until the lazy post-mount walk
completes, since those have more reason to require complete knowledge of
the usage of each block group.


I may take a stab at b), but I'm first going to do the tests I promised 
relating to how mount times scale with increased capacity consumption 
for varying file sizes.


Best,

ellis
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html