On 02/15/2018 02:42 AM, Qu Wenruo wrote:
>
> On 2018/02/15 01:08, Nikolay Borisov wrote:
>>
>> On 14.02.2018 18:00, Ellis H. Wilson III wrote:
>>> Hi again -- back with a few more questions:
>>>
>>> Frame-of-reference here: RAID0. Around 70TB raw capacity. No
>>> compression. No quotas enabled. Many (potentially tens to hundreds) of
>>> subvolumes, each with tens of snapshots. No control over size or number
>>> of files, but the directory tree (entries per dir and general tree depth)
>>> can be controlled in case that's helpful.
>>>
>>> 1. I've been reading up about the space cache, and it appears there is a
>>> v2 of it called the free space tree that is much friendlier to large
>>> filesystems such as the one I am designing for. It is listed as OK/OK
>>> on the wiki status page, but there is a note that btrfs-progs treats it
>>> as read-only (i.e., my biggest concern is that btrfs check --repair
>>> cannot help me without a full space cache rebuild), and the last status
>>> update on this I can find was circa fall 2016. Can anybody give me an
>>> updated status on this feature? From what I read, v1 and tens-of-TB
>>> filesystems will not play well together, so I'm inclined to dig into this.
>>
>> V1 for large filesystems is just awful. Facebook have been experiencing
>> the pain, hence they implemented v2. You can view the space cache tree as
>> the complement of the extent tree. The v1 cache is implemented as a
>> hidden inode, and even though writes (i.e. flushing of the free space
>> cache) are metadata, they are essentially treated as data. This could
>> potentially lead to priority inversions if the cgroups io controller is
>> involved.
>>
>> Furthermore, there is at least 1 known deadlock problem in free space
>> cache v1. So yes, if you want to use btrfs on a multi-TB system, v2 is
>> really the way to go.
>>
>>>
>>> 2. There's another thread on-going about mount delays. I've been
>>> completely blind to this specific problem until it caught my eye.
>>> Does anyone have ballpark estimates for how long very large HDD-based
>>> filesystems will take to mount? Yes, I know it will depend on the
>>> dataset. I'm looking for O() worst-case approximations for
>>> enterprise-grade large drives (12/14TB), as I expect it should scale
>>> with multiple drives, so approximating for a single drive should be good
>>> enough.
>>>
>>> 3. Do long mount delays relate to space_cache v1 vs v2 (I would guess
>>> no, unless it needed to be regenerated)?
>>
>> No, the long mount times seem to be due to the fact that in order for a
>> btrfs filesystem to mount, it needs to enumerate its block group items,
>> and those are stored in the extent tree, which also holds all of the
>> information pertaining to allocated extents. Mixing those data
>> structures in the same tree, plus the fact that block groups are
>> iterated linearly during mount (check btrfs_read_block_groups), means
>> on spinning rust with shitty seek times this can take a while.
>
> And, the space cache is not loaded at mount time.
> It's delayed until we decide to allocate an extent from a given block group.
>
> So the space cache is completely unrelated to long mount time.
>
>>
>> However, this will really depend on the number of extents you have, and
>> having taken a look at the thread you referred to, it seems there is no
>> clear-cut reason why mounting is taking so long on that particular
>> occasion.
>
> Just as Nikolay said, the biggest cause of slow mount is the size of the
> extent tree (and HDD seek time).
>
> The easiest way to get a basic idea of the size of your extent tree is
> to use debug tree:
>
> # btrfs-debug-tree -r -t extent <device>
>
> You would get something like:
>
> btrfs-progs v4.15
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776 level 0   <<<
> total bytes 10737418240
> bytes used 393216
> uuid 651fcf0c-0ffd-4351-9721-84b1615f02e0
>
> That level value would give you some basic idea of the size of your
> extent tree.
>
> For level 0, it can contain about 400 items on average.
> For level 1, it can contain up to ~197K items.
> ...
> For level n, it can contain up to 400 * 493 ^ n items.
> (n <= 7)
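Those per-level bounds can be sketched with a little shell arithmetic. Note that ~400 items per leaf and ~493 keys per internal node are the rough averages quoted above, not exact on-disk constants, so treat the results as order-of-magnitude estimates:

```shell
# Approximate capacity of a btrfs b-tree by root level, using the rough
# fan-out figures from this thread: ~400 items per leaf and ~493 child
# pointers per internal node, i.e. level n holds up to 400 * 493^n items.
# (Uses bash's ** exponentiation operator in arithmetic expansion.)
for level in 0 1 2 3; do
    echo "level $level: up to $((400 * 493 ** level)) items"
done
```

So a level-2 extent tree can already describe on the order of 10^8 extents, which is why the root level alone is a useful quick gauge.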
Another one to get that data:

https://github.com/knorrie/python-btrfs/blob/master/examples/show_metadata_tree_sizes.py

Example, with amount of leaves on level 0 and nodes higher up:

# ./show_metadata_tree_sizes.py /
ROOT_TREE         336.00KiB 0(    20) 1(  1)
EXTENT_TREE       123.52MiB 0(  7876) 1( 28) 2(  1)
CHUNK_TREE        112.00KiB 0(     6) 1(  1)
DEV_TREE           80.00KiB 0(     4) 1(  1)
FS_TREE          1016.34MiB 0( 64113) 1(881) 2( 52)
CSUM_TREE         777.42MiB 0( 49571) 1(183) 2(  1)
QUOTA_TREE            0.00B
UUID_TREE          16.00KiB 0(     1)
FREE_SPACE_TREE   336.00KiB 0(    20) 1(  1)
DATA_RELOC_TREE    16.00KiB 0(     1)

>
> Thanks,
> Qu
>
>>
>>>
>>> Note that I'm not sensitive to multi-second mount delays. I am
>>> sensitive to multi-minute mount delays, hence why I'm bringing this up.
>>>
>>> FWIW: I am currently populating a machine we have with 6TB drives in it
>>> with real-world home dir data to see if I can replicate the mount issue.
>>>
>>> Thanks,
>>>
>>> ellis
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Hans van Kranenburg
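As a postscript to the v2 discussion above: switching an existing filesystem to the free space tree is done at mount time. A minimal sketch, where /dev/sdX and /mnt are placeholders for your device and mount point, and the exact flag decoding in dump-super output depends on your btrfs-progs version:

```shell
# Convert to the free space tree (space cache v2) discussed above.
# The tree is built during this first mount, which can take a while
# on a large filesystem; the feature then persists across mounts.
mount -o space_cache=v2 /dev/sdX /mnt

# Verify the feature is recorded in the superblock: look for
# FREE_SPACE_TREE in the compat_ro_flags line.
btrfs inspect-internal dump-super /dev/sdX | grep -i compat_ro
```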