On 2018-02-20 23:41, Austin S. Hemmelgarn wrote:
> On 2018-02-20 09:59, Ellis H. Wilson III wrote:
>> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>>> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>> 3454
>>>
>>> OK, this explains everything.
>>>
>>> There are too many chunks.
>>> This means that at mount time you need to search for block group
>>> items 3454 times.
>>>
>>> Even though each search only needs to iterate about 3 tree blocks,
>>> multiplied by 3454 that is still a lot of work.
>>> Although some tree blocks, like the root node and level-1 nodes, can
>>> be cached, we still need to read about 3500 tree blocks.
>>>
>>> If the fs is created with a 16K nodesize, this means you need to do
>>> about 54M of random reads in 16K blocks (3454 * 16KiB ~= 54MiB).
>>>
>>> No wonder it takes some time.
>>>
>>> Normally I would expect a 1G chunk size for both data and metadata
>>> chunks.
>>>
>>> If there is nothing special, this means your filesystem is already
>>> larger than 3T.
>>> If your used space is way smaller than 3.5T (less than 30%), then
>>> your chunk usage is pretty low, and in that case a balance to reduce
>>> the number of chunks (block groups) would reduce the mount time.
>>
>> The nodesize is 16K, and the filesystem data is 3.32TiB as reported
>> by btrfs fi df.  So, from what I am hearing, this mount time is
>> normal for a filesystem of this size.  Ignoring a more complex and
>> proper fix like the ones we've been discussing, would bumping the
>> nodesize reduce the number of chunks, thereby reducing the mount
>> time?
> It probably would not.  Chunk size is based only on the total size of
> the filesystem, with reasonable base values, so you would still need
> at least as many chunks to store the same amount of data (increase the
> node size too much, though, and you will end up with more chunks,
> because more empty space will be wasted).

Increasing the node size may reduce the extent tree size, although by
at most one level AFAIK.
But considering that the higher a node sits in the tree, the more
likely it is to be cached, reducing the tree height shouldn't have much
performance impact AFAIK.

If someone could do a real-world benchmark to disprove or confirm my
assumption, that would be much better though.
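(A trivial way to measure this, in case anyone wants to try -- the
device and mount point below are just placeholders:

  # echo 3 > /proc/sys/vm/drop_caches   # drop caches so mount really hits the disk
  # time mount /dev/sdb /mnt

Run it against the same data on filesystems created with different
nodesizes; the nodesize can only be chosen at mkfs time, e.g.
"mkfs.btrfs -n 32k <dev>", which destroys existing data on the device.)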
>>
>> I don't see why balance would come into play here -- my understanding
>> was that it was for aged filesystems.  The only operations I've done
>> on here were:
>> 1. Format the filesystem clean
>> 2. Create a subvolume
>> 3. rsync our home directories into that new subvolume
>> 4. Create another subvolume
>> 5. rsync our home directories into that new subvolume
>>
>> Accordingly, zero (or at least extremely little) data should have
>> been overwritten, so I would expect things to be fairly well
>> allocated already.  Please correct me if this is naive thinking.
> Your logic is in general correct regarding data, but not necessarily
> metadata.  Assuming you did not use the `--inplace` option for rsync,
> it had to issue a rename for each individual file that got copied in,
> and as a result a lot of metadata was likely rewritten.
>
> As far as balance being for aged filesystems, that's not exactly true.
> There are four big reasons you might run a balance:
>
> 1. As part of reshaping a volume.  You generally want to run a balance
> whenever the number of disks in a volume permanently increases (it
> happens automatically when the number permanently decreases, as the
> device deletion operation is a special type of balance under the
> hood).  It's also used for converting chunk profiles.
> 2. To free up empty space inside chunks when the filesystem is full at
> the chunk level.
> 3. To redistribute data across multiple disks more evenly after
> deleting a lot of data.
> 4. To reduce the likelihood of 2 or 3 becoming an issue.
>
> Reasons 2 and 3 are generally more likely to be needed on old volumes.
> Reason 1 is independent of the age of a volume.  Reason 4 is the
> reason for the regular filtered balances that I and some other people
> recommend as part of preventative maintenance, and it is likewise
> generally independent of the age of a volume.
>
> Qu's suggestion is actually independent of all the above reasons, but
> does kind of fit in with the fourth as another case of preventative
> maintenance.

My suggestion is to use balance to reduce the number of block groups,
so that less searching is needed at mount time.
It's closest to reason 2, but it only helps when there is a lot of
fragmentation, i.e. when a lot of chunks are not fully utilized.
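For example, a filtered balance that only rewrites chunks which are
less than 50% utilized (the mount point is a placeholder):

  $ sudo btrfs balance start -dusage=50 -musage=50 /mnt

Chunks that are already mostly full are left untouched, so this is much
cheaper than a full balance, and the data from nearly-empty chunks gets
repacked into fewer block groups.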
Unfortunately, that's not the case for the OP, so my suggestion doesn't
make sense here.

BTW, if the OP still wants to try something that might reduce the mount
time on the same fs, I could try some modifications to the current
block group iteration code to see if they help.

Thanks,
Qu

>>
>>>> I was using btrfs sub del -C for the deletions, so I believe (if
>>>> that command truly waits for the subvolume to be utterly gone) it
>>>> captures the entirety of the snapshot.
>>>
>>> No, snapshot deletion is completely delayed in the background.
>>>
>>> -C only ensures that, even if a powerloss happens after the command
>>> returns, you won't see the snapshot anywhere -- but it will still be
>>> deleted in the background.
>>
>> Ah, I had no idea.  Thank you!  Is there any way to "encourage"
>> btrfs-cleaner to run at specific times, which I presume is the
>> snapshot deletion process you are referring to?  If it can be told to
>> run at a given time, can I throttle how fast it works, so that I
>> avoid some of the high foreground interruption I've seen in the past?
> I don't think there's any way to do this right now (though it would be
> nice if there was).  In theory, you could adjust the priority of the
> kernel thread itself, but messing around with kthread priorities is
> seriously dangerous even if you know exactly what you're doing.
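For the "encourage btrfs-cleaner" part: I'm not aware of a knob to
schedule or throttle it either, but if the goal is just to wait until
the background cleaning has finished, "btrfs subvolume sync" does
exactly that.  A minimal example (the paths are placeholders):

  $ sudo btrfs subvolume delete -C /mnt/snapshot-old
  $ sudo btrfs subvolume sync /mnt

The second command blocks until all deleted subvolumes on that
filesystem have actually been cleaned up.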