On 2018-02-15 10:42, Ellis H. Wilson III wrote:
On 02/14/2018 06:24 PM, Duncan wrote:
Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
compression.  No quotas enabled.  Many (potentially tens to hundreds) of
subvolumes, each with tens of snapshots.  No control over size or number
of files, but directory tree (entries per dir and general tree depth)
can be controlled in case that's helpful.

??  How can you control both breadth (entries per dir) AND depth of the
directory tree without ultimately limiting your number of files?

I technically misspoke when I said "No control over size or number of files."  There is an upper limit to the metadata (our own filesystem's, not BTRFS's) we can store on an accompanying SSD, which in turn limits the number of files that can ultimately live on our BTRFS RAID0'd HDDs.  The current design is tuned to perform well up to that maximum, but it's a relatively shallow tree, so if there are known performance issues beyond N files per directory or beyond a specific directory depth, I wanted to call that out while I can still change the algorithm.
There are scaling performance issues with directory listings on BTRFS for directories with more than a few thousand files, but they're not well documented (most people don't hit them because most applications are designed around the expectation that directory listings will be slow in big directories), and I would not expect them to be much of an issue unless you're dealing with tens of thousands of files and particularly slow storage.
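
If the tree layout is still up for grabs, the simplest mitigation is to fan files out into fixed hash buckets so no single directory grows past a few thousand entries.  A minimal bash sketch, not anything BTRFS-specific -- the store path and the two-level fan-out are made up for illustration:

  # Hypothetical layout: 256*256 bucket directories keyed on the md5 of the name.
  name="some-object-name"
  h=$(printf '%s' "$name" | md5sum)
  dir="/mnt/btrfs/store/${h:0:2}/${h:2:2}"   # e.g. .../3f/a7
  mkdir -p "$dir"
  cp "$name" "$dir/"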

Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535
limit on directory hard links before additional ones are out-of-lined
into a secondary node, with the performance implications that entails.

Here I interpret "directory hard links" to mean hard links within a single directory -- not real directory hard links as on Macs.  It's moot anyhow, as we support hard links at a much higher level in our parallel file system, and no hard links will exist whatsoever from BTRFS's perspective.
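
If it ever becomes relevant, link counts are easy to spot-check from userspace; the paths below are hypothetical:

  # Print the hard-link count of a given inode:
  stat -c %h /path/to/some/file
  # List names in a directory whose inodes carry more than one link:
  find /path/to/dir -maxdepth 1 -type f -links +1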

So far, so good.  But then above you mention concern about btrfs-progs
treating the free-space-tree (free-space-cache-v2) as read-only, and the
time cost of having to clear and rebuild it after a btrfs check --repair.
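
For anyone following along, that clear-and-rebuild cycle looks roughly like the following.  Device and mount point are placeholders, and exactly what btrfs-progs can do with the v2 cache has varied between versions:

  # After a repair, drop the (possibly stale) v2 free space cache...
  btrfs check --repair /dev/sdX
  btrfs check --clear-space-cache v2 /dev/sdX
  # ...and let the next mount rebuild it, which is where the time cost
  # shows up on a filesystem of this size.
  mount -o space_cache=v2 /dev/sdX /mnt/big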

That concern is what triggered the mismatch warning I mentioned above.
Either that raid0 data is of throw-away value appropriate to placement
on a raid0, in which case btrfs check --repair is of little concern
because its benefits are questionable anyway (there's no guarantee it'll
work, and the data is either of directly throw-away value or there's a
backup at hand that /does/ have a tested guarantee of viability,
otherwise it's not worthy of being called a backup in the first place),
or it's not.

I think you may be looking at this a touch too black-and-white, but that's probably because I've not been clear about my use-case.  We do have mechanisms at a higher level in our parallel file system to do scale-out, object-based RAID, so in a sense the data is "throw-away": we can lose it without true data loss.  However, one should not underestimate the foreground impact of reconstructing 60-80TB of data, even with architectures like ours that scale reconstruction well.  When I lose an HDD I fully expect we will need to rebuild that entire BTRFS filesystem, and we can.  But I'd like to limit that to real media failure.  In other words, if I can't mount my BTRFS filesystem after a power failure, and I can't run btrfs check --repair, then in essence I've lost a lot of data that I have to rebuild for no "good" reason.

Perhaps more critically, when an entire cluster of these systems loses power, if more than N of the ones running BTRFS come back up requiring check --repair before they will mount, due to some commonly triggered BTRFS bug (not saying there is one; I'm just conservative), I'm completely hosed.  Restoring PBs of data from backup is a non-starter.
Whether or not this is likely to be an issue depends just as much on the storage hardware as on how BTRFS handles it.  In my own experience, I've only ever lost a BTRFS volume to a power failure _once_ in the multiple years I've been using it, and that ended up being because the power failure trashed the storage device pretty severely (it was super-cheap flash storage).  I do know, however, that there are people who have had much worse results than I have.

In short, I've been playing coy about the details of my project and need to continue to do so for at least the next 4-6 months, but if you read anything about the company I'm emailing from, you can probably make reasonable guesses about what I'm trying to do.

It's also worth mentioning that btrfs raid0 mode, as well as single mode,
hobbles the btrfs data and metadata integrity feature.  Checksums are
still generated, stored and checked by default, so integrity problems can
still be detected, but because raid0 (and single) includes no redundancy,
there's no second copy (raid1/10) or parity redundancy (raid5/6) to
rebuild the bad data from, so it's simply gone.

I'm ok with that.  We have a concept called "on-demand reconstruction" which permits us to rebuild individual objects in our filesystem on demand (one component of which will be a failed file on one of the BTRFS filesystems).  So long as I can identify that a file has been corrupted, I'm fine.
Somewhat ironically, while BTRFS isn't yet great at fixing things when they go wrong, it's pretty good at letting you know something has gone wrong.  Unfortunately, it tends to be far more aggressive in doing so than it sounds like you need it to be.
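
For the "identify which file" part, a scrub plus the kernel log is usually enough.  A hedged sketch -- mount point and inode number are made up:

  # Foreground scrub of the whole filesystem; csum failures are counted
  # here and also logged by the kernel.
  btrfs scrub start -B /mnt/big
  btrfs scrub status /mnt/big
  # Kernel messages name the inode of any file that failed its checksum...
  dmesg | grep -i 'csum failed'
  # ...and that inode can be resolved back to a path, e.g. for inode 257:
  btrfs inspect-internal inode-resolve 257 /mnt/big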

12-14 TB individual drives?

While you /did/ say enterprise grade, so this probably doesn't apply to
you, it might apply to others who read this.

Be careful that you're not trying to use the SMR drives targeted at
"archive applications" for general-purpose use.

We're using traditional PMR drives for now.  Those are available at the 12/14TB capacity points presently.  I agree with your general sense that SMR drives are unlikely to play particularly well with BTRFS for all but the truly archival use-case.
It's not exactly a 'general sense' or a hunch: issues with BTRFS on SMR drives have been pretty well demonstrated in practice, hence Duncan making that statement even though it most likely doesn't apply to you.
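
As an aside, host-aware and host-managed SMR drives announce themselves to the kernel, so those are easy to rule out up front; drive-managed SMR unfortunately reports as a conventional drive and can't be detected this way.  A quick check (sdX is a placeholder):

  # Prints "none" for conventional (or drive-managed SMR) disks,
  # "host-aware" or "host-managed" for zoned SMR disks.
  cat /sys/block/sdX/queue/zoned
  # Or, with a reasonably recent util-linux:
  lsblk -o NAME,ROTA,ZONED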