On 02/14/2018 06:24 PM, Duncan wrote:
>> Frame-of-reference here: RAID0.  Around 70TB raw capacity.  No
>> compression.  No quotas enabled.  Many (potentially tens to hundreds) of
>> subvolumes, each with tens of snapshots.  No control over size or number
>> of files, but directory tree (entries per dir and general tree depth)
>> can be controlled in case that's helpful.

> ??  How can you control both breadth (entries per dir) AND depth of
> directory tree without ultimately limiting your number of files?

I technically misspoke when I said "No control over size or number of files." There is an upper limit to the metadata (our filesystem's metadata, not BTRFS's) that we can store on an accompanying SSD, which in turn limits the number of files that can ultimately live on our BTRFS RAID0'd HDDs. The current design is tuned to perform well up to that maximum, but it's a relatively shallow tree, so if there are known performance issues with more than N files per directory, or beyond a certain directory depth, I was calling out that I can still change the layout algorithm now.
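
To make "can be controlled" concrete, the kind of knob I mean is a bounded-fanout object layout along these lines (a Python sketch with made-up fanout/depth numbers, not our actual code):

# Rough sketch only: the fan-out and depth numbers are made up, not our
# actual layout. The point is that breadth (entries per dir) and depth
# are tunable without capping the total number of files.
import hashlib
import os

FANOUT = 256   # hypothetical max entries per hashed directory level
DEPTH = 2      # hypothetical number of hashed levels

def path_for(root, object_id):
    """Map an object ID to a shallow, bounded-fanout directory path."""
    h = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
    parts = []
    for _ in range(DEPTH):
        parts.append("%03d" % (h % FANOUT))
        h //= FANOUT
    return os.path.join(root, *parts, object_id)

def store(root, object_id, data):
    """Create the parent directories as needed and write the object."""
    path = path_for(root, object_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

Breadth and depth are tunable there independently of the total file count, which is why I asked whether either dimension has known bad spots on BTRFS.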

> Anyway, AFAIK the only performance issue there would be the (IIRC) ~65535
> limit on directory hard links before additional ones are out-of-lined
> into a secondary node, with the entailing performance implications.

Here I interpret "directory hard links" to mean hard links within a single directory -- not hard links to directories themselves, as on Macs. It's moot anyhow: we support hard links at a much higher level in our parallel file system, so no hard links will exist at all from BTRFS's perspective.
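
(Since it's moot for us I haven't chased the exact number, but for anyone curious where that limit lands on their own kernel/progs combination, a throwaway probe like the following, pointed at a scratch directory on a test BTRFS filesystem, will find it empirically. It just hard-links one file until link() refuses.)

# Illustrative probe: create hard links to one file in a single directory
# until link(2) fails with EMLINK, then report how many names the inode
# ended up with. Run it against a scratch directory you can throw away.
import errno
import os
import sys

def probe_link_limit(scratch_dir):
    target = os.path.join(scratch_dir, "target")
    with open(target, "w"):
        pass
    count = 1  # the original name counts as one link
    try:
        while True:
            os.link(target, os.path.join(scratch_dir, "link%d" % count))
            count += 1
    except OSError as e:
        if e.errno != errno.EMLINK:
            raise
    return count

if __name__ == "__main__":
    print(probe_link_limit(sys.argv[1]))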

> So far, so good.  But then above you mention concern about btrfs-progs
> treating the free-space-tree (free-space-cache-v2) as read-only, and the
> time cost of having to clear and rebuild it after a btrfs check --repair.
>
> Which is what triggered the mismatch warning I mentioned above.  Either
> that raid0 data is of throw-away value appropriate to placement on a
> raid0, and btrfs check --repair is of little concern as the benefits are
> questionable (no guarantees it'll work and the data is either directly
> throw-away value anyway, or there's a backup at hand that /does/ have a
> tested guarantee of viability, or it's not worthy of being called a
> backup in the first place), or it's not.

I think you may be looking at this a touch too black and white, but that's probably because I've not been clear about my use-case. We do have mechanisms at a higher level in our parallel file system to do scale-out, object-based RAID, so in a sense the data is "throw-away": we can lose it without true data loss. However, one should not underestimate the foreground impact of reconstructing 60-80TB of data, even with architectures like ours that scale reconstruction well. When I lose an HDD I fully expect we will need to rebuild that entire BTRFS filesystem, and we can. But I'd like to limit that to real media failure. In other words, if I can't mount my BTRFS filesystem after a power failure, and I can't run btrfs check --repair, then in essence I've lost a lot of data that I now need to rebuild for no "good" reason.
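
For concreteness, the post-power-fail path whose cost I'm trying to bound looks roughly like this. This is illustrative pseudo-tooling, not what we actually run; the device and mountpoint names are made up, and the commands are the stock btrfs-progs ones as I understand them, so please correct me if I have the sequence wrong.

# Sketch of the recovery path after a power failure where the filesystem
# won't mount cleanly. Device/mountpoint names are hypothetical.
import subprocess

DEV = "/dev/sdb"     # hypothetical
MNT = "/mnt/osd0"    # hypothetical

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def recover_after_power_fail():
    # 1. Last-resort repair, only attempted because mount failed.
    run("btrfs", "check", "--repair", DEV)
    # 2. btrfs-progs doesn't update the v2 free space cache, so drop it...
    run("btrfs", "check", "--clear-space-cache", "v2", DEV)
    # 3. ...and let the kernel rebuild it on the next mount.
    run("mount", "-o", "space_cache=v2", DEV, MNT)

if __name__ == "__main__":
    recover_after_power_fail()

On a ~70TB filesystem it's step 3, the free-space-tree rebuild at mount time, whose wall-clock cost worries me.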

Perhaps more critically: when an entire cluster of these systems power-fails, if more than N of the BTRFS-backed nodes come up requiring check --repair prior to mount due to some commonly triggered BTRFS bug (not saying there is one, I'm just being conservative), I'm completely hosed. Restoring PBs of data from backup is a non-starter.

In short, I've been playing coy about the details of my project and need to continue to do so for at least the next 4-6 months, but if you read anything about the company I'm emailing from, you can probably make reasonable guesses about what I'm trying to do.

> It's also worth mentioning that btrfs raid0 mode, as well as single mode,
> hobbles the btrfs data and metadata integrity feature: checksums are
> still generated, stored and checked by default, so integrity problems can
> still be detected, but because raid0 (and single) includes no redundancy,
> there's no second copy (raid1/10) or parity redundancy (raid5/6) to
> rebuild the bad data from, so it's simply gone.

I'm OK with that. We have a concept called "on-demand reconstruction" which lets us rebuild individual objects in our filesystem on demand (one component of such an object would be a failed file on one of the BTRFS filesystems). So long as I can identify that a file has been corrupted, I'm fine.
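
In case it helps to see what "identify" means for us in practice: on raid0/single a data-checksum mismatch surfaces to userspace as EIO on read (scrub and btrfs device stats report it as well), so even a toy scan like the one below would be enough to feed on-demand reconstruction. The mountpoint and the reconstruction hook here are placeholders, not our real interfaces.

# Toy corruption scan: read every file back and flag any path whose data
# can't be read cleanly (EIO is what a failed data checksum with no good
# copy looks like from userspace on raid0/single).
import errno
import os

CHUNK = 1 << 20  # read in 1 MiB chunks

def find_corrupted(mountpoint):
    """Yield paths whose data can't be read back cleanly."""
    for dirpath, _dirnames, filenames in os.walk(mountpoint):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while f.read(CHUNK):
                        pass
            except OSError as e:
                if e.errno == errno.EIO:
                    yield path
                else:
                    raise

def flag_for_reconstruction(path):
    # Placeholder: in our system this would kick off on-demand
    # reconstruction of the object backed by this file.
    print("needs rebuild:", path)

if __name__ == "__main__":
    for bad in find_corrupted("/mnt/osd0"):   # hypothetical mountpoint
        flag_for_reconstruction(bad)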

> 12-14 TB individual drives?
>
> While you /did/ say enterprise grade so this probably doesn't apply to
> you, it might apply to others that will read this.
>
> Be careful that you're not trying to use the "archive application"
> targeted SMR drives for general purpose use.

We're using traditional PMR drives for now; those are available at 12/14TB capacity points presently. I agree with your general sense that SMR drives are unlikely to play particularly well with BTRFS for anything but the truly archival use-case.

Best,

ellis