Re: Understanding BTRFS RAID0 Performance

Austin S. Hemmelgarn Mon, 08 Oct 2018 05:20:25 -0700

On 2018-10-05 20:34, Duncan wrote:

Wilson, Ellis posted on Fri, 05 Oct 2018 15:29:52 +0000 as excerpted:

Is there any tuning in BTRFS that limits the number of outstanding reads
at a time to a small single-digit number, or something else that could
be behind small queue depths?  I can't otherwise imagine what the
difference would be on the read path between ext4 vs btrfs when both are
on mdraid.


It seems I forgot to directly answer that question in my first reply.
Thanks for restating it.

Btrfs doesn't really expose much performance tuning (yet?), at least
outside the code itself.  There are a few very limited knobs, but they're
just that, few and limited or broad-stroke.

There are mount options like ssd/nossd, ssd_spread/nossd_spread, the
space_cache set of options (see below), flushoncommit/noflushoncommit,
commit=<seconds>, etc (see the btrfs (5) manpage), but nothing really to
influence stride length, etc, or to optimize chunk placement between ssd
and non-ssd devices, for instance.

And there's a few filesystem features, normally set at mkfs.btrfs time
(and thus covered in the mkfs.btrfs manpage) but some of which can be
tuned later, but generally, the defaults have changed over time to
reflect the best case, and the older variants are there primarily to
retain backward compatibility with old kernels and tools that didn't
handle the newer variants.

That said, as I think about it there are some tunables that may be worth
experimenting with.  Most or all of these are covered in the btrfs (5)
manpage.

* Given the large device numbers you mention and raid0, you're likely
dealing with multi-TB-scale filesystems.  At this level, the
space_cache=v2 mount option may be useful.  It's not the default yet as
btrfs check, etc, don't yet handle it, but given your raid0 choice you
may not be concerned about that.  Need only be given once after which v2
is "on" for the filesystem until turned off.

* Consider experimenting with the thread_pool=n mount option.  I've seen
very little discussion of this one, but given your interest in
parallelization, it could make a difference.

Probably not as much as you might think. I'll explain a bit more further down where this is being mentioned again.


* Possibly the commit=<seconds> (default 30) mount option.  In theory,
upping this may allow better write merging, tho your interest seems to be
more on the read side, and the commit time has consequences at crash time.

Based on my own experience, having a higher commit time doesn't impact read or write performance much or really help all that much with write merging. All it really helps with is minimizing overhead, but it's not even all that great at doing that.


* The autodefrag mount option may be considered if you do a lot of
existing file updates, as is common with database or VM image files.  Due
to COW this triggers high fragmentation on btrfs, and autodefrag should
help control that.  Note that autodefrag effectively increases the
minimum extent size from 4 KiB to, IIRC, 16 MB, tho it may be less, and
doesn't operate at whole-file size, so larger repeatedly-modified files
will still have some fragmentation, just not as much.  Obviously, you
wouldn't see the read-time effects of this until the filesystem has aged
somewhat, so it may not show up on your benchmarks.

(Another option for such files is setting them nocow or using the
nodatacow mount option, but this turns off checksumming and if it's on,
compression for those files, and has a few other non-obvious caveats as
well, so isn't something I recommend.  Instead of using nocow, I'd
suggest putting such files on a dedicated traditional non-cow filesystem
such as ext4, and I consider nocow at best a workaround option for those
who prefer to use btrfs as a single big storage pool and thus don't want
to do the dedicated non-cow filesystem for some subset of their files.)

* Not really for reads but for btrfs and any cow-based filesystem, you
almost certainly want the (not btrfs specific) noatime mount option.

Actually... This can help a bit for some workloads. Just like the commit time, it comes down to a matter of overhead. Essentially, if you read a file regularly, then with the default of relatime, you've got a guaranteed write requiring a commit of the metadata tree once every 24 hours. It's not much to worry about for just one file, but if you're reading a very large number of files all the time, it can really add up.
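
To make the overhead concrete, here is a minimal sketch of the relatime rule as I understand it: atime is rewritten when it is not newer than mtime or ctime, or when it is at least 24 hours old. The function names and the hourly-read scenario are illustrative assumptions, not anything taken from the kernel or from btrfs.

```python
DAY = 24 * 60 * 60  # seconds in 24 hours


def relatime_updates_atime(atime, mtime, ctime, now):
    """Return True if a read at time `now` would rewrite atime under relatime."""
    # Refresh when the stored atime is not newer than the last change to the file.
    if atime <= mtime or atime <= ctime:
        return True
    # Otherwise refresh only once the stored atime is at least a day old.
    return now - atime >= DAY


def count_atime_writes(read_times, atime=0, mtime=0, ctime=0):
    """Count the atime writes caused by reads of a file that is never modified."""
    writes = 0
    for now in sorted(read_times):
        if relatime_updates_atime(atime, mtime, ctime, now):
            atime = now  # this is the metadata write that noatime avoids
            writes += 1
    return writes
```

Under this model a file that is read every hour still costs roughly one metadata write per day, so the total scales with the number of files being read regularly. With noatime that count drops to zero.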


* While it has serious filesystem integrity implications and thus can't
be responsibly recommended, there is the nobarrier mount option.  But if
you're already running raid0 on a large number of devices you're already
gambling with device stability, and this /might/ be an additional risk
you're willing to take, as it should increase performance.  But for
normal users it's simply not worth the risk, and if you do choose to use
it, it's at your own risk.

Agreed, if you're running RAID0 with this many drives, nobarrier may be worth it for a little bit of extra performance. It will make writes a bit faster, and make them have less impact on concurrent reads.


* If you're enabling the discard mount option, consider trying with it
off, as it can affect performance if your devices don't support queued-
trim.  The alternative is fstrim, presumably scheduled to run once a week
or so.  (The util-linux package includes an fstrim systemd timer and
service set to run once a week.  You can activate that, or equivalent
cron job if you're not on systemd.)

Even if you have queued discard support, you may still be better off using fstrim instead. While queuing discards reduces their performance impact, some device firmware still can't handle them efficiently. Pretty much, test both ways, see which works better for your workload.


* For filesystem features you may look at no_holes and skinny_metadata.
These are both quite stable and at least skinny-metadata is now the
default.  These are normally set at mkfs.btrfs time, but can be modified
later.  Setting at mkfs time should be more efficient.

* At mkfs.btrfs time, you can set metadata --nodesize.  The newer default
is 16 KiB, while the old default was the (minimum for amd64/x86) 4 KiB,
and the maximum is 64 KiB.  See the mkfs.btrfs manpage for the details as
there's a tradeoff, smaller sizes increase (metadata) fragmentation but
decrease lock contention, while larger sizes pack more efficiently and
are less fragmented but updating is more expensive.  The change in
default was because 16 KiB was a win over the old 4 KiB for most use-
cases, but the 32 or 64 KiB options may or may not be, depending on use-
case, and of course if you're bottlenecking on locks, 4 KiB may still be
a win.

One caveat here: if you're running on top of another RAID platform, you can often get a small performance boost by matching the node size to the chunk size for the underlying RAID layer (so, the chunk size that replication is done at for replicated RAID, or the amount of data per disk per stripe for striped stuff).
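
As a rough sketch of that matching, the helper below picks the largest power-of-two node size that does not exceed the RAID chunk size, clamped to the 4 KiB to 64 KiB range quoted above. The function name and the round-down policy are my own assumptions, and the 4 KiB minimum assumes an amd64/x86 system.

```python
KIB = 1024
MIN_NODESIZE = 4 * KIB   # assumed minimum (4 KiB pages, as on amd64/x86)
MAX_NODESIZE = 64 * KIB  # maximum node size mentioned above


def suggest_nodesize(raid_chunk_bytes):
    """Largest power-of-two node size <= the RAID chunk size, within btrfs limits."""
    if raid_chunk_bytes <= 0:
        raise ValueError("chunk size must be positive")
    # Clamp to the supported range first...
    clamped = max(MIN_NODESIZE, min(raid_chunk_bytes, MAX_NODESIZE))
    # ...then round down to a power of two.
    return 1 << (clamped.bit_length() - 1)
```

For example, suggest_nodesize(16 * KIB) gives 16384, while any chunk size of 64 KiB or larger can only be approximated with the 64 KiB maximum. The result is what you would hand to --nodesize at mkfs.btrfs time.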



Among all those, I'd be especially interested in what thread_pool=n does
or doesn't do for you, both because it specifically mentions
parallelization and because I've seen little discussion of it.

There's been little discussion because the default value that gets selected is actually near optimal in all but the largest systems. The default logic is to set this to either the total number of logical cores in the system or 8, whichever is less. What this does is actually rather simple, it's functionally the maximum number of I/O requests that can be processed concurrently by BTRFS for that volume.

Now, in theory it might sound like increasing this should improve things here. The problem with that is that beyond about 8 requests, you start to see the effects of lock contention a _lot_ more. If you can find a way to mitigate the locking issues (check the end of my reply for more about that), bumping this up _might_ help, but it generally should still not be more than the number of logical cores in the system (I've done some testing myself, no matter how well you have lock contention mitigated, performance gains are at best negligible from using more threads than logical cores, and at worst you'll make performance significantly worse).
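
The rule of thumb above can be sketched as follows. This models the default exactly as described in this message (logical cores or 8, whichever is less); if I recall correctly, btrfs(5) documents it as min(NRCPUS + 2, 8), so treat the result as an approximation. Both function names are hypothetical.

```python
import os


def default_thread_pool(logical_cores=None, cap=8):
    """Default as described above: logical cores or 8, whichever is less."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    return min(logical_cores, cap)


def clamp_thread_pool(requested, logical_cores=None):
    """Keep a hand-picked thread_pool value at or below the logical core count."""
    if logical_cores is None:
        logical_cores = os.cpu_count() or 1
    return max(1, min(requested, logical_cores))
```

So on a 32-core box the described default is 8, and a requested thread_pool=64 would be capped at 32 by this helper, since going past the core count buys nothing in my testing.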


space_cache=v2 may also be a big boost for you, if your filesystems are
the size the 6-device raid0 implies and are at all reasonably populated.

(Metadata) nodesize may or may not make a difference, tho I suspect if so
it'll be mostly on writes (but I'm not familiar with the specifics there
so could be wrong).  I'd be interested to see if it does.

In general I can recommend the no_holes and skinny_metadata features but
you may well already have them, and the noatime mount option, which you
may well already be using as well.  Similarly, I ensure that all my btrfs
are mounted from first mount with autodefrag, so it's always on as the
filesystem is populated, but I doubt you'll see a difference from that in
your benchmarks unless you're specifically testing an aged filesystem
that would be heavily fragmented on its own.
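
Since several of these options may already be in effect, a quick way to check is to parse the mount table. This is a minimal sketch that assumes the usual /proc/mounts field layout (device, mount point, filesystem type, options); the function names and the default list of wanted options are illustrative only. It covers mount options, not mkfs-time features such as no_holes or skinny_metadata.

```python
def btrfs_mount_options(mounts_text):
    """Map each btrfs mount point to the set of mount options in effect."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Assumed layout: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[2] == "btrfs":
            result[fields[1]] = set(fields[3].split(","))
    return result


def missing_options(active, wanted=("noatime", "autodefrag")):
    """Return the wanted options that are not currently active."""
    return [opt for opt in wanted if opt not in active]
```

Feeding it the contents of /proc/self/mounts and calling missing_options() on each entry shows which of the suggestions still need to be added before benchmarking.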


There's one guy here who has done heavy testing on the ssd stuff and
knows btrfs on-device chunk allocation strategies very well, having come
up with a utilization visualization utility and been the force behind the
relatively recent (4.16-ish) changes to the ssd mount option's allocation
strategy.  He'd be the one to talk to if you're considering diving into
btrfs' on-disk allocation code, etc.


There are two other recommendations I would make:

* Stupid as it sounds, depending on your workload, you may actually see better performance with the single profile than the raid0 profile. Essentially, if you've got mostly big files that would span multiple devices in raid0 mode and you don't have a workload that needs concurrent access to the same file regularly, you can reduce contention for access to each individual device by running with the data profile set to single.

* If you can find some way to logically subdivide your workload, you should look at creating one subvolume per subdivision. This will reduce lock contention (and thus make bumping up the `thread_pool` option actually have some benefits).
