Richard Sharpe posted on Mon, 30 Jan 2012 11:35:31 -0800 as excerpted:

> I am interested in any feedback on tuning btrfs for throughput?
>
> I am running on 3.2.1 and have set up btrfs across 11 7200RPM 1TB 3.5"
> drives. I told btrfs to mirror metadata and stripe data.
>
> For my current simple throughput tests I am running dd with 256kiB
> blocks and 1M blocks (memory is 64Gib).
>
> All tests are done with conv=fdatasync and then with and without
> oflags=direct.
>
> I get around 800MB/s in the non DIRECTIO case, and around 430MB/s in
> the DIRECTIO case (which is pretty impressive it seems to me).
>
> However, what I would like to know is are there any tuning parameters
> I can tweak to push the numbers up a bit?
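For anyone wanting to reproduce the numbers above, a sketch of the sort of dd invocations being described (./ddtest.tmp stands in for a file on the btrfs mount, and the sizes are scaled down for illustration; scale bs/count back up for a real run):

```shell
# Buffered write; conv=fdatasync forces a flush to disk before dd
# computes and reports its rate, so the page cache doesn't inflate it:
dd if=/dev/zero of=./ddtest.tmp bs=1M count=64 conv=fdatasync

# O_DIRECT write, bypassing the page cache entirely (needs a
# filesystem that supports O_DIRECT; tmpfs, for instance, does not):
dd if=/dev/zero of=./ddtest.tmp bs=1M count=64 oflag=direct

rm -f ./ddtest.tmp
```

The fdatasync and direct variants measure different things (cached write-out vs. raw device path), which is why the two throughput figures quoted above differ so much.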
AFAIK (I'm just researching btrfs at this point; I'm not a dev and am not running it yet), the code is still in high enough flux that fine-tuning for current performance isn't a particularly good idea, as what's best now might not be best in a couple of kernel cycles.

A rather big exception to that could be specific sizes. Btrfs likes powers of two, and 1 TB (base-10) disks are obviously not 1 TiB, more like 930(-ish) GiB. It may be that 896 GiB (512+256+128) sizing will give you slightly better performance than using the full 930-ish GiB drives. You could also try playing around with the various mkfs.btrfs size parameters: --alloc-start, --leafsize, --nodesize, and --sectorsize.

More stable than fine-tuning btrfs tweaks at this point would probably be partition alignment on the physical disk, and general kernel vfs parameters.

Many modern disks use 4 KiB physical sectors while still presenting 512-byte logical sectors for compatibility. Getting the alignment exactly right on them can make a *BIG* difference in performance. Using a good partitioner (such as gptfdisk, aka gdisk, for gpt-based partitioning, as opposed to the old mbr-based partitioning), you should be able to select alignment. Alternatively, you can use the mkfs.btrfs --alloc-start parameter mentioned above to realign btrfs data structures within a larger partition or on the unpartitioned full disk.

It's worth noting that due to MS compatibility efforts, some 4 KiB physical-sector disks are themselves offset, so you can't simply align to 4 KiB and call it good; for best performance you'd need to test 4 KiB blocks at each of the eight possible 512-byte logical-sector offsets. If you have such disks, one of those alignments should be measurably better than all the others, perhaps by several times!

Kernel vfs parameters... that's a discussion for elsewhere.

> I see lots of idle time (80+%) on my 16 cores (probably two by four
> by two).
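A quick way to sanity-check alignment, sketched below. The sysfs "start" attribute reports a partition's start in 512-byte logical sectors; sda/sda1 are placeholder device names, and the fallback value of 2048 (the common 1 MiB-aligned start) is just so the snippet does something sensible when that path doesn't exist:

```shell
# Check whether a partition starts on a 4 KiB boundary.
start=$(cat /sys/block/sda/sda1/start 2>/dev/null || echo 2048)
offset=$(( start * 512 % 4096 ))
if [ "$offset" -eq 0 ]; then
    echo "partition start is 4 KiB aligned"
else
    echo "partition start is misaligned by $offset bytes"
fi
```

The classic bad case is the old msdos default of starting the first partition at sector 63: 63 * 512 = 32256 bytes, which is 3584 bytes past a 4 KiB boundary, so every 4 KiB filesystem block straddles two physical sectors.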
If you're willing to trade CPU cycles for I/O bandwidth, definitely investigate btrfs' mount-time compression options!

zlib-based compression is the older and more stable option. lzo compression is newer and faster, but doesn't achieve as good a compression ratio; it's the default when compression is enabled on newer kernels. Google's snappy compression is the newest option, generally as fast as lzo with compression closer to zlib's, but I'm not sure whether it's available in kernel 3.2 yet. There has been recent discussion on-list of a fourth option, IDR the name, but it's supposedly faster than snappy at about the same compression ratio. But zlib should give you the best compression currently and is the most mature, so I'd recommend it as long as I/O is the bottleneck, not CPU.

There are a number of other performance-tuning mount options with various tradeoffs. Since you're striping data, performance apparently overrides data integrity for you, so nodatasum may be of interest. nobarrier is another "unsafe but boosts performance" option. One more in this category is notreelog. Consider nodatacow as well, altho if you're doing a lot of copying, copy-on-write may in fact perform better due to not needing to actually write so much.

A small/zero number for max_inline=<number> could increase performance at the expense of space; just how much space will depend on mkfs-time parameters. space_cache and inode_cache should increase performance, altho it's worth noting that inode_cache will probably slow things down on the first run after enabling it, where it wasn't enabled before. noacl may be appropriate as well, if you don't need ACLs for security reasons. thread_pool=<number> may be useful too with 16-way SMP, but I don't know the default so couldn't say for sure.

The above mount-options list is from the wiki's getting-started page. The mount options section is at the bottom, below all the distro-specific stuff.
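To illustrate how a few of the safer options above combine (illustrative only; /dev/sdb and /mnt/btrfs are placeholders, and which options are appropriate depends on how much integrity you're willing to trade for speed):

```shell
# One-off mount with compression plus cache/atime tuning (root needed):
mount -o compress=zlib,space_cache,thread_pool=16,noatime /dev/sdb /mnt/btrfs

# Or persistently, the matching /etc/fstab line:
# /dev/sdb  /mnt/btrfs  btrfs  compress=zlib,space_cache,thread_pool=16,noatime  0  0
```

Note that compress= only affects data written after it's enabled; existing files stay as they were until rewritten.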
https://btrfs.wiki.kernel.org/articles/g/e/t/Getting_started.html

Of course, the not-btrfs-specific noatime,nodiratime mount options apply too, but you probably already knew that, or at least figured it out.

> Would I be better of with 10 drives rather than 11, or 12 rather than
> 11?

I'm not sure on this, but it's possible that an even number of drives may work slightly better given the metadata mirroring.

Other than that, basic RAID bus logic applies; the optimum number of drives depends on your data-bus topology. If your data buses are saturated, adding more drives won't help, but a bus can typically handle several spinning-media drives before saturating.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman