About a year ago now, I decided to set up a small storage cluster to
store backups (and partially replace Dropbox for my usage, but that's a
separate story). I ended up using GlusterFS as the clustering
software and BTRFS as the back-end storage.
GlusterFS itself is a pretty easy workload as far as cluster
software goes. It does a significant amount of processing before
actually storing the data, but the on-device layout on any given
node is simple: each node holds the full directory structure for the
whole volume, and whatever files happen to be on that node sit
within that tree exactly where they appear in the GlusterFS volume.
Beyond the file data itself, Gluster stores only 2-4 xattrs per file
(used to track synchronization and for its internal data scrubbing),
plus a directory called .glusterfs at the top of the back-end
storage location for the volume, which contains the data needed to
figure out which node a file is on. Overall, the access patterns
mostly mirror whatever is using the Gluster volume, or degrade to
slow streaming writes (when writing files while the back-end nodes
are computationally limited rather than I/O limited), with the
addition of some metadata-heavy activity in the .glusterfs directory
(lots of stat calls there, together with large numbers of small
files).
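If you want to look at those xattrs yourself, getfattr on a back-end
brick works fine (the brick path below is just an example;
substitute your own):

    # Run as root; the trusted.* namespace is not visible otherwise.
    # Expect entries like trusted.gfid alongside the replication and
    # scrubbing related xattrs mentioned above.
    getfattr -d -m . -e hex /bricks/vol0/path/to/file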
As far as overall performance goes, BTRFS is on par with both ext4
and XFS for this usage (at least on my hardware), and I actually see
more SSD-friendly access patterns with BTRFS in this case than with
any other FS I tried.
After some serious experimentation with various configurations for this
during the past few months, I've noticed a handful of other things:
1. The 'ssd' mount option does not actually improve performance on
these SSDs. To a certain extent this surprised me at first, but
having seen Hans' e-mail and what he found about this option, it
makes sense: erase blocks on these devices are 4MB, not 2MB, and the
drives have a very good FTL (so they aggregate all the little writes
properly on their own).
Given this, I'm beginning to wonder whether it makes sense to stop
automatically enabling this option at mount time for certain types
of storage (most SATA and SAS SSDs have reasonably good FTLs, so I
would expect them to behave similarly). Extrapolating further, it
might instead make sense to never enable it automatically, and to
expose the value this option manipulates as its own mount option,
since there are other circumstances where setting specific values
could improve performance (for example, on hardware RAID6, setting
it to the stripe size would probably help on many cheaper
controllers).
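For reference, the heuristic can already be overridden per mount, so
it's easy to test both ways (device and mount point here are just
examples):

    # Force the ssd behavior off even on a non-rotational device:
    mount -o nossd /dev/sdb1 /srv/brick0

    # Or force it on where the kernel thinks the device rotates:
    mount -o ssd /dev/sdb1 /srv/brick0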
2. Up to a certain point, running a single larger BTRFS volume with
multiple subvolumes is more computationally efficient than running
multiple smaller BTRFS volumes. More specifically, system load and
CPU utilization by BTRFS itself are lower, with little noticeable
difference in performance (about 0.5-1% in my tests, YMMV). To a
certain extent this makes sense, but the crossover point was a lot
higher than I expected (around half a terabyte with this workload).
I believe this to be a side effect of how we use per-filesystem
worker pools: we can schedule parallel access better when everything
goes through the same pool than when it's spread across multiple
pools. Having realized this, I think it would be interesting to see
whether a worker pool per physical device (or at least per what the
system sees as a physical device) performs better than our current
method of one pool per filesystem.
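As a concrete illustration, the two layouts I was comparing look
roughly like this (device names and paths are hypothetical):

    # One filesystem, multiple subvolumes -- one worker pool total:
    mkfs.btrfs /dev/sdb1
    mount /dev/sdb1 /srv/storage
    btrfs subvolume create /srv/storage/brick0
    btrfs subvolume create /srv/storage/brick1

    # Versus separate filesystems -- one worker pool each:
    mkfs.btrfs /dev/sdb1
    mkfs.btrfs /dev/sdc1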
3. On these SSDs, running a single partition in dup mode is
marginally more efficient than running two partitions in raid1 mode.
This surprised me somewhat, and I haven't been able to find a clear
explanation for it (I suspect caching has something to do with it,
but I'm not certain), but some limited testing with other SSDs
suggests it's the case for most of them, with the difference
shrinking on smaller and faster devices. On a traditional hard disk,
dup mode is significantly more efficient, but that's to be expected,
since raid1 across two partitions on the same spindle forces a seek
for every mirrored write.
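For clarity, the two configurations were set up along these lines
(device names are examples; note that older btrfs-progs may refuse
dup for data on a single device):

    # Single partition, duplicating data and metadata within it:
    mkfs.btrfs -d dup -m dup /dev/sdb1

    # Two partitions on the same device, mirrored:
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb1 /dev/sdb2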
4. Depending on other factors, compression can actually slow you
down pretty significantly. In the particular case where I saw this
(all cores fully utilized by userspace software), LZO compression
caused around 5-10% performance degradation compared to no
compression, since the compression work competes with the
application for CPU time. This is somewhat obvious once it's
explained, but it's not exactly intuitive, so it's probably worth
documenting in the man pages that compression won't always make
things better. I may send a patch for that at some point in the near
future.
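For anyone who wants to check this against their own workload,
toggling compression is just a mount option (paths are examples):

    # With LZO compression:
    mount -o compress=lzo /dev/sdb1 /srv/storage

    # Without compression (the default), for comparison:
    mount /dev/sdb1 /srv/storage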