Dmitry Katsubo posted on Sun, 18 Oct 2015 11:44:08 +0200 as excerpted: [Regarding the btrfs raid1 "device-with-the-most-space" chunk-allocation strategy.]
> I think the mentioned strategy (fill in the device with most free space)
> is not most effective. If the data is spread equally, the read
> performance would be higher (reading from 3 disks instead of 2). In my
> case this is even crucial, because the smallest drive is SSD (and it is
> not loaded at all).
>
> Maybe I don't see the benefit from the strategy which is currently
> implemented (besides that it is robust and well-tested)?

Two comments:

1) As Hugo alluded to, in striped modes (raid0/5/6 and, I believe, raid10), the chunk allocator goes wide. It allocates a chunk from each device with free space, then stripes across them at some smaller granularity (64 KiB, maybe?). When the smallest device is full, it reduces the width by one and continues allocating, down to the minimum stripe width for the raid type.

However, raid1 and single allocate on the device with the most free space first. Particularly for raid1, this ensures maximum usage of the available space. Were raid1 to allocate width-first, capacity would be far lower and much more of the largest device would remain unusable. Some chunk pairs would be allocated entirely on the smaller devices, so less of the largest device would be used before the smaller devices filled up. At that point no more raid1 chunks could be allocated: only the single largest device would have free space left, and raid1 requires allocation on two separate devices.

In the three-device raid1 case, the difference in usable capacity would be 1/3 the capacity of the smallest device. Until that device is full, 1/3 of all allocations would go to the two smaller devices, leaving that much more space unusable on the largest device.

So there's a reason for most-space-first. It forces one chunk of each pair-allocation onto the largest device, distributing space as efficiently as possible and leaving as little space as possible unusable because only one device is left when pair-allocation is required.
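To make the capacity argument concrete, here is a small Python simulation. It's only a sketch under simplifying assumptions, not btrfs code: free space is counted in whole chunks, each raid1 allocation takes one chunk from each of two different devices, and the function names are hypothetical. `most_space_first` models the raid1 strategy described above; `round_robin_pairs` models one possible width-first alternative that simply cycles through every device pair in a fixed order.

```python
from itertools import combinations


def most_space_first(free):
    """Count raid1 chunk pairs allocated when each allocation picks the two
    devices with the most free space (ties go to the lower device index)."""
    free = list(free)
    pairs = 0
    while True:
        # Device indices ordered by free space, largest first (stable sort).
        order = sorted(range(len(free)), key=lambda i: -free[i])
        if len(order) < 2 or free[order[1]] == 0:
            return pairs  # fewer than two devices still have space
        free[order[0]] -= 1
        free[order[1]] -= 1
        pairs += 1


def round_robin_pairs(free):
    """Hypothetical width-first alternative: cycle through all device pairs
    in a fixed order, skipping any pair where either device is full."""
    free = list(free)
    all_pairs = list(combinations(range(len(free)), 2))
    pairs = 0
    progress = True
    while progress:
        progress = False
        for i, j in all_pairs:
            if free[i] > 0 and free[j] > 0:
                free[i] -= 1
                free[j] -= 1
                pairs += 1
                progress = True
    return pairs


sizes = [300, 600, 1200]  # free chunks on three devices of unequal size
print(most_space_first(sizes), round_robin_pairs(sizes))  # prints: 900 750
```

With 300, 600 and 1200 free chunks, most-space-first allocates 900 pairs, pairing every chunk on the two smaller devices with one on the largest. The round-robin variant stops at 750, because its (0, 1) pairs use up the smaller devices without touching the largest. The exact size of the gap depends on how the width-first alternative is defined, so treat these numbers as illustrative. With equal-size devices (e.g. 2, 2, 2) both strategies give the same result.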
2) There has been talk of a more flexible chunk allocator with an admin-specified strategy, allowing, for instance, smart use of hybrid SSD/disk filesystems. Perhaps the metadata could go on the SSDs, since btrfs metadata is relatively hot: in addition to the traditional metadata, it contains the checksums, which btrfs of course verifies on read. However, this sort of thing is likely to be some time off, as it's a lower priority than various other possible features. Unfortunately, given the rate of btrfs development, "some time off" is in practice likely to be at least five years out.

In the meantime, there are technologies such as bcache that allow hybrid caching of "hot" data. They are designed to present themselves as virtual block devices, so btrfs, as well as other filesystems, can layer on top. In fact, we have some regular users who actually have btrfs on top of bcache deployed, and from reports, it now works quite well. (There were some problems in the past, but they're several years back now, well before the last couple of LTS kernel series, which are the oldest recommended for btrfs deployment.) If you're interested, start a new thread with btrfs on bcache in the subject line, and you'll likely get some very useful replies. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman