Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On Wed, 27 Jul 2016 18:25:48 +0200 Goffredo Baroncelli wrote:

> I am not able to understand this sentence: to the best of my knowledge,
> in btrfs the RAID5/RAID6 stripe is composed of several sub-stripes (I am
> not sure about the terminology to adopt); the number of sub-stripes is
> equal to the number of disks.
>
> Until now, in btrfs the size of a sub-stripe has been fixed at 64k, so the
> size of a stripe is equal to 64k * <number of disks>. So for raid5 the
> minimum stripe size is 192k, for raid6 it is 256k.
>
> Why are you writing that the real stripe size is 4KiB (maybe you are
> referring to the page size?).

No problem with going over the details one more time.

What I called, and what was agreed should be called, stripe size in the email originally sent by Chris Murphy (link below) is actually how a single block of data is laid out on disk. This number is a component of the stripe element size (which you called the sub-stripe). It has nothing to do with how DIFFERENT but concurrent stripes (which you defined as 64k * <number of disks>) of data are laid out on disk. Their only relation is that they follow the same order when they are read; they are otherwise unrelated to each other.

The correct way to look at a stripe is as follows (with its current value before this patch in brackets):

Stripe Element Size (64 KiB) = Stripe Size (1024 B) * Number of elements per stripe element (64)

For the striping code, the stripe element size matters; for the metadata, the stripe size matters. The (64k * <number of disks>) figure is how concurrent stripes of data are distributed across the RAID disks. The order is only important when performing an I/O operation.

Reference email: https://www.spinics.net/lists/linux-btrfs/msg57471.html

Hope this makes it simpler.

Sanidhya
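[Editor's note: for readers following the terminology, a minimal C sketch of the relation stated above (stripe element size = stripe size * number of elements per element), using the pre-patch values quoted in this message. The macro names are illustrative only and are not btrfs identifiers.]

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: these names are not btrfs code. */
#define STRIPE_SIZE          1024u  /* bytes per block layout unit, pre-patch value quoted above */
#define ELEMENTS_PER_ELEMENT 64u    /* blocks per stripe element */

int main(void)
{
	uint32_t element_size = STRIPE_SIZE * ELEMENTS_PER_ELEMENT;

	/* 1024 B * 64 = 65536 B = 64 KiB, the stripe element (sub-stripe) size */
	printf("stripe element size: %u KiB\n", element_size / 1024);
	return 0;
}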
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
Hi Sanidhya,

On 2016-07-27 08:12, Sanidhya Solanki wrote:

> The reason for this limit is the fact that, as I noted above, the real
> stripe size is currently 4KiB, with an element size of 64KiB.

I am not able to understand this sentence: to the best of my knowledge, in btrfs the RAID5/RAID6 stripe is composed of several sub-stripes (I am not sure about the terminology to adopt); the number of sub-stripes is equal to the number of disks.

Until now, in btrfs the size of a sub-stripe has been fixed at 64k, so the size of a stripe is equal to 64k * <number of disks>. So for raid5 the minimum stripe size is 192k, for raid6 it is 256k.

Why are you writing that the real stripe size is 4KiB (maybe you are referring to the page size?).

I am quite sure that the problem is the terminology. Could you be so kind as to explain what you mean?

Thanks in advance.

BR
G.Baroncelli

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
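[Editor's note: a quick check of the arithmetic above, as a small C sketch (illustrative names only, not btrfs code): the full stripe under this definition is the sub-stripe size multiplied by the number of disks.]

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: full stripe = sub-stripe size * number of disks,
 * using the 64k sub-stripe btrfs currently uses. */
static uint32_t full_stripe_bytes(uint32_t sub_stripe, uint32_t num_disks)
{
	return sub_stripe * num_disks;
}

int main(void)
{
	const uint32_t sub_stripe = 64 * 1024;

	/* Minimum raid5 is 3 disks -> 192k; minimum raid6 is 4 disks -> 256k. */
	printf("raid5 minimum stripe: %u KiB\n", full_stripe_bytes(sub_stripe, 3) / 1024);
	printf("raid6 minimum stripe: %u KiB\n", full_stripe_bytes(sub_stripe, 4) / 1024);
	return 0;
}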
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On Tue, 26 Jul 2016 11:14:37 -0600 Chris Murphy wrote:

> On Fri, Jul 22, 2016 at 8:58 AM, Austin S. Hemmelgarn wrote:
> > On 2016-07-22 09:42, Sanidhya Solanki wrote:
> > > +*stripesize=*;;
> > > +Specifies the new stripe size
>
> It'd be nice to stop conflating stripe size and stripe element size as
> if they're the same thing. I realize that LVM gets this wrong also,
> and uses stripes to mean "data strips", and stripesize for stripe
> element size. From a user perspective I find the inconsistency
> annoying; users are always confused about these terms.
>
> So I think we need to pay the piper now, and use either strip size or
> stripe element size for this. Stripe size is the data portion of a
> full stripe read or write across all devices in the array. So right
> now with a 64KiB stripe element size on Btrfs, the stripe size for a 4
> disk raid0 is 256KiB, and the stripe size for a 4 disk raid 5 is
> 192KiB.

I absolutely agree with the statement regarding the difference between those two separate settings. This difference was more clearly visible pre-Dec 2015, when it was removed for code appearance reasons by commit ee22184b53c823f6956314c2815d4068e3820737 (at the end of the commit). I will update the documentation in the next patch to make it clear that the balance option affects the stripe size directly and the stripe element size indirectly.

> It's 64KiB right now. Why go so much smaller?
>
> mdadm goes from 4KiB to GiB's, with a 512KiB default.
>
> lvm goes from 4KiB to the physical extent size, which can be GiB's.
>
> I'm OK with an upper limit that's sane, maybe 16MiB? Hundreds of MiB's
> or even GiB's seems a bit far fetched but other RAID tools on Linux
> permit that.

The reason for this limit is the fact that, as I noted above, the real stripe size is currently 4KiB, with an element size of 64KiB. Ostensibly, we can change the stripe size to any 512B multiple that is less than 64KiB. Increasing it beyond 64KiB is risky because a lot of calculations (only the basis of which I modified for this patch, not the dependencies of those algorithms and calculations) rely on the stripe element size being 64KiB. I do not want to increase this limit as it may lead to undiscovered bugs in the already buggy RAID 5/6 code.

If this patch is accepted, I intend in the next few patches to do the following:

- Increase the maximum stripe size to 64KiB, by reducing the number of blocks per stripe extent to 1.
- Update the documentation to notify users of this change and the need for caution, as well as trial and error, to find an appropriate size up to 64KiB, with a warning to only change it if users understand the consequences and reasons for the change, as suggested by ASH.
- Clean up the RAID 5/6 recovery code and stripe code over the coming months.
- Clean up the code that relies on calculations that depend on stripe size, and their dependencies.
- Remove this stripe size and stripe element size limitation completely, as suggested by both ASH and CMu.

Just waiting on reviews and acceptance for this patch as the basis of the above work. I started on the RAID recovery code yesterday.

It also appears, according to the commit I cited above, that the stripe size used to be 1KiB, with 64 elements per stripe element, but was changed in Dec 2015. So maybe, as long as you do not change the stripe size to more than 64KiB, you do not need to balance after using this balance option (at least the first time).
I do not remember seeing any bug reports on the mailing list since then that called out the stripe size as the problem. Interesting.
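[Editor's note: for context, a minimal sketch of the input checks implied by the limits in the proposed documentation (a multiple of 512 bytes, greater than 512 bytes, at most the current 16 KiB cap). The helper name is hypothetical and is not part of btrfs-progs or this patch.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SZ_512B  512u
#define SZ_16KIB (16u * 1024u)

/* Hypothetical helper: mirrors the limits stated in the proposed
 * btrfs-balance documentation, nothing more. */
static bool stripesize_is_valid(uint32_t size)
{
	return size > SZ_512B &&        /* must be greater than 512 bytes  */
	       size % SZ_512B == 0 &&   /* must be a multiple of 512 bytes */
	       size <= SZ_16KIB;        /* current upper cap in the patch  */
}

int main(void)
{
	printf("512:   %s\n", stripesize_is_valid(512) ? "ok" : "rejected");
	printf("4096:  %s\n", stripesize_is_valid(4096) ? "ok" : "rejected");
	printf("20480: %s\n", stripesize_is_valid(20480) ? "ok" : "rejected");
	return 0;
}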
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On 2016-07-26 13:14, Chris Murphy wrote:
> On Fri, Jul 22, 2016 at 8:58 AM, Austin S. Hemmelgarn wrote:
> > On 2016-07-22 09:42, Sanidhya Solanki wrote:
> > > +*stripesize=*;;
> > > +Specifies the new stripe size
>
> It'd be nice to stop conflating stripe size and stripe element size as
> if they're the same thing. I realize that LVM gets this wrong also,
> and uses stripes to mean "data strips", and stripesize for stripe
> element size. From a user perspective I find the inconsistency
> annoying; users are always confused about these terms.
>
> So I think we need to pay the piper now, and use either strip size or
> stripe element size for this. Stripe size is the data portion of a
> full stripe read or write across all devices in the array. So right
> now with a 64KiB stripe element size on Btrfs, the stripe size for a 4
> disk raid0 is 256KiB, and the stripe size for a 4 disk raid 5 is
> 192KiB.
>
> > > +for a filesystem instance. Multiple BTrFS
> > > +filesystems mounted in parallel with varying stripe size are supported, the only
> > > +limitation being that the stripe size provided to balance in this option must
> > > +be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
> > > +16 KiBytes.
>
> It's 64KiB right now. Why go so much smaller?
>
> mdadm goes from 4KiB to GiB's, with a 512KiB default.
>
> lvm goes from 4KiB to the physical extent size, which can be GiB's.
>
> I'm OK with an upper limit that's sane, maybe 16MiB? Hundreds of MiB's
> or even GiB's seems a bit far fetched but other RAID tools on Linux
> permit that.

16M makes sense as an upper limit to me. In practice, I've never heard of anyone using a stripe element size larger than that with LVM, and it's twice the largest erase block size I've seen on any consumer flash device (and the optimal situation on a flash drive or SSD for device life is a stripe element size equal to your erase block size). To be honest, I think most of the reason that LVM allows that insanity of multi-GB stripe element sizes is just that they didn't care to put an upper limit on it.

> > I'm actually somewhat curious to see numbers for sizes larger than 16k.
> > In most cases, that probably will be either higher or lower than the
> > point at which performance starts suffering. On a set of fast SSD's,
> > that's almost certainly lower than the turnover point (I can't give an
> > opinion on BTRFS, but for DM-RAID, the point at which performance starts
> > degrading significantly is actually 64k on the SSD's I use), while on a
> > set of traditional hard drives, it may be as low as 4k (yes, I have
> > actually seen systems where this is the case). I think that we should
> > warn about sizes larger than 16k, not refuse to use them, especially
> > because the point of optimal performance will shift when we get proper
> > I/O parallelization. Or, better yet, warn about changing this at all,
> > and assume that if the user continues they know what they're doing.
>
> OK well maybe someone wants to inform the mdadm and LVM folks that
> their defaults are awfully large for SSD's. It's been quite a long
> time both were using 64KiB to no ill effect on hard drives, and maybe
> 5 years ago that mdadm moved to a 512KiB default.

LVM's default works fine on all the SSD's I've got, and mdadm has been seeing a decline in new usage for a while now, so I doubt either is an issue. In either case, people who actually care about performance are likely to be tuning it themselves instead of using the defaults anyway.

> I think allowing the user to specify 512 byte strip sizes is a bad idea.
> This will increase read-modify-write by the drive firmware on all modern
> HDD's that now use 4096 byte physical sectors, and SSDs with page sizes
> of 16KiB or greater being common. Ideally we'd have a way of knowing the
> page size of the drive and setting that as the minimum, rather than a
> hard coded minimum.

Setting the minimum to 4k would seem reasonable to me. The only situations I've seen where it actually makes sense to go smaller than that are when dealing with huge numbers of small files on old 512b sector disks that don't support command queuing, on systems which can't cache anything. The number of such systems is declining to begin with, and the number that could reasonably run BTRFS given other performance constraints is probably zero.
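[Editor's note: to make the read-modify-write concern concrete, a small illustrative C sketch (not drive firmware or kernel code): a write avoids a firmware read-modify-write cycle only when both its offset and length are multiples of the physical sector size.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative: a write avoids firmware read-modify-write only if both its
 * offset and length are multiples of the physical sector size. */
static bool avoids_rmw(uint64_t offset, uint64_t len, uint32_t phys_sector)
{
	return (offset % phys_sector == 0) && (len % phys_sector == 0);
}

int main(void)
{
	/* A 512-byte strip on a 4096-byte-sector drive is not sector aligned. */
	printf("512B strip on 4Kn drive: %s\n",
	       avoids_rmw(512, 512, 4096) ? "no RMW" : "RMW");
	/* A 4 KiB strip starting on a 4 KiB boundary is fine. */
	printf("4KiB strip on 4Kn drive: %s\n",
	       avoids_rmw(4096, 4096, 4096) ? "no RMW" : "RMW");
	return 0;
}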
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On Fri, Jul 22, 2016 at 8:58 AM, Austin S. Hemmelgarn wrote:
> On 2016-07-22 09:42, Sanidhya Solanki wrote:
>> +*stripesize=*;;
>> +Specifies the new stripe size

It'd be nice to stop conflating stripe size and stripe element size as if they're the same thing. I realize that LVM gets this wrong also, and uses stripes to mean "data strips", and stripesize for stripe element size. From a user perspective I find the inconsistency annoying; users are always confused about these terms.

So I think we need to pay the piper now, and use either strip size or stripe element size for this. Stripe size is the data portion of a full stripe read or write across all devices in the array. So right now with a 64KiB stripe element size on Btrfs, the stripe size for a 4 disk raid0 is 256KiB, and the stripe size for a 4 disk raid 5 is 192KiB.

>> +for a filesystem instance. Multiple BTrFS
>> +filesystems mounted in parallel with varying stripe size are supported, the only
>> +limitation being that the stripe size provided to balance in this option must
>> +be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
>> +16 KiBytes.

It's 64KiB right now. Why go so much smaller?

mdadm goes from 4KiB to GiB's, with a 512KiB default.

lvm goes from 4KiB to the physical extent size, which can be GiB's.

I'm OK with an upper limit that's sane, maybe 16MiB? Hundreds of MiB's or even GiB's seems a bit far fetched but other RAID tools on Linux permit that.

> I'm actually somewhat curious to see numbers for sizes larger than 16k. In
> most cases, that probably will be either higher or lower than the point at
> which performance starts suffering. On a set of fast SSD's, that's almost
> certainly lower than the turnover point (I can't give an opinion on BTRFS,
> but for DM-RAID, the point at which performance starts degrading
> significantly is actually 64k on the SSD's I use), while on a set of
> traditional hard drives, it may be as low as 4k (yes, I have actually seen
> systems where this is the case). I think that we should warn about sizes
> larger than 16k, not refuse to use them, especially because the point of
> optimal performance will shift when we get proper I/O parallelization. Or,
> better yet, warn about changing this at all, and assume that if the user
> continues they know what they're doing.

OK well maybe someone wants to inform the mdadm and LVM folks that their defaults are awfully large for SSD's. It's been quite a long time both were using 64KiB to no ill effect on hard drives, and maybe 5 years ago that mdadm moved to a 512KiB default.

I think allowing the user to specify 512 byte strip sizes is a bad idea. This will increase read-modify-write by the drive firmware on all modern HDD's that now use 4096 byte physical sectors, and SSDs with page sizes of 16KiB or greater being common. Ideally we'd have a way of knowing the page size of the drive and setting that as the minimum, rather than a hard coded minimum.

--
Chris Murphy
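[Editor's note: as an aside on the "knowing the page size of the drive" point, the logical and physical sector sizes of a block device can already be queried from user space with the BLKSSZGET and BLKPBSZGET ioctls. A hedged sketch follows; treating the reported physical sector size as the floor for the strip size is an assumption for illustration, not something the patch implements.]

#include <fcntl.h>
#include <linux/fs.h>     /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int lbs = 0;           /* logical sector size  */
	unsigned int pbs = 0;  /* physical block size  */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <block-device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, BLKSSZGET, &lbs) || ioctl(fd, BLKPBSZGET, &pbs)) {
		perror("ioctl");
		close(fd);
		return 1;
	}
	/* Assumed policy: never allow a strip smaller than the physical sector. */
	printf("logical %d, physical %u -> minimum strip %u bytes\n", lbs, pbs, pbs);
	close(fd);
	return 0;
}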
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On 2016-07-22 12:06, Sanidhya Solanki wrote:
> On Fri, 22 Jul 2016 10:58:59 -0400 "Austin S. Hemmelgarn" wrote:
> > On 2016-07-22 09:42, Sanidhya Solanki wrote:
> > > +*stripesize=*;;
> > > +Specifies the new stripe size for a filesystem instance. Multiple BTrFS
> > > +filesystems mounted in parallel with varying stripe size are supported, the only
> > > +limitation being that the stripe size provided to balance in this option must
> > > +be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
> > > +16 KiBytes. These limitations exist in the user's best interest, due to sizes too
> > > +large or too small leading to performance degradations on modern devices.
> > > +
> > > +It is recommended that the user try various sizes to find one that best suits the
> > > +performance requirements of the system. This option renders the RAID instance as
> > > +incompatible with previous kernel versions, due to the basis for this operation
> > > +being implemented through FS metadata.
> > > +
> >
> > I'm actually somewhat curious to see numbers for sizes larger than 16k.
> > In most cases, that probably will be either higher or lower than the
> > point at which performance starts suffering. On a set of fast SSD's,
> > that's almost certainly lower than the turnover point (I can't give an
> > opinion on BTRFS, but for DM-RAID, the point at which performance starts
> > degrading significantly is actually 64k on the SSD's I use), while on a
> > set of traditional hard drives, it may be as low as 4k (yes, I have
> > actually seen systems where this is the case). I think that we should
> > warn about sizes larger than 16k, not refuse to use them, especially
> > because the point of optimal performance will shift when we get proper
> > I/O parallelization. Or, better yet, warn about changing this at all,
> > and assume that if the user continues they know what they're doing.
>
> I agree with you from a limited point of view. Your considerations are
> relevant for a more broad, but general, set of circumstances. My
> consideration is the worst case scenario, particularly on SSDs, where,
> say, you pick 8KiB or 16KiB, write out all your data, then delete a
> block, which will have to be read-erase-written on a multi-page level,
> usually 4KiB in size.

I don't know what SSD's you've been looking at, but the erase block size on all of the modern NAND MLC based SSD's I've seen is between 1 and 8 megabytes, so it would lead to at most a single erase block being rewritten. Even most of the NAND SLC based SSD's I've seen have at least a 64k erase block. Overall, the only case where this is reasonably going to lead to a multi-page rewrite is if the filesystem isn't properly aligned, which is not a likely situation for most people.

> On HDDs, this will make the problem of fragmenting even worse. On HDDs,
> I would only recommend setting the stripe block size to the block level
> (usually 4KiB native, 512B emulated), but this is just me focusing on
> the worst case scenario.

And yet, software RAID implementations do fine with larger stripe sizes. On my home server, I'm using BTRFS in RAID1 mode on top of LVM managed DM-RAID0 volumes, and I have actually gone through testing every power of 2 stripe size in this configuration for the DM-RAID volumes from 1k up to 64k. I get peak performance using a 16k stripe size, and the performance actually falls off faster at lower sizes than it does at higher ones (at least, within the range I checked). I've seen similar results on all the server systems I manage for work as well, so it's not just consumer hard drives that behave like this.

> Maybe I will add these warnings in a follow-on patch, if others agree
> with these statements and concerns.

The other part of my issue with this, which I forgot to state, is that two types of people are likely to use this feature:

1. Those who actually care about performance and are willing to test multiple configurations to find an optimal one.
2. Those who claim to care about performance, but either just twiddle things randomly or blindly follow advice from others without really knowing for certain what they're doing.

The only people settings like this actually help to a reasonable degree are in the first group. Putting an upper limit on the stripe size caters to protecting the second group (who shouldn't be using this to begin with) at the expense of the first group. This doesn't affect data safety (or at least, it shouldn't); it only impacts performance, and the system is still usable even if this is set poorly, so the value of trying to make it resistant to stupid users is not all that great. Additionally, unless you have numbers to back up 16k being the practical maximum on most devices, it's really just an arbitrary number, which is something that should be avoided in management tools.
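[Editor's note: for readers keeping the terminology straight, a simplified C sketch of the RAID0-style mapping that the stripe element size controls, i.e. which member device and device offset a logical byte lands on, which is why the element size affects how an I/O is spread across disks. This is illustrative only; btrfs's real chunk mapping lives in volumes.c and is considerably more involved.]

#include <stdint.h>
#include <stdio.h>

struct mapping {
	uint32_t disk;        /* which member device        */
	uint64_t disk_offset; /* byte offset on that device */
};

/* Simplified RAID0 mapping: logical space is cut into strips of
 * element_size bytes laid out round-robin across num_disks devices. */
static struct mapping map_raid0(uint64_t logical, uint32_t element_size,
				uint32_t num_disks)
{
	uint64_t strip_nr  = logical / element_size;
	uint64_t strip_off = logical % element_size;
	struct mapping m = {
		.disk        = (uint32_t)(strip_nr % num_disks),
		.disk_offset = (strip_nr / num_disks) * element_size + strip_off,
	};
	return m;
}

int main(void)
{
	/* 4 disks, 64 KiB elements: logical 200 KiB falls in strip 3 -> disk 3. */
	struct mapping m = map_raid0(200 * 1024, 64 * 1024, 4);
	printf("disk %u, offset %llu\n", m.disk, (unsigned long long)m.disk_offset);
	return 0;
}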
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On Fri, 22 Jul 2016 10:58:59 -0400 "Austin S. Hemmelgarn" wrote:
> On 2016-07-22 09:42, Sanidhya Solanki wrote:
> > +*stripesize=*;;
> > +Specifies the new stripe size for a filesystem instance. Multiple BTrFS
> > +filesystems mounted in parallel with varying stripe size are supported, the only
> > +limitation being that the stripe size provided to balance in this option must
> > +be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
> > +16 KiBytes. These limitations exist in the user's best interest, due to sizes too
> > +large or too small leading to performance degradations on modern devices.
> > +
> > +It is recommended that the user try various sizes to find one that best suits the
> > +performance requirements of the system. This option renders the RAID instance as
> > +incompatible with previous kernel versions, due to the basis for this operation
> > +being implemented through FS metadata.
> > +
>
> I'm actually somewhat curious to see numbers for sizes larger than 16k.
> In most cases, that probably will be either higher or lower than the
> point at which performance starts suffering. On a set of fast SSD's,
> that's almost certainly lower than the turnover point (I can't give an
> opinion on BTRFS, but for DM-RAID, the point at which performance starts
> degrading significantly is actually 64k on the SSD's I use), while on a
> set of traditional hard drives, it may be as low as 4k (yes, I have
> actually seen systems where this is the case). I think that we should
> warn about sizes larger than 16k, not refuse to use them, especially
> because the point of optimal performance will shift when we get proper
> I/O parallelization. Or, better yet, warn about changing this at all,
> and assume that if the user continues they know what they're doing.

I agree with you from a limited point of view. Your considerations are relevant for a more broad, but general, set of circumstances. My consideration is the worst case scenario, particularly on SSDs, where, say, you pick 8KiB or 16KiB, write out all your data, then delete a block, which will have to be read-erase-written on a multi-page level, usually 4KiB in size. On HDDs, this will make the problem of fragmenting even worse. On HDDs, I would only recommend setting the stripe block size to the block level (usually 4KiB native, 512B emulated), but this is just me focusing on the worst case scenario.

Maybe I will add these warnings in a follow-on patch, if others agree with these statements and concerns.

Thanks
Sanidhya
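[Editor's note: a tiny C sketch of the worst-case arithmetic being described here: how many flash pages a single strip rewrite touches, assuming the 4 KiB page size mentioned above. The function is illustrative, not a model of actual SSD firmware behaviour.]

#include <stdint.h>
#include <stdio.h>

/* Pages touched when rewriting one strip that starts at `offset`:
 * the span [offset, offset + strip) rounded out to page boundaries. */
static uint64_t pages_touched(uint64_t offset, uint32_t strip, uint32_t page)
{
	uint64_t first = offset / page;
	uint64_t last  = (offset + strip - 1) / page;
	return last - first + 1;
}

int main(void)
{
	const uint32_t page = 4096;  /* 4 KiB page, as assumed in the thread */

	/* An unaligned 8 KiB strip can spill into three pages. */
	printf("8 KiB strip at +2 KiB: %llu pages\n",
	       (unsigned long long)pages_touched(2048, 8192, page));
	/* An aligned 4 KiB strip touches exactly one page. */
	printf("4 KiB strip aligned:   %llu pages\n",
	       (unsigned long long)pages_touched(0, 4096, page));
	return 0;
}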
Re: [PATCH] btrfs-progs: Make RAID stripesize configurable
On 2016-07-22 09:42, Sanidhya Solanki wrote:

Adds the user-space component of making the RAID stripesize user
configurable. Updates the btrfs documentation to provide the
information to users. Adds parsing capabilities for the new options.
Adds the means of transferring the data to kernel space. Updates the
kernel ioctl interface to account for the new options. Updates the
user-space component of RAID stripesize management. Updates the TODO
list for future tasks.

Patch applies to the v4.6.1 release branch.

Signed-off-by: Sanidhya Solanki
---
 Documentation/btrfs-balance.asciidoc | 14 +
 btrfs-convert.c                      | 59 +++-
 btrfs-image.c                        |  4 ++-
 btrfsck.h                            |  2 +-
 chunk-recover.c                      |  8 +++--
 cmds-balance.c                       | 45 +--
 cmds-check.c                         |  4 ++-
 disk-io.c                            | 10 --
 extent-tree.c                        |  4 ++-
 ioctl.h                              | 10 --
 mkfs.c                               | 18 +++
 raid6.c                              |  3 ++
 utils.c                              |  4 ++-
 volumes.c                            | 18 ---
 volumes.h                            | 12 +---
 15 files changed, 170 insertions(+), 45 deletions(-)

diff --git a/Documentation/btrfs-balance.asciidoc b/Documentation/btrfs-balance.asciidoc
index 7df40b9..fd61523 100644
--- a/Documentation/btrfs-balance.asciidoc
+++ b/Documentation/btrfs-balance.asciidoc
@@ -32,6 +32,7 @@ The filters can be used to perform following actions:
 - convert block group profiles (filter 'convert')
 - make block group usage more compact (filter 'usage')
 - perform actions only on a given device (filters 'devid', 'drange')
+- perform an operation that changes the stripe size for a RAID instance

 The filters can be applied to a combination of block group types (data,
 metadata, system). Note that changing 'system' needs the force option.
@@ -157,6 +158,19 @@ is a range specified as 'start..end'. Makes sense for block group profiles that
 utilize striping, ie. RAID0/10/5/6. The range minimum and maximum are
 inclusive.

+*stripesize=*;;
+Specifies the new stripe size for a filesystem instance. Multiple BTrFS
+filesystems mounted in parallel with varying stripe size are supported, the only
+limitation being that the stripe size provided to balance in this option must
+be a multiple of 512 bytes, and greater than 512 bytes, but not larger than
+16 KiBytes. These limitations exist in the user's best interest, due to sizes too
+large or too small leading to performance degradations on modern devices.
+
+It is recommended that the user try various sizes to find one that best suits the
+performance requirements of the system. This option renders the RAID instance as
+incompatible with previous kernel versions, due to the basis for this operation
+being implemented through FS metadata.
+

I'm actually somewhat curious to see numbers for sizes larger than 16k.
In most cases, that probably will be either higher or lower than the
point at which performance starts suffering. On a set of fast SSD's,
that's almost certainly lower than the turnover point (I can't give an
opinion on BTRFS, but for DM-RAID, the point at which performance starts
degrading significantly is actually 64k on the SSD's I use), while on a
set of traditional hard drives, it may be as low as 4k (yes, I have
actually seen systems where this is the case). I think that we should
warn about sizes larger than 16k, not refuse to use them, especially
because the point of optimal performance will shift when we get proper
I/O parallelization. Or, better yet, warn about changing this at all,
and assume that if the user continues they know what they're doing.

 *soft*::
 Takes no parameters. Only has meaning when converting between profiles.
 When doing convert from one profile to another and soft mode is on,

diff --git a/btrfs-convert.c b/btrfs-convert.c
index b18de59..dc796d0 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -278,12 +278,14 @@ static int intersect_with_sb(u64 bytenr, u64 num_bytes)
 {
 	int i;
 	u64 offset;
+	extern u32 sz_stripe;
+	extern u32 stripe_width;

 	for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
 		offset = btrfs_sb_offset(i);
-		offset &= ~((u64)BTRFS_STRIPE_LEN - 1);
+		offset &= ~((u64)((sz_stripe) * (stripe_width)) - 1);

-		if (bytenr < offset + BTRFS_STRIPE_LEN &&
+		if (bytenr < offset + ((sz_stripe) * (stripe_width)) &&
 		    bytenr + num_bytes > offset)
 			return 1;
 	}
@@ -603,6 +605,8 @@ static int block_iterate_proc(u64 disk_block, u64 file_block,
 	int ret = 0;
 	int sb_region;
 	int do_barrier;
+	extern u32 sz_stripe;
+	extern u