Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Jul 22, 2014, at 11:13 AM, Chris Murphy wrote:
> It's been a while since I did a rebuild on HDDs,

So I did this yesterday and the day before with an SSD and HDD in raid1, and made the HDD do the rebuild. Baseline for this hard drive:

hdparm -t
 35.68 MB/sec

dd if=/dev/zero of=/dev/rdisk2s1 bs=256k
 13508091392 bytes transferred in 521.244920 secs (25915056 bytes/sec)

I don't know why hdparm gets such good reads while dd writes are only 75% of that, but the 26 MB/s write speed is realistic (this is a FireWire 400 external device) and what I typically get with long sequential writes. It's probable this is interface limited to mode S200, not a drive limitation, since on a SATA Rev 2 or 3 interface I get 100+ MB/s transfers.

During the rebuild, iotop reports actual writes averaging in the 24 MB/s range, and the total data restored divided by the total time for the replace command comes out to 23 MB/s. The source data is a Fedora 21 install with no meaningful user data (cache files and such), so mostly a bunch of libraries, programs, and documentation. Therefore it's not exclusively small files, yet the iotop rate was very stable throughout the 4 minute rebuild.

So I still think 5 MB/s for a SATA connected (?) drive is unexpected.

Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
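A quick sanity check of the dd baseline quoted above (nothing here beyond the numbers Chris reports): bytes over seconds should reproduce the ~26 MB/s figure, which works out to roughly three quarters of the hdparm read rate.

```python
# Sanity-check the dd baseline quoted above: bytes / seconds should
# reproduce the ~26 MB/s figure dd printed, then compare against hdparm.
dd_bytes = 13508091392
dd_secs = 521.244920
write_bps = dd_bytes / dd_secs          # bytes per second
write_mbps = write_bps / 1e6            # decimal MB/s, as dd reports

hdparm_read_mbps = 35.68                # hdparm -t result, MB/s
ratio = write_mbps / hdparm_read_mbps   # writes as a fraction of reads

print(f"write: {write_mbps:.1f} MB/s, {ratio:.0%} of the hdparm read rate")
# → write: 25.9 MB/s, 73% of the hdparm read rate
```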
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Stefan Behrens <...@giantdisaster.de> writes:
> TM, just read the man-page. You could have used the replace tool after
> physically removing the failing device.
>
> Quoting the man page:
> "If the source device is not available anymore, or if the -r option is
> set, the data is built only using the RAID redundancy mechanisms.
>
> Options
> -r only read from <srcdev> if no other zero-defect mirror
> exists (enable this if your drive has lots of read errors,
> the access would be very slow)"
>
> Concerning the rebuild performance, the access to the disk is linear for
> both reading and writing. I measured above 75 MByte/s at that time with
> regular 7200 RPM disks, which would be less than 10 hours to replace a
> 3TB disk (in the worst case, if it is completely filled up).
> Unused/unallocated areas are skipped and additionally improve the
> rebuild speed.
>
> For missing disks, unfortunately the command invocation is not using the
> term "missing" but the numerical device-id instead of the device name.
> "missing" _is_ implemented in the kernel part of the replace code, but
> was simply forgotten in the user mode part, at least it was forgotten in
> the man page.

Hi Stefan,

thank you very much for the comprehensive info, I will opt to use replace next time.

Breaking news :-)

from
Jul 19 14:41:36 microserver kernel: [ 1134.244007] btrfs: relocating block group 8974430633984 flags 68
to
Jul 22 16:54:54 microserver kernel: [268419.463433] btrfs: relocating block group 2991474081792 flags 65

Rebuild ended before counting down to
So flight time was 3 days, and I see no more messages or btrfs processes utilizing CPU. So the rebuild seems ready.

Just a few hours ago another disk showed some early trouble, accumulating Current_Pending_Sector but no Reallocated_Sector_Ct yet.

TM
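The two kernel log lines TM quotes carry uptime stamps, so the wall-clock duration of the rebuild can be checked directly from them:

```python
# Elapsed rebuild time from the kernel uptime stamps in TM's log excerpt:
# [ 1134.244007] at the first relocation message quoted,
# [268419.463433] at the last one.
start_s = 1134.244007
end_s = 268419.463433

elapsed_s = end_s - start_s
elapsed_days = elapsed_s / 86400

print(f"{elapsed_s:.0f} s ≈ {elapsed_days:.1f} days")
# → 267285 s ≈ 3.1 days
```

This agrees with TM's "flight time was 3 days".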
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Jul 21, 2014, at 8:51 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> It does not matter at all what the average file size is.
>
> ... and the filesize /does/ matter.

I'm not sure how. A rebuild is replicating chunks, not doing the equivalent of cp or rsync on files. Copying chunks (or strips of chunks in the case of raid10) should be a rather sequential operation. So I'm not sure where the random write behavior would come from that could drop the write performance to ~5MB/s on drives that can read/write ~100MB/s.

>>> Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
>>> when doing a raid rebuild.
>
> ... And perfectly reasonable, at least at this point, to expect ~5 MiB/
> sec total thruput, one spindle at a time, for btrfs.

It's been a while since I did a rebuild on HDDs, but on SSDs the rebuilds have maxed out the replacement drive. Obviously the significant difference is rotational latency. If everyone with spinning disks and many small files is getting 5MB/s rebuilds, it suggests a rotational latency penalty, if that performance is to be expected. I'm just not sure where that would be coming from. Random IO would incur the effect of rotational latency, but the rebuild shouldn't be random IO, rather sequential.

Chris Murphy
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Tue, 22 Jul 2014 14:43:45 +0000 (UTC), TM wrote:
> Wang Shilong <...@cn.fujitsu.com> writes:
>
>> The latest btrfs-progs include the man page of btrfs-replace. Actually,
>> you could use it something like:
>>
>> btrfs replace start <srcdev>|<devid> <targetdev> <path>
>>
>> You could use 'btrfs file show' to see the missing device id, and then
>> run btrfs replace.
>
> Hi Wang,
>
> I physically removed the drive before the rebuild; having a failing device
> as a source is not a good idea anyway.
> Without the device in place, the device name is not showing up, since the
> missing device is not under /dev/sdXX or anything else.
>
> That is why I asked if the special parameter 'missing' may be used in a
> replace. I can't say if it is supported. But I guess not, since I found no
> documentation on this matter.
>
> So I guess replace is not aimed at fault tolerance / rebuilding. It's just
> a convenient way to, let's say, replace the disks with larger disks to
> extend your array. A convenience tool, not an emergency tool.

TM,

Just read the man-page. You could have used the replace tool after physically removing the failing device.

Quoting the man page:
"If the source device is not available anymore, or if the -r option is set, the data is built only using the RAID redundancy mechanisms.

Options
-r only read from <srcdev> if no other zero-defect mirror exists (enable this if your drive has lots of read errors, the access would be very slow)"

Concerning the rebuild performance, the access to the disk is linear for both reading and writing. I measured above 75 MByte/s at that time with regular 7200 RPM disks, which would be less than 10 hours to replace a 3TB disk (in the worst case, if it is completely filled up). Unused/unallocated areas are skipped and additionally improve the rebuild speed.

For missing disks, unfortunately the command invocation is not using the term "missing" but the numerical device-id instead of the device name. "missing" _is_ implemented in the kernel part of the replace code, but was simply forgotten in the user mode part; at least it was forgotten in the man page.
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Wang Shilong <...@cn.fujitsu.com> writes:
> The latest btrfs-progs include the man page of btrfs-replace. Actually,
> you could use it something like:
>
> btrfs replace start <srcdev>|<devid> <targetdev> <path>
>
> You could use 'btrfs file show' to see the missing device id, and then
> run btrfs replace.

Hi Wang,

I physically removed the drive before the rebuild; having a failing device as a source is not a good idea anyway. Without the device in place, the device name is not showing up, since the missing device is not under /dev/sdXX or anything else.

That is why I asked if the special parameter 'missing' may be used in a replace. I can't say if it is supported. But I guess not, since I found no documentation on this matter.

So I guess replace is not aimed at fault tolerance / rebuilding. It's just a convenient way to, let's say, replace the disks with larger disks to extend your array. A convenience tool, not an emergency tool.

TM

> Thanks,
> Wang
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA
>>> drives), an 8.3ms rotational latency (half a rotation), an average
>>> 64kb write and a 100MB/S streaming write speed, each write comes in
>>> at ~21ms, which gives us ~47 IOPS. With the 64KB write size, this
>>> comes out to ~3MB/S, DISK LIMITED.
>>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he
>>> says he has.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the
> filesystem layer.

If we were talking about a normal raid, yes. But we're talking about btrFS, note the FS for filesystem, so indeed it *IS* the filesystem layer. Now this particular "filesystem" /does/ happen to have raid properties as well, but it's definitely filesystem level...

> It does not matter at all what the average file size is.

... and the filesize /does/ matter.

> Raid rebuild is really only limited by disk i/o speed when performing a
> linear read of the whole spindle using huge i/o sizes,
> or, if you have multiple spindles on the same bus, the bus saturation
> speed.

Makes sense... if you're dealing at the raid level. If we were talking about dmraid or mdraid... and they're both much more mature and optimized, as well, so 50 MiB/sec, per spindle in parallel, would indeed be a reasonable expectation for them.

But (barring bugs, which will and do happen at this stage of development) btrfs both makes far better data validity guarantees, and does a lot more complex processing what with COW and snapshotting, etc, of course in addition to the normal filesystem level stuff AND the raid-level stuff it does.

> Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
> when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/sec total thruput, one spindle at a time, for btrfs.

> That is for the naive rebuild that rebuilds every single stripe. A
> smarter rebuild that knows which stripes are unused can skip the unused
> stripes and thus become even faster than that.
>
> Now, that the rebuild is off by an order of magnitude is by design but
> should be fixed at some stage, but with the current state of btrfs it is
> probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never get to full streaming speed across all spindles at once. But it can and will certainly get much better than it is, once the focus moves to optimization. *AND*, because it /does/ know which areas of the device are actually in use, once btrfs is optimized, it's quite likely that despite the slower raw speed, because it won't have to deal with the unused area, at least with the typically 20-60% unused filesystems most people run, rebuild times will match or be faster than raid-layer-only technologies that must rebuild the entire device, because they do /not/ know which areas are unused.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/21/2014 10:00 PM, TM wrote:
> Wang Shilong <...@cn.fujitsu.com> writes:
>> Just my two cents:
>>
>> Since 'btrfs replace' supports RAID10, I suppose using the replace
>> operation is better than 'device removal and add'.
>>
>> Another question is related to btrfs snapshot-aware balance.
>> How many snapshots did you have in your system?
>>
>> Of course, during balance/resize/device removal operations, you could
>> still snapshot, but fewer snapshots should speed things up!
>>
>> Anyway 'btrfs replace' is implemented more effectively than
>> 'device removal and add'.
>
> Hi Wang,
>
> just one subvolume, no snapshots or anything else.
>
> device replace: to tell you the truth, I have not used it in the past.
> Most of my testing was done 2 years ago, so on this 'kind of production'
> system I did not try it. But if I knew that it was faster, perhaps I
> could have used it. Does anyone have statistics for such a replace and
> the time it takes?

I don't have specific statistics about this. The conclusion comes from implementation differences between replace and 'device removal'.

> Also, can replace be used when one device is missing? Can't find
> documentation.
> eg. btrfs replace start missing /dev/sdXX

The latest btrfs-progs include the man page of btrfs-replace. Actually, you could use it something like:

btrfs replace start <srcdev>|<devid> <targetdev> <path>

You could use 'btrfs file show' to see the missing device id, and then run btrfs replace.

Thanks,
Wang

> TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Jul 21, 2014, at 10:46 AM, ronnie sahlberg wrote:
> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>>> DISK LIMITED.
>>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he says
>>> he has.
>>
>> Thanks for the additional numbers supporting my point. =:^)
>>
>> I had run some of the numbers but not to the extent you just did, so I
>> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
>> range of expectation for spinning rust, given the current state of
>> optimization... or more accurately the lack thereof, due to the focus
>> still being on features.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the filesystem
> layer.

Not on Btrfs. It is on the filesystem layer. However, a rebuild is about replicating metadata (up to 256MB) and data (up to 1GB) chunks. For raid10, those are further broken down into 64KB strips. So the smallest "unit" for replication during a rebuild on Btrfs would be 64KB.

Anyway, 5MB/s seems really low to me, so I'm suspicious something else is going on. I haven't done a rebuild in a couple of months, but my recollection is that it's always been as fast as the write performance of a single device in the btrfs volume. I'd be looking in dmesg for any of the physical drives being reset, or having read or write errors, and I'd do some individual drive testing to see if the problem can be isolated.

And if that's not helpful, well, this is a really tedious and verbose amount of information, but capturing the actual commands going to the physical devices might reveal some issue:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34886.html

My expectation (i.e. I'm guessing) based on previous testing is that whether raid1 or raid10, the actual read/write commands will each be 256KB in size. Btrfs rebuild is basically designed to be a sequential operation. This could maybe fall apart if there were somehow many minimally full chunks, which is probably unlikely.

Chris Murphy
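To put the unit sizes Chris mentions in one place (the 256 KB per-command figure is, as he says, his own guess from earlier testing):

```python
# Unit sizes mentioned for a Btrfs raid10 rebuild:
CHUNK_DATA = 1 * 2**30      # data chunks: up to 1 GiB
CHUNK_META = 256 * 2**20    # metadata chunks: up to 256 MiB
STRIP      = 64 * 2**10     # raid10 strip: 64 KiB, the smallest replication unit
IO_CMD     = 256 * 2**10    # guessed read/write command size: 256 KiB

print(CHUNK_DATA // STRIP)  # strips in one full data chunk
print(IO_CMD // STRIP)      # strips covered by a single command
```

So even if every chunk were minimally full, each command still spans several strips, which is why a rebuild should stay largely sequential.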
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>
>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>> DISK LIMITED.
>>
>> The 5MB/S that TM is seeing is fine, considering the small files he says
>> he has.
>
> Thanks for the additional numbers supporting my point. =:^)
>
> I had run some of the numbers but not to the extent you just did, so I
> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
> range of expectation for spinning rust, given the current state of
> optimization... or more accurately the lack thereof, due to the focus
> still being on features.

That is actually nonsense. Raid rebuild operates on the block/stripe layer and not on the filesystem layer. It does not matter at all what the average file size is.

Raid rebuild is really only limited by disk i/o speed when performing a linear read of the whole spindle using huge i/o sizes, or, if you have multiple spindles on the same bus, the bus saturation speed. Thus it is perfectly reasonable to expect ~50MByte/second, per spindle, when doing a raid rebuild.

That is for the naive rebuild that rebuilds every single stripe. A smarter rebuild that knows which stripes are unused can skip the unused stripes and thus become even faster than that.

Now, that the rebuild is off by an order of magnitude is by design but should be fixed at some stage; with the current state of btrfs it is probably better to focus on other more urgent areas first.
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Wang Shilong <...@cn.fujitsu.com> writes:
> Just my two cents:
>
> Since 'btrfs replace' supports RAID10, I suppose using the replace
> operation is better than 'device removal and add'.
>
> Another question is related to btrfs snapshot-aware balance.
> How many snapshots did you have in your system?
>
> Of course, during balance/resize/device removal operations,
> you could still snapshot, but fewer snapshots should speed things up!
>
> Anyway 'btrfs replace' is implemented more effectively than
> 'device removal and add'.

Hi Wang,

just one subvolume, no snapshots or anything else.

device replace: to tell you the truth, I have not used it in the past. Most of my testing was done 2 years ago, so on this 'kind of production' system I did not try it. But if I knew that it was faster, perhaps I could have used it. Does anyone have statistics for such a replace and the time it takes?

Also, can replace be used when one device is missing? Can't find documentation.
eg. btrfs replace start missing /dev/sdXX

TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
> an 8.3ms rotational latency (half a rotation), an average 64kb write and
> a 100MB/S streaming write speed, each write comes in at ~21ms, which
> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
> DISK LIMITED.
>
> The 5MB/S that TM is seeing is fine, considering the small files he says
> he has.

Thanks for the additional numbers supporting my point. =:^)

I had run some of the numbers but not to the extent you just did, so I didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the range of expectation for spinning rust, given the current state of optimization... or more accurately the lack thereof, due to the focus still being on features.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Hi,

On 07/20/2014 04:45 PM, TM wrote:
> Hi, I have a raid10 with 4x 3TB disks on a microserver
> http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8Gb RAM.
> Recently one disk started to fail (smart errors), so I replaced it.
> Mounted as degraded, added new disk, removed old. Started yesterday.
> I am monitoring /var/log/messages and it seems it will take a long time.
> Started at about 8010631739392
> And 20 hours later I am at 6910631739392
> btrfs: relocating block group 6910631739392 flags 65
> At this rate it will take a week to complete the raid rebuild!!!

Just my two cents:

Since 'btrfs replace' supports RAID10, I suppose using the replace operation is better than 'device removal and add'.

Another question is related to btrfs snapshot-aware balance. How many snapshots did you have in your system?

Of course, during balance/resize/device removal operations, you could still snapshot, but fewer snapshots should speed things up!

Anyway 'btrfs replace' is implemented more effectively than 'device removal and add'. :-)

Thanks,
Wang

> Furthermore it seems that the operation is getting slower and slower.
> When the rebuild started I had a new message every half a minute, now
> it's getting to one and a half minutes.
> Most files are small files like flac/jpeg.
> One week for a raid10 rebuild of 4x 3TB drives is a very long time.
> Any thoughts? Can you share any statistics from your RAID10 rebuilds?
> If I shut down the system before the rebuild, what is the proper
> procedure to remount it? Again degraded? Or normally?
> Can the process of rebuilding the raid continue after a reboot? Will it
> survive, and continue rebuilding?
> Thanks in advance
> TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/20/2014 02:28 PM, Bob Marley wrote:
> On 20/07/2014 21:36, Roman Mamedov wrote:
>> On Sun, 20 Jul 2014 21:15:31 +0200 Bob Marley wrote:
>>> Hi TM, are you doing other significant filesystem activity during this
>>> rebuild, especially random accesses? This can reduce performance a lot
>>> on HDDs. E.g. if you were doing strenuous multithreaded random writes
>>> in the meanwhile, I could expect even less than 5MB/sec overall...
>>
>> I believe the problem here might be that a Btrfs rebuild *is* a
>> strenuous random read (+ random-ish write) just by itself.
>> Mdadm-based RAID would rebuild the array reading/writing disks in a
>> completely linear manner, and it would finish an order of magnitude
>> faster.
>
> Now this explains a lot!
> So they would just need to be sorted? Sorting the files of a disk from
> lowest to highest block number prior to starting reconstruction seems
> feasible. Maybe not all of them together, because they will be millions,
> but sorting them in chunks of 1000 files would still produce a very
> significant speedup!

As I understand the problem, it has to do with where btrfs is in the overall development process. There are a LOT of opportunities for optimization, but optimization cannot begin until btrfs is feature complete, because any work done beforehand would be wasted effort, in that it would likely have to be repeated after being broken by feature enhancements. So now it is a waiting game for completion of all the major features (like additional RAID levels and possible n-way options, etc.) before optimization efforts can begin. Once that happens we will likely see HUGE gains in efficiency and speed, but until then we are kind of stuck in this position where it "works" but leaves somewhat to be desired.

I think this is one reason developers often caution users not to expect too much from btrfs at this point. It's just not there yet, and it will still be some time before it is.
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 20/07/2014 21:36, Roman Mamedov wrote:
> On Sun, 20 Jul 2014 21:15:31 +0200 Bob Marley wrote:
>> Hi TM, are you doing other significant filesystem activity during this
>> rebuild, especially random accesses? This can reduce performance a lot
>> on HDDs. E.g. if you were doing strenuous multithreaded random writes
>> in the meanwhile, I could expect even less than 5MB/sec overall...
>
> I believe the problem here might be that a Btrfs rebuild *is* a strenuous
> random read (+ random-ish write) just by itself.
> Mdadm-based RAID would rebuild the array reading/writing disks in a
> completely linear manner, and it would finish an order of magnitude
> faster.

Now this explains a lot!

So they would just need to be sorted? Sorting the files of a disk from lowest to highest block number prior to starting reconstruction seems feasible. Maybe not all of them together, because they will be millions, but sorting them in chunks of 1000 files would still produce a very significant speedup!
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
This is the cause of the slow reconstruct:

> I believe the problem here might be that a Btrfs rebuild *is* a strenuous
> random read (+ random-ish write) just by itself.

If you assume a 12ms average seek time (normal for 7200RPM SATA drives), an 8.3ms rotational latency (half a rotation), an average 64kb write and a 100MB/S streaming write speed, each write comes in at ~21ms, which gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S, DISK LIMITED.

The on-disk cache helps a bit during startup, but once the cache is full, it's back to writes at disk speed, with some small gains if the on-disk controller can schedule the writes efficiently. Based on the single-threaded I/O that BTRFS uses during a reconstruct, I expect that the average write size is somewhere around 200KB.

Multi-threading the reconstruct disk I/O (possibly adding look-ahead) would double the reconstruct speed for this array, but that's not a trivial task.

The 5MB/S that TM is seeing is fine, considering the small files he says he has.

Peter Ashford
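Peter's per-write cost model above is easy to reproduce; a small sketch of the same arithmetic (only his stated assumptions, with minor rounding differences from his ~47 IOPS / ~3 MB/s):

```python
# Per-write cost model for a 7200 RPM SATA drive, per Peter Ashford's figures:
seek_ms     = 12.0             # average seek time
rot_ms      = 8.3              # rotational latency (half a rotation)
write_bytes = 64 * 1024        # average 64 KB write
stream_bps  = 100 * 2**20      # 100 MB/s streaming write speed

transfer_ms = write_bytes / stream_bps * 1000   # ≈ 0.6 ms to move the data
total_ms    = seek_ms + rot_ms + transfer_ms    # ≈ 21 ms per write

iops = 1000 / total_ms                          # ≈ 47-48 IOPS
throughput_mbs = iops * write_bytes / 2**20     # ≈ 3 MB/s, disk limited
print(f"{iops:.1f} IOPS, {throughput_mbs:.1f} MB/s")
```

Note the result is dominated by seek plus rotational latency; the 0.6 ms data transfer is almost irrelevant, which is why small random writes are so punishing on spinning disks.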
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Sun, 20 Jul 2014 21:15:31 +0200 Bob Marley wrote:
> Hi TM, are you doing other significant filesystem activity during this
> rebuild, especially random accesses?
> This can reduce performance a lot on HDDs.
> E.g. if you were doing strenuous multithreaded random writes in the
> meanwhile, I could expect even less than 5MB/sec overall...

I believe the problem here might be that a Btrfs rebuild *is* a strenuous random read (+ random-ish write) just by itself.

Mdadm-based RAID would rebuild the array reading/writing disks in a completely linear manner, and it would finish an order of magnitude faster.

--
With respect,
Roman
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 20/07/2014 10:45, TM wrote:
> Hi, I have a raid10 with 4x 3TB disks on a microserver
> http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8Gb RAM.
> Recently one disk started to fail (smart errors), so I replaced it.
> Mounted as degraded, added new disk, removed old. Started yesterday.
> I am monitoring /var/log/messages and it seems it will take a long time.
> Started at about 8010631739392
> And 20 hours later I am at 6910631739392
> btrfs: relocating block group 6910631739392 flags 65
> At this rate it will take a week to complete the raid rebuild!!!
> Furthermore it seems that the operation is getting slower and slower.
> When the rebuild started I had a new message every half a minute, now
> it's getting to one and a half minutes.
> Most files are small files like flac/jpeg.

Hi TM, are you doing other significant filesystem activity during this rebuild, especially random accesses? This can reduce performance a lot on HDDs. E.g. if you were doing strenuous multithreaded random writes in the meanwhile, I could expect even less than 5MB/sec overall...
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
<...@whisperpc.com> writes:
> Finally, TM didn't mention anything about other I/O activity on the array,
> which, regardless of the method of reconstruction, could have a
> significant impact on the speed of a reconstruction.
>
> There are a LOT of parameters here that could impact throughput. Some are
> designed in (the checksum computations), some are "temporary" (the single
> I/O path) and some are end-user issues (slow CPU and other activity on the
> array). I'm sure that there are other parameters, possibly including soft
> read errors on the other disks, that could be impacting the overall
> throughput.
>
> As all the information that could affect performance hasn't been provided
> yet, it is premature to make a blanket statement that the performance of a
> reconstruction is "unreasonable". For the circumstances, it's possible
> that the performance is just fine. We have not yet been provided with
> enough data to verify this, one way or another.
>
> Peter Ashford

Hi Peter,

I am monitoring closely; I do not have any 'soft errors' or anything else. No other users are using the raid10 array, and there are no other compute/disk intensive tasks. So I am providing data - usual and typical data at least. To answer another question: yes, I do have backups.

So this, I guess, is the speed most users get, which is in my opinion very slow. And I guess that's around the speed most users will get with a similar 4x 3TB build with low power CPUs. Long recovery time leads to higher risk of another disk failure or other unexpected failures.

TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Tomasz,

> On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
>> TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:
>>
>>> One week for a raid10 rebuild 4x3TB drives is a very long time.
>>> Any thoughts?
>>> Can you share any statistics from your RAID10 rebuilds?
>>
>> At a week, that's nearly 5 MiB per second, which isn't great, but
>> isn't entirely out of the realm of reason either, given all the
>> processing it's doing. A day would be 33.11+, reasonable thruput for
>> a straight copy, and a raid rebuild is rather more complex than a
>> straight copy, so...
>
> Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is
> order-of-magnitude unreasonable. And "all the processing" shouldn't
> even show as a blip on modern CPUs.
> This "speed" is undefendable.

Maybe, and maybe not. It's already been discussed that BTRFS serializes its disk I/O instead of performing it in parallel. That can have a significant impact on overall throughput.

In addition, depending on how BTRFS does a reconstruct, it's possible that there's another problem. Most low-level RAID implementations (the MD driver and RAID controller cards) reconstruct a segment at a time (or possibly multiple segments), and reconstruct the entire disk, even the unused portions. ZFS is different, in that it reconstructs only the used sections. This behavior reads and writes less data, but it incurs more random I/O. The random I/O can result in more seeks (depending on how busy the system is), which could slow down the reconstruction, especially on an idle system. As ZFS was designed for high-usage environments, where a seek would likely be needed anyway to get the head back to the data being reconstructed, the randomness of the reconstruct has no significant impact there.

The question implied above is which method of reconstruction BTRFS uses. Could someone please inform us?
As BTRFS is verifying the checksum on the read, and possibly generating a new checksum on the write, the CPU time used will be significantly greater than for a "dumb" RAID-10 array. If it's not a sequential segment reconstruction, the CPU time needed would be greater still. WRT CPU performance, we should remember that TM's system is far from a speed demon.

Finally, TM didn't mention anything about other I/O activity on the array, which, regardless of the method of reconstruction, could have a significant impact on the speed of a reconstruction.

There are a LOT of parameters here that could impact throughput. Some are designed in (the checksum computations), some are "temporary" (the single I/O path) and some are end-user issues (slow CPU and other activity on the array). I'm sure that there are other parameters, possibly including soft read errors on the other disks, that could be impacting the overall throughput.

As all the information that could affect performance hasn't been provided yet, it is premature to make a blanket statement that the performance of a reconstruction is "unreasonable". For the circumstances, it's possible that the performance is just fine. We have not yet been provided with enough data to verify this, one way or another.

Peter Ashford
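The whole-disk vs. used-extents trade-off described above can be made concrete with a rough calculation. Every figure below (used fraction, sequential rate, seek penalty) is a hypothetical assumption chosen only to illustrate the arithmetic, not a measurement from TM's system:

```python
# Hypothetical figures only: compare rebuilding every segment of a 3 TB
# disk against rebuilding only the allocated extents, with a penalty for
# the extra seeking that used-extent reconstruction implies.
DISK_BYTES = 3 * 1000**4   # 3 TB drive (vendor, power-of-10 TB)
USED_FRACTION = 0.4        # assume 40% of the disk is allocated
SEQ_RATE = 100 * 2**20     # assume ~100 MiB/s sustained sequential I/O
SEEK_PENALTY = 0.5         # assume seeking halves effective throughput

full_hours = DISK_BYTES / SEQ_RATE / 3600
used_hours = DISK_BYTES * USED_FRACTION / (SEQ_RATE * SEEK_PENALTY) / 3600

print(f"whole-disk rebuild: {full_hours:.1f} h")
print(f"used-only rebuild:  {used_hours:.1f} h")
```

With these particular assumptions the used-only rebuild still wins, but the break-even point moves with the used fraction and the seek penalty, which is exactly why "which method is faster" depends on how full and how busy the array is.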
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/20/2014 10:00 AM, Tomasz Torcz wrote:
> On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
>> TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:
>>
>>> One week for a raid10 rebuild 4x3TB drives is a very long time.
>>> Any thoughts?
>>> Can you share any statistics from your RAID10 rebuilds?
>>
>> At a week, that's nearly 5 MiB per second, which isn't great, but
>> isn't entirely out of the realm of reason either, given all the
>> processing it's doing. A day would be 33.11+, reasonable thruput for
>> a straight copy, and a raid rebuild is rather more complex than a
>> straight copy, so...
>
> Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is
> order-of-magnitude unreasonable. And "all the processing" shouldn't
> even show as a blip on modern CPUs.
> This "speed" is undefendable.

I wholly agree that it's undefendable, but I can tell you why it is so slow. It's not 'all the processing' (which is maybe a few hundred instructions on x86 for each block); it's because BTRFS still serializes writes to devices instead of queuing all of them in parallel (that is, when there are four devices that need to be written to, it writes to each one in sequence, waiting for the previous write to finish before dispatching the next one). Personally, I would love to see this behavior improved, but I really don't have any time to work on it myself.
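The serialization described above is easy to demonstrate in miniature. The sketch below is not btrfs code; it simulates four "device writes" with sleeps to show why dispatching them one at a time costs roughly N times the latency of dispatching them concurrently:

```python
# Illustrative sketch (not btrfs code): serialized vs. parallel dispatch
# of writes across N devices. Each "device write" is a sleep standing in
# for blocking I/O latency.
import time
from concurrent.futures import ThreadPoolExecutor

DEVICES = 4
WRITE_LATENCY = 0.05  # seconds per simulated write

def write_to_device(dev):
    time.sleep(WRITE_LATENCY)  # stand-in for a blocking disk write
    return dev

# Serialized: dispatch the next write only after the previous completes.
t0 = time.monotonic()
for dev in range(DEVICES):
    write_to_device(dev)
serial = time.monotonic() - t0

# Parallel: queue all writes at once; the "devices" work concurrently.
t0 = time.monotonic()
with ThreadPoolExecutor(max_workers=DEVICES) as pool:
    list(pool.map(write_to_device, range(DEVICES)))
parallel = time.monotonic() - t0

print(f"serial:   {serial:.2f}s")
print(f"parallel: {parallel:.2f}s")
```

The ~4x gap matches the intuition in the thread that parallelizing could give a 2-4X speedup on a 4-device raid10, modulo real-world bus and controller contention.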
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
> TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:
>
>> One week for a raid10 rebuild 4x3TB drives is a very long time.
>> Any thoughts?
>> Can you share any statistics from your RAID10 rebuilds?
>
> At a week, that's nearly 5 MiB per second, which isn't great, but
> isn't entirely out of the realm of reason either, given all the
> processing it's doing. A day would be 33.11+, reasonable thruput for a
> straight copy, and a raid rebuild is rather more complex than a
> straight copy, so...

Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is order-of-magnitude unreasonable. And "all the processing" shouldn't even show as a blip on modern CPUs. This "speed" is undefendable.

--
Tomasz Torcz              RIP is irrevelant. Spoofing is futile.
xmpp: zdzich...@chrome.pl Your routes will be aggreggated. -- Alex Yuriev
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:

> One week for a raid10 rebuild 4x3TB drives is a very long time.
> Any thoughts?
> Can you share any statistics from your RAID10 rebuilds?

Well, 3 TB is big and spinning rust is slow. Even using the smaller power-of-10 (1000) figures for TB that the device manufacturers use, instead of the power-of-2 (1024, TiB) figures that are common in computing:

  3 * 1000^4 bytes / 1024^2 (bytes per MiB) / (7 days * 24 hr * 3600 sec) = 4.73+ MiB/sec

At a week, that's nearly 5 MiB per second, which isn't great, but isn't entirely out of the realm of reason either, given all the processing it's doing. A day would be 33.11+, reasonable thruput for a straight copy, and a raid rebuild is rather more complex than a straight copy, so...

Which is one reason a lot of people are using partitioning to break those huge numbers down into something a bit more manageable in reasonable time these days, or switching to much faster, if also much more expensive per GiB (USD 50 cents to $1 per gig vs 5-10 cents per gig), SSDs.

And btrfs is still under development so hasn't been really optimized yet and is thus slower than necessary -- in particular, it often serializes multi-device processing where, given that the bottleneck is normally device IO, an optimized algorithm would parallel-process all devices at once. Just parallelizing the algorithm could give it a 2-4X speed increase on a 4-device raid10. So you're right, it /is/ slow.

> If I shut down the system, before the rebuild, what is the proper
> procedure to remount it? Again degraded? Or normally? Can the process
> of rebuilding the raid continue after a reboot? Will it survive, and
> continue rebuilding?

Raid10 requires four devices for an undegraded layout, but I /think/ once it has a fourth device added back in, you should be able to mount it undegraded, as it can write changes to four devices at that point. Tho I'm not positive about that.
I'd try mounting it undegraded here, and if it worked, great; if not, I'd mount it degraded again.

Regardless of that, however, barring bugs, provided you shut down properly (unmounting the filesystem, etc.), a shutdown and reboot should be fine, and it should continue where it left off after the reboot, as internally it's simply doing a rebalance of existing data to include the new device, and btrfs is designed to gracefully shut down in the middle of a rebalance and restart it on reboot when necessary.

Tho don't expect umount and shutdown to be instantaneous. After you issue the umount command, it shouldn't start any new chunk balances, but it could require a bit to finish the balances in flight at the time you issued the shutdown. If it takes more than a few minutes, however, there's a bug. FWIW, data chunks are a GiB in size, which at the calculated rate of a bit under 5 MiB/sec should be ~205 seconds, or roughly 3.5 minutes. Doubling that and a bit more to be safe, I'd say wait 10 minutes or so. If it hasn't properly unmounted after 10 minutes, you likely have a bug and may have to recover after a reboot.

With btrfs still under heavy development, backups are STRONGLY recommended, so I hope you have them, but at this point anyway, while it's slow going, there's no indication that you'll actually need to use those backups. Just expect the umount to take a bit.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
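Duncan's back-of-the-envelope figures above are easy to reproduce; the short calculation below simply restates his arithmetic (3 TB in vendor power-of-10 bytes, throughput expressed in power-of-2 MiB/s):

```python
# Restating the arithmetic from the post: 3 TB moved in one week vs. in
# one day, expressed in MiB/s.
TB = 1000**4   # vendors count bytes in powers of 10
MiB = 1024**2  # throughput is conventionally quoted in powers of 2

week_rate = 3 * TB / MiB / (7 * 24 * 3600)
day_rate = 3 * TB / MiB / (24 * 3600)

print(f"one week: {week_rate:.2f} MiB/s")  # the 4.73+ figure in the post
print(f"one day:  {day_rate:.2f} MiB/s")   # the 33.11+ figure in the post
```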
1 week to rebuid 4x 3TB raid10 is a long time!
Hi,

I have a raid10 with 4x 3TB disks on a microserver (http://n40l.wikia.com/wiki/Base_Hardware_N54L), 8GB RAM.

Recently one disk started to fail (SMART errors), so I replaced it: mounted as degraded, added the new disk, removed the old one. That started yesterday. I am monitoring /var/log/messages and it seems it will take a long time. It started at about 8010631739392, and 20 hours later I am at 6910631739392:

  btrfs: relocating block group 6910631739392 flags 65

At this rate it will take a week to complete the raid rebuild! Furthermore, the operation seems to be getting slower and slower: when the rebuild started I had a new message every half a minute, now it's up to one and a half minutes. Most files are small files like flac/jpeg.

One week for a raid10 rebuild of 4x 3TB drives is a very long time.
Any thoughts?
Can you share any statistics from your RAID10 rebuilds?

If I shut down the system, before the rebuild, what is the proper procedure to remount it? Again degraded? Or normally? Can the process of rebuilding the raid continue after a reboot? Will it survive, and continue rebuilding?

Thanks in advance
TM
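TM's by-hand extrapolation can be scripted. The sketch below is a hypothetical helper, not an existing tool: it takes the two block-group addresses from the log (which count down toward zero) and the elapsed time, and extrapolates a completion estimate. Note that address-space progress overstates bytes actually written, since skipped unallocated regions advance the address too:

```python
# Hypothetical ETA estimate from the two log samples in the post.
# "btrfs: relocating block group N" addresses count down, so
# progress = start_address - current_address.
start_bg = 8_010_631_739_392    # address when the rebuild began
current_bg = 6_910_631_739_392  # address 20 hours later
elapsed_s = 20 * 3600

covered = start_bg - current_bg       # address space covered so far
rate = covered / elapsed_s            # bytes of address space per second
eta_days = current_bg / rate / 86400  # remaining span at that rate

print(f"progress rate: {rate / 2**20:.1f} MiB/s of address space")
print(f"estimated remaining: {eta_days:.1f} days")
```

With TM's numbers this comes out to roughly five more days, consistent with the week-long estimate in the thread; the address-space rate being higher than the ~5 MiB/s write figure may simply reflect unallocated space being skipped.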