Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-24 Thread Chris Murphy

On Jul 22, 2014, at 11:13 AM, Chris Murphy li...@colorremedies.com wrote:
 
 It's been a while since I did a rebuild on HDDs, 

So I did this yesterday and the day before with an SSD and an HDD in raid1, and 
made the HDD do the rebuild. 


Baseline for this hard drive:
hdparm -t
35.68 MB/sec

dd if=/dev/zero of=/dev/rdisk2s1 bs=256k
13508091392 bytes transferred in 521.244920 secs (25915056 bytes/sec)

I don't know why hdparm gets such good reads while dd writes are 75% of that, 
but the 26MB/s write speed is realistic (this is a Firewire 400 external 
device) and what I typically get with long sequential writes. It's probably 
interface limited to mode S200 rather than a drive limitation, since on a SATA 
Rev 2 or 3 interface I get 100+MB/s transfers.
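
For reference, roughly how I'd re-measure that baseline on Linux (a sketch; 
/dev/sdX is a placeholder, and the dd line overwrites the device):

# sequential read baseline
hdparm -t /dev/sdX
# sustained write speed; oflag=direct bypasses the page cache
# (WARNING: /dev/sdX is a placeholder -- this overwrites the device)
dd if=/dev/zero of=/dev/sdX bs=1M count=4096 oflag=direct conv=fsync status=progress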

During the rebuild, iotop reports actual write averaging in the 24MB/s range, 
and the total data to restore divided by total time for the replace command 
comes out to 23MB/s. The source data is a Fedora 21 install with no meaningful 
user data (cache files and such), so mostly a bunch of libraries, programs, and 
documentation. Therefore it's not exclusively small files, yet the iotop rate 
was very stable throughout the 4 minute rebuild.
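
For anyone who wants to watch a rebuild the same way, a minimal sketch (the 
mount point is a placeholder):

# progress of a running replace
btrfs replace status /mnt
# per-task I/O rates; -o limits output to tasks actually doing I/O
iotop -o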

So I still think 5MB/s for a SATA-connected (?) drive is unexpected.

 
Chris Murphy


Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-22 Thread TM
Wang Shilong wangsl.fnst at cn.fujitsu.com writes:


 The latest btrfs-progs includes a man page for btrfs-replace. Actually, you 
 could use it something like:
 
 btrfs replace start <srcdev>|<devid> <targetdev> <mnt>
 
 You could use 'btrfs filesystem show' to see the missing device id, and then 
 run btrfs replace.
 

Hi Wang,

  I physically removed the drive before the rebuild; having a failing device
as a source is not a good idea anyway.
  Without the device in place, the device name is not showing up, since the
missing device is not under /dev/sdXX or anything else. 

  That is why I asked if the special parameter 'missing' may be used in a
replace. I can't say if it is supported. But I guess not, since I found no
documentation on this matter.

  So I guess replace is not aimed at fault tolerance / rebuilding. It's just
a convenient way to, let's say, replace the disks with larger disks to extend
your array. A convenience tool, not an emergency tool.

TM

 Thanks,
 Wang


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-22 Thread Stefan Behrens
On Tue, 22 Jul 2014 14:43:45 + (UTC), TM wrote:
 Wang Shilong wangsl.fnst at cn.fujitsu.com writes:
 
 
 The latest btrfs-progs includes a man page for btrfs-replace. Actually, you 
 could use it something like:

 btrfs replace start <srcdev>|<devid> <targetdev> <mnt>

 You could use 'btrfs filesystem show' to see the missing device id, and then 
 run btrfs replace.

 
 Hi Wang,
 
   I physically removed the drive before the rebuild, having a failing device
 as a source is not a good idea anyway.
   Without the device in place, the device name is not showing up, since the
 missing device is not under /dev/sdXX or anything else. 
 
   That is why I asked if the special parameter 'missing' may be used in a
 replace. I can't say if it is supported. But I guess not, since I found no
 documentation on this matter.
 
   So I guess replace is not aimed at fault tolerance / rebuilding. It's just
 a convenient way to, let's say, replace the disks with larger disks to extend
 your array. A convenience tool, not an emergency tool.

TM, just read the man page. You could have used the replace tool after
physically removing the failing device.

Quoting the man page:
If the source device is not available anymore, or if the -r option is
set, the data is built only using the RAID redundancy mechanisms.

Options
-r   only read from srcdev if no other zero-defect mirror
 exists (enable this if your drive has lots of read errors,
 the access would be very slow)


Concerning the rebuild performance: the access to the disk is linear for
both reading and writing. I measured above 75 MByte/s at the time with
regular 7200 RPM disks, which would be less than 10 hours to replace a
3TB disk (in the worst case, if it is completely filled up).
Unused/unallocated areas are skipped, which additionally improves the
rebuild speed.

For missing disks, unfortunately the command invocation does not use the
term missing; you pass the numerical device-id instead of the device name.
missing _is_ implemented in the kernel part of the replace code, but
was simply forgotten in the user-mode part, or at least it was forgotten in
the man page.
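
As a sketch of what that looks like in practice (the devid, device name and
mount point here are made-up examples, not taken from TM's setup):

# note the devid reported for the missing device
btrfs filesystem show /mnt
# rebuild onto the new disk, passing that devid (example: 3) as the source
btrfs replace start 3 /dev/sdd /mnt
btrfs replace status /mnt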




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-22 Thread Chris Murphy

On Jul 21, 2014, at 8:51 PM, Duncan 1i5t5.dun...@cox.net wrote:

 
 It does not matter at all what the average file size is.
 
 … and the filesize /does/ matter.

I'm not sure how. A rebuild is replicating chunks, not doing the equivalent of 
cp or rsync on files. Copying chunks (or strips of chunks in the case of 
raid10) should be a rather sequential operation. So I'm not sure where the 
random write behavior would come from that could drop the write performance to 
~5MB/s on drives that can read/write ~100MB/s.

 Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
 when doing a raid rebuild.
 
 ... And perfectly reasonable, at least at this point, to expect ~5 MiB/
 sec total thruput, one spindle at a time, for btrfs.

It's been a while since I did a rebuild on HDDs, but on SSDs the rebuilds have 
maxed out the replacement drive. Obviously the significant difference is 
rotational latency. If everyone with spinning disks and many small files is 
getting 5MB/s rebuilds, that points to a rotational latency penalty, but I'm 
just not sure where it would be coming from. Random IO would incur the effect 
of rotational latency, but the rebuild shouldn't be random IO, rather 
sequential.


Chris Murphy



Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-22 Thread TM
Stefan Behrens sbehrens at giantdisaster.de writes:


 TM, Just read the man-page. You could have used the replace tool after
 physically removing the failing device.
 
 Quoting the man page:
 If the source device is not available anymore, or if the -r option is
 set, the data is built only using the RAID redundancy mechanisms.
 
 Options
 -r   only read from srcdev if no other zero-defect mirror
  exists (enable this if your drive has lots of read errors,
  the access would be very slow)
 
 Concerning the rebuild performance, the access to the disk is linear for
 both reading and writing, I measured above 75 MByte/s at that time with
 regular 7200 RPM disks, which would be less than 10 hours to replace a
 3TB disk (in worst case, if it is completely filled up).
 Unused/unallocated areas are skipped and additionally improve the
 rebuild speed.
 
 For missing disks, unfortunately the command invocation is not using the
 term missing but the numerical device-id instead of the device name.
 missing _is_ implemented in the kernel part of the replace code, but
 was simply forgotten in the user mode part, at least it was forgotten in
 the man page.
 

Hi Stefan,
thank you very much for the comprehensive info; I will opt to use replace
next time.

Breaking news :-) 
from Jul 19 14:41:36 microserver kernel: [ 1134.244007] btrfs: relocating
block group 8974430633984 flags 68
to  Jul 22 16:54:54 microserver kernel: [268419.463433] btrfs: relocating
block group 2991474081792 flags 65

The rebuild ended before counting all the way down. So flight time was 3
days, and I see no more messages or btrfs processes using CPU, so the rebuild
seems done.
Just a few hours ago another disk showed some early trouble, accumulating
Current_Pending_Sector but no Reallocated_Sector_Ct yet.
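
A quick way to keep an eye on it (a sketch, assuming smartmontools is
installed; /dev/sdX stands for the suspect disk):

# /dev/sdX is a placeholder for the disk showing pending sectors
smartctl -A /dev/sdX | grep -E 'Current_Pending_Sector|Reallocated_Sector_Ct|Offline_Uncorrectable'
# or run a long self-test and check the result later
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX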


TM



Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-21 Thread TM
Wang Shilong wangsl.fnst at cn.fujitsu.com writes:

 Just my two cents:
 
 Since 'btrfs replace' supports RAID10, I suppose using the replace
 operation is better than 'device removal and add'.
 
 Another question is related to btrfs snapshot-aware balance.
 How many snapshots did you have in your system?
 
 Of course, during balance/resize/device removal operations,
 you could still snapshot, but fewer snapshots should speed things up!
 
 Anyway 'btrfs replace' is implemented more efficiently than
 'device removal and add'.
 


Hi Wang,
just one subvolume, no snapshots or anything else.

device replace: to tell you the truth, I have not used it in the past. Most
of my testing was done 2 years ago, so on this 'kind of production' system I
did not try it. But if I had known that it was faster, perhaps I could have
used it. Does anyone have statistics for such a replace and the time it takes?

Also, can replace be used when one device is missing? I can't find
documentation, e.g.:
btrfs replace start missing /dev/sdXX


TM




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-21 Thread ronnie sahlberg
On Sun, Jul 20, 2014 at 7:48 PM, Duncan 1i5t5.dun...@cox.net wrote:
 ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

 If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
 an 8.3ms rotational latency (half a rotation), an average 64kb write and
 a 100MB/S streaming write speed, each write comes in at ~21ms, which
 gives us ~47 IOPS.  With the 64KB write size, this comes out to ~3MB/S,
 DISK LIMITED.

 The 5MB/S that TM is seeing is fine, considering the small files he says
 he has.

 Thanks for the additional numbers supporting my point. =:^)

 I had run some of the numbers but not to the extent you just did, so I
 didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
 range of expectation for spinning rust, given the current state of
 optimization... or more accurately the lack thereof, due to the focus
 still being on features.


That is actually nonsense.
Raid rebuild operates on the block/stripe layer and not on the filesystem layer.
It does not matter at all what the average file size is.

Raid rebuild is really only limited by disk i/o speed when performing
a linear read of the whole spindle using huge i/o sizes,
or, if you have multiple spindles on the same bus, the bus saturation speed.

Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
when doing a raid rebuild.
That is for the naive rebuild that rebuilds every single stripe. A
smarter rebuild that knows which stripes are unused can skip the
unused stripes and thus become even faster than that.


Now, that the rebuild is off by an order of magnitude is by design, but it
should be fixed at some stage; with the current state of btrfs it is probably
better to focus on other, more urgent areas first.


Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-21 Thread Chris Murphy

On Jul 21, 2014, at 10:46 AM, ronnie sahlberg ronniesahlb...@gmail.com wrote:

 On Sun, Jul 20, 2014 at 7:48 PM, Duncan 1i5t5.dun...@cox.net wrote:
 ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
 
 If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
 an 8.3ms rotational latency (half a rotation), an average 64kb write and
 a 100MB/S streaming write speed, each write comes in at ~21ms, which
 gives us ~47 IOPS.  With the 64KB write size, this comes out to ~3MB/S,
 DISK LIMITED.
 
 The 5MB/S that TM is seeing is fine, considering the small files he says
 he has.
 
 Thanks for the additional numbers supporting my point. =:^)
 
 I had run some of the numbers but not to the extent you just did, so I
 didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
 range of expectation for spinning rust, given the current state of
 optimization... or more accurately the lack thereof, due to the focus
 still being on features.
 
 
 That is actually nonsense.
 Raid rebuild operates on the block/stripe layer and not on the filesystem 
 layer.

Not on Btrfs. It is on the filesystem layer. However, a rebuild is about 
replicating metadata (up to 256MB) and data (up to 1GB) chunks. For raid10, 
those are further broken down into 64KB strips. So the smallest size unit for 
replication during a rebuild on Btrfs would be 64KB.

Anyway 5MB/s seems really low to me, so I'm suspicious something else is going 
on. I haven't done a rebuild in a couple months, but my recollection is it's 
always been as fast as the write performance of a single device in the btrfs 
volume.

I'd be looking in dmesg for any of the physical drives being reset or having 
read or write errors, and I'd do some individual drive testing to see if the 
problem can be isolated. And if that's not helpful, well, capturing the actual 
commands going to the physical devices is really tedious and produces verbose 
amounts of information, but it might reveal some issue:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34886.html
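
Roughly (a sketch, with /dev/sdb standing in for one of the array members; not
necessarily the exact procedure described in the post above):

# look for link resets and read/write errors on the members
dmesg | grep -iE 'reset|i/o error|ata[0-9]'
# trace the commands actually hitting one device during the rebuild
# (requires blktrace; debugfs must be mounted)
blktrace -d /dev/sdb -o - | blkparse -i -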

My expectation (i.e. I'm guessing) based on previous testing is that whether 
raid1 or raid10, the actual read/write commands will each be 256KB in size. 
Btrfs rebuild is basically designed to be a sequential operation. This could 
maybe fall apart if there were somehow many minimally full chunks, which is 
probably unlikely.

Chris Murphy



Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-21 Thread Wang Shilong

On 07/21/2014 10:00 PM, TM wrote:

Wang Shilong wangsl.fnst at cn.fujitsu.com writes:


Just my two cents:

Since 'btrfs replace' supports RAID10, I suppose using the replace
operation is better than 'device removal and add'.

Another question is related to btrfs snapshot-aware balance.
How many snapshots did you have in your system?

Of course, during balance/resize/device removal operations,
you could still snapshot, but fewer snapshots should speed things up!

Anyway 'btrfs replace' is implemented more efficiently than
'device removal and add'.



Hi Wang,
just one subvolume, no snapshots or anything else.

device replace: to tell you the truth, I have not used it in the past. Most
of my testing was done 2 years ago, so on this 'kind of production' system I
did not try it. But if I had known that it was faster, perhaps I could have
used it. Does anyone have statistics for such a replace and the time it takes?

I don't have specific statistics about this. The conclusion comes from
implementation differences between replace and 'device removal'.




Also, can replace be used when one device is missing? I can't find
documentation, e.g.:
btrfs replace start missing /dev/sdXX


TM




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-21 Thread Wang Shilong

On 07/21/2014 10:00 PM, TM wrote:

Wang Shilong wangsl.fnst at cn.fujitsu.com writes:


Just my two cents:

Since 'btrfs replace' supports RAID10, I suppose using the replace
operation is better than 'device removal and add'.

Another question is related to btrfs snapshot-aware balance.
How many snapshots did you have in your system?

Of course, during balance/resize/device removal operations,
you could still snapshot, but fewer snapshots should speed things up!

Anyway 'btrfs replace' is implemented more efficiently than
'device removal and add'.



Hi Wang,
just one subvolume, no snapshots or anything else.

device replace: to tell you the truth, I have not used it in the past. Most
of my testing was done 2 years ago, so on this 'kind of production' system I
did not try it. But if I had known that it was faster, perhaps I could have
used it. Does anyone have statistics for such a replace and the time it takes?

I don't have specific statistics about this. The conclusion comes from
implementation differences between replace and 'device removal'.



Also, can replace be used when one device is missing? I can't find
documentation, e.g.:
btrfs replace start missing /dev/sdXX
The latest btrfs-progs includes a man page for btrfs-replace. Actually, you 
could use it something like:

btrfs replace start <srcdev>|<devid> <targetdev> <mnt>

You could use 'btrfs filesystem show' to see the missing device id, and then 
run btrfs replace.


Thanks,
Wang



TM




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-21 Thread Duncan
ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

 On Sun, Jul 20, 2014 at 7:48 PM, Duncan 1i5t5.dun...@cox.net wrote:
 ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

 If you assume a 12ms average seek time (normal for 7200RPM SATA
 drives), an 8.3ms rotational latency (half a rotation), an average
 64kb write and a 100MB/S streaming write speed, each write comes in
 at ~21ms, which gives us ~47 IOPS.  With the 64KB write size, this
 comes out to ~3MB/S, DISK LIMITED.

 The 5MB/S that TM is seeing is fine, considering the small files he
 says he has.

 That is actually nonsense.
 Raid rebuild operates on the block/stripe layer and not on the
 filesystem layer.

If we were talking about a normal raid, yes.  But we're talking about 
btrFS, note the FS for filesystem, so indeed it *IS* the filesystem 
layer.  Now this particular filesystem /does/ happen to have raid 
properties as well, but it's definitely filesystem level...

 It does not matter at all what the average file size is.

... and the filesize /does/ matter.

 Raid rebuild is really only limited by disk i/o speed when performing a
 linear read of the whole spindle using huge i/o sizes,
 or, if you have multiple spindles on the same bus, the bus saturation
 speed.

Makes sense... if you're dealing at the raid level, as with dmraid or 
mdraid.  They're both much more mature and optimized as well, so 50 MiB/sec 
per spindle, in parallel, would indeed be a reasonable expectation for them.

But (barring bugs, which will and do happen at this stage of development) 
btrfs both makes far better data validity guarantees and does a lot more 
complex processing, what with COW and snapshotting, etc., in addition to the 
normal filesystem-level stuff AND the raid-level stuff it does.

 Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
 when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/
sec total thruput, one spindle at a time, for btrfs.

 That is for the naive rebuild that rebuilds every single stripe. A
 smarter rebuild that knows which stripes are unused can skip the unused
 stripes and thus become even faster than that.
 
 
 Now, that the rebuild is off by an order of magnitude is by design but
 should be fixed at some stage, but with the current state of btrfs it is
 probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never get to full 
streaming speed across all spindles at once.  But it can and will 
certainly get much better than it is, once the focus moves to 
optimization.  *AND*, because it /does/ know which areas of the device 
are actually in use, an optimized btrfs won't have to touch the unused 
area at all.  With the typically 20-60% unused filesystems most people 
run, it's quite likely that despite the slower raw speed, rebuild times 
will match or beat raid-layer-only technologies, which must rebuild the 
entire device because they do /not/ know which areas are unused.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Austin S Hemmelgarn
On 07/20/2014 10:00 AM, Tomasz Torcz wrote:
 On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
 TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:

 One week for a raid10 rebuild 4x3TB drives is a very long time.
 Any thoughts?
 Can you share any statistics from your RAID10 rebuilds?


 At a week, that's nearly 5 MiB per second, which isn't great, but isn't 
 entirely out of the realm of reason either, given all the processing it's 
  doing.  A day would be 33.11+ MiB/s, reasonable thruput for a straight copy, 
 and a raid rebuild is rather more complex than a straight copy, so...
 
   Uhm, sorry, but 5MBps is _entirely_ unreasonable.  It is order-of-magnitude
 unreasonable.  And all the processing shouldn't even show as a blip
 on modern CPUs.
   This speed is indefensible.
 
I wholly agree that it's indefensible, but I can tell you why it is so
slow. It's not 'all the processing' (which is maybe a few hundred
instructions on x86 for each block); it's because BTRFS still serializes
writes to devices instead of queuing all of them in parallel (that is,
when there are four devices that need to be written to, it writes to each
one in sequence, waiting for the previous write to finish before
dispatching the next write).  Personally, I would love to see this
behavior improved, but I really don't have any time to work on it myself.
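
An illustrative (and not btrfs-specific) way to see the cost of that
serialization, assuming /mnt/a and /mnt/b are directories on two different
disks:

# serialized: the second write waits for the first to finish
# (/mnt/a and /mnt/b are placeholder mount points)
time sh -c 'dd if=/dev/zero of=/mnt/a/t bs=1M count=1024 oflag=direct;
            dd if=/dev/zero of=/mnt/b/t bs=1M count=1024 oflag=direct'
# parallel: both disks stream at once, so wall-clock time roughly halves
time sh -c 'dd if=/dev/zero of=/mnt/a/t bs=1M count=1024 oflag=direct &
            dd if=/dev/zero of=/mnt/b/t bs=1M count=1024 oflag=direct &
            wait'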





Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Bob Marley

On 20/07/2014 10:45, TM wrote:

Hi,

I have a raid10 with 4x 3TB disks on a microserver
http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8Gb RAM

Recently one disk started to fail (smart errors), so I replaced it
Mounted as degraded, added new disk, removed old
Started yesterday
I am monitoring /var/log/messages and it seems it will take a long time
Started at about 8010631739392
And 20 hours later I am at 6910631739392
btrfs: relocating block group 6910631739392 flags 65

At this rate it will take a week to complete the raid rebuild!!!

Furthermore it seems that the operation is getting slower and slower
When the rebuild started I had a new message every half a minute; now it's
getting to one and a half minutes
Most files are small files like flac/jpeg



Hi TM, are you doing other significant filesystem activity during this 
rebuild, especially random accesses?

This can reduce performance a lot on HDDs.
E.g. if you were doing strenuous multithreaded random writes in the 
meantime, I could expect even less than 5MB/sec overall...




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Roman Mamedov
On Sun, 20 Jul 2014 21:15:31 +0200
Bob Marley bobmar...@shiftmail.org wrote:

 Hi TM, are you doing other significant filesystem activity during this 
 rebuild, especially random accesses?
 This can reduce performance a lot on HDDs.
 E.g. if you were doing strenuous multithreaded random writes in the 
 meantime, I could expect even less than 5MB/sec overall...

I believe the problem here might be that a Btrfs rebuild *is* a strenuous
random read (+ random-ish write) just by itself.

Mdadm-based RAID would rebuild the array reading/writing disks in a completely
linear manner, and it would finish an order of magnitude faster.

-- 
With respect,
Roman




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread ashford
This is the cause for the slow reconstruct.

 I believe the problem here might be that a Btrfs rebuild *is* a strenuous
 random read (+ random-ish write) just by itself.

If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
an 8.3ms rotational latency (half a rotation), an average 64kb write and a
100MB/S streaming write speed, each write comes in at ~21ms, which gives
us ~47 IOPS.  With the 64KB write size, this comes out to ~3MB/S, DISK
LIMITED.
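
A back-of-envelope check of that arithmetic (nothing beyond the numbers
above):

# 12 ms seek + 8.3 ms rotational latency + 64 KiB at 100 MB/s per write
echo 'scale=6; t=0.012+0.0083+(64*1024)/(100*1000*1000); 1/t; (1/t)*64/1024' | bc -l
# prints ~47.7 (IOPS) and ~2.98 (MB/s)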

The on-disk cache helps a bit during the startup, but once the cache is
full, it's back to writes at disk speed, with some small gains if the
on-disk controller can schedule the writes efficiently.

Based on the single-threaded I/O that BTRFS uses during a reconstruct, I
expect that the average write size is somewhere around 200KB.
Multi-threading the reconstruct disk I/O (possibly adding look-ahead)
would double the reconstruct speed for this array, but that's not a
trivial task.

The 5MB/S that TM is seeing is fine, considering the small files he says
he has.

Peter Ashford



Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread George Mitchell

On 07/20/2014 02:28 PM, Bob Marley wrote:

On 20/07/2014 21:36, Roman Mamedov wrote:

On Sun, 20 Jul 2014 21:15:31 +0200
Bob Marley bobmar...@shiftmail.org wrote:


Hi TM, are you doing other significant filesystem activity during this
rebuild, especially random accesses?
This can reduce performance a lot on HDDs.
E.g. if you were doing strenuous multithreaded random writes in the
meantime, I could expect even less than 5MB/sec overall...

I believe the problem here might be that a Btrfs rebuild *is* a strenuous
random read (+ random-ish write) just by itself.

Mdadm-based RAID would rebuild the array reading/writing disks in a
completely linear manner, and it would finish an order of magnitude faster.


Now this explains a lot!
So they would just need to be sorted?
Sorting the files of a disk from lowest to highest block number prior 
to starting reconstruction seems feasible. Maybe not all of them 
together, because there will be millions of them, but sorting them in 
chunks of 1000 files would still produce a very significant speedup!



As I understand the problem, it has to do with where btrfs is in the 
overall development process.  There are a LOT of opportunities for 
optimization, but optimization cannot begin until btrfs is feature 
complete, because any work done beforehand would be wasted effort in 
that it would likely have to be repeated after being broken by feature 
enhancements.  So now it is a waiting game for completion of all the 
major features (like additional RAID levels and possible n-way options, 
etc) before optimization efforts can begin.  Once that happens we will 
likely see HUGE gains in efficiency and speed, but until then we are 
kind of stuck in this position where it works but leaves something to 
be desired.  I think this is one reason developers often caution users 
not to expect too much from btrfs at this point.  It's just not there 
yet, and it will still be some time before it is.



Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Wang Shilong

Hi,

On 07/20/2014 04:45 PM, TM wrote:

Hi,

I have a raid10 with 4x 3TB disks on a microserver
http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8Gb RAM

Recently one disk started to fail (smart errors), so I replaced it
Mounted as degraded, added new disk, removed old
Started yesterday
I am monitoring /var/log/messages and it seems it will take a long time
Started at about 8010631739392
And 20 hours later I am at 6910631739392
btrfs: relocating block group 6910631739392 flags 65

At this rate it will take a week to complete the raid rebuild!!!

Just my two cents:

Since 'btrfs replace' supports RAID10, I suppose using the replace
operation is better than 'device removal and add'.

Another question is related to btrfs snapshot-aware balance.
How many snapshots did you have in your system?

Of course, during balance/resize/device removal operations,
you could still snapshot, but fewer snapshots should speed things up!

Anyway 'btrfs replace' is implemented more efficiently than
'device removal and add'. :-)

Thanks,
Wang


Furthermore it seems that the operation is getting slower and slower
When the rebuild started I had a new message every half a minute; now it's
getting to one and a half minutes
Most files are small files like flac/jpeg

One week for a raid10 rebuild 4x3TB drives is a very long time.
Any thoughts?
Can you share any statistics from your RAID10 rebuilds?

If I shut down the system before the rebuild finishes, what is the proper procedure
to remount it? Again degraded? Or normally? Can the process of rebuilding
the raid continue after a reboot? Will it survive, and continue rebuilding?

Thanks in advance
TM




Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Duncan
ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

 If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
 an 8.3ms rotational latency (half a rotation), an average 64kb write and
 a 100MB/S streaming write speed, each write comes in at ~21ms, which
 gives us ~47 IOPS.  With the 64KB write size, this comes out to ~3MB/S,
 DISK LIMITED.

 The 5MB/S that TM is seeing is fine, considering the small files he says
 he has.

Thanks for the additional numbers supporting my point. =:^)

I had run some of the numbers but not to the extent you just did, so I 
didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the 
range of expectation for spinning rust, given the current state of 
optimization... or more accurately the lack thereof, due to the focus 
still being on features.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
