Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Jul 22, 2014, at 11:13 AM, Chris Murphy wrote:
> It's been a while since I did a rebuild on HDDs,

So I did this yesterday and the day before with an SSD and HDD in raid1, and made the HDD do the rebuild. Baseline for this hard drive:

hdparm -t
 35.68 MB/sec

dd if=/dev/zero of=/dev/rdisk2s1 bs=256k
 13508091392 bytes transferred in 521.244920 secs (25915056 bytes/sec)

I don't know why hdparm gets such good reads while dd writes are only 75% of that, but the 26 MB/s write speed is realistic (this is a FireWire 400 external device) and what I typically get with long sequential writes. It's probable this is interface limited to mode S200, not a drive limitation, since on a SATA Rev 2 or 3 interface I get 100+ MB/s transfers.

During the rebuild, iotop reports actual writes averaging in the 24 MB/s range, and the total data restored divided by the total time for the replace command comes out to 23 MB/s. The source data is a Fedora 21 install with no meaningful user data (cache files and such), so mostly a bunch of libraries, programs, and documentation. Therefore it's not exclusively small files, yet the iotop rate was very stable throughout the 4 minute rebuild.

So I still think 5 MB/s for a SATA connected (?) drive is unexpected.

Chris Murphy

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
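A quick sanity check of the dd baseline quoted above (nothing here beyond the numbers Chris reports): bytes over seconds should reproduce the ~26 MB/s figure, which works out to roughly three quarters of the hdparm read rate.

```python
# Sanity-check the dd baseline quoted above: bytes / seconds should
# reproduce the ~26 MB/s figure dd printed, then compare against hdparm.
dd_bytes = 13508091392
dd_secs = 521.244920
write_bps = dd_bytes / dd_secs          # bytes per second
write_mbps = write_bps / 1e6            # decimal MB/s, as dd reports

hdparm_read_mbps = 35.68                # hdparm -t result, MB/s
ratio = write_mbps / hdparm_read_mbps   # writes as a fraction of reads

print(f"write: {write_mbps:.1f} MB/s, {ratio:.0%} of the hdparm read rate")
# → write: 25.9 MB/s, 73% of the hdparm read rate
```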
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Stefan Behrens <...@giantdisaster.de> writes:
> TM, just read the man-page. You could have used the replace tool after
> physically removing the failing device.
>
> Quoting the man page:
> "If the source device is not available anymore, or if the -r option is
> set, the data is built only using the RAID redundancy mechanisms.
>
> Options
> -r only read from <srcdev> if no other zero-defect mirror
> exists (enable this if your drive has lots of read errors,
> the access would be very slow)"
>
> Concerning the rebuild performance, the access to the disk is linear for
> both reading and writing. I measured above 75 MByte/s at that time with
> regular 7200 RPM disks, which would be less than 10 hours to replace a
> 3TB disk (in the worst case, if it is completely filled up).
> Unused/unallocated areas are skipped and additionally improve the
> rebuild speed.
>
> For missing disks, unfortunately the command invocation is not using the
> term "missing" but the numerical device-id instead of the device name.
> "missing" _is_ implemented in the kernel part of the replace code, but
> was simply forgotten in the user mode part, at least it was forgotten in
> the man page.

Hi Stefan,

thank you very much for the comprehensive info, I will opt to use replace next time.

Breaking news :-)

from
Jul 19 14:41:36 microserver kernel: [ 1134.244007] btrfs: relocating block group 8974430633984 flags 68
to
Jul 22 16:54:54 microserver kernel: [268419.463433] btrfs: relocating block group 2991474081792 flags 65

Rebuild ended before counting down to
So flight time was 3 days, and I see no more messages or btrfs processes utilizing CPU. So the rebuild seems ready.

Just a few hours ago another disk showed some early trouble, accumulating Current_Pending_Sector but no Reallocated_Sector_Ct yet.

TM
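The two kernel log lines TM quotes carry uptime stamps, so the wall-clock duration of the rebuild can be checked directly from them:

```python
# Elapsed rebuild time from the kernel uptime stamps in TM's log excerpt:
# [ 1134.244007] at the first relocation message quoted,
# [268419.463433] at the last one.
start_s = 1134.244007
end_s = 268419.463433

elapsed_s = end_s - start_s
elapsed_days = elapsed_s / 86400

print(f"{elapsed_s:.0f} s ≈ {elapsed_days:.1f} days")
# → 267285 s ≈ 3.1 days
```

This agrees with TM's "flight time was 3 days".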
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Jul 21, 2014, at 8:51 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> It does not matter at all what the average file size is.
>
> ... and the filesize /does/ matter.

I'm not sure how. A rebuild is replicating chunks, not doing the equivalent of cp or rsync on files. Copying chunks (or strips of chunks in the case of raid10) should be a rather sequential operation. So I'm not sure where the random write behavior would come from that could drop the write performance to ~5MB/s on drives that can read/write ~100MB/s.

>>> Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
>>> when doing a raid rebuild.
>
> ... And perfectly reasonable, at least at this point, to expect ~5 MiB/
> sec total thruput, one spindle at a time, for btrfs.

It's been a while since I did a rebuild on HDDs, but on SSDs the rebuilds have maxed out the replacement drive. Obviously the significant difference is rotational latency. If everyone with spinning disks and many small files is getting 5MB/s rebuilds, it suggests a rotational latency penalty, if that performance is to be expected. I'm just not sure where that would be coming from. Random IO would incur the effect of rotational latency, but the rebuild shouldn't be random IO, rather sequential.

Chris Murphy
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Tue, 22 Jul 2014 14:43:45 +0000 (UTC), TM wrote:
> Wang Shilong <...@cn.fujitsu.com> writes:
>
>> The latest btrfs-progs include the man page of btrfs-replace. Actually,
>> you could use it something like:
>>
>> btrfs replace start <srcdev>|<devid> <targetdev> <path>
>>
>> You could use 'btrfs file show' to see the missing device id, and then
>> run btrfs replace.
>
> Hi Wang,
>
> I physically removed the drive before the rebuild; having a failing device
> as a source is not a good idea anyway.
> Without the device in place, the device name is not showing up, since the
> missing device is not under /dev/sdXX or anything else.
>
> That is why I asked if the special parameter 'missing' may be used in a
> replace. I can't say if it is supported. But I guess not, since I found no
> documentation on this matter.
>
> So I guess replace is not aimed at fault tolerance / rebuilding. It's just
> a convenient way to, let's say, replace the disks with larger disks to
> extend your array. A convenience tool, not an emergency tool.

TM,

Just read the man-page. You could have used the replace tool after physically removing the failing device.

Quoting the man page:
"If the source device is not available anymore, or if the -r option is set, the data is built only using the RAID redundancy mechanisms.

Options
-r only read from <srcdev> if no other zero-defect mirror exists (enable this if your drive has lots of read errors, the access would be very slow)"

Concerning the rebuild performance, the access to the disk is linear for both reading and writing. I measured above 75 MByte/s at that time with regular 7200 RPM disks, which would be less than 10 hours to replace a 3TB disk (in the worst case, if it is completely filled up). Unused/unallocated areas are skipped and additionally improve the rebuild speed.

For missing disks, unfortunately the command invocation is not using the term "missing" but the numerical device-id instead of the device name. "missing" _is_ implemented in the kernel part of the replace code, but was simply forgotten in the user mode part; at least it was forgotten in the man page.
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Wang Shilong <...@cn.fujitsu.com> writes:
> The latest btrfs-progs include the man page of btrfs-replace. Actually,
> you could use it something like:
>
> btrfs replace start <srcdev>|<devid> <targetdev> <path>
>
> You could use 'btrfs file show' to see the missing device id, and then
> run btrfs replace.

Hi Wang,

I physically removed the drive before the rebuild; having a failing device as a source is not a good idea anyway. Without the device in place, the device name is not showing up, since the missing device is not under /dev/sdXX or anything else.

That is why I asked if the special parameter 'missing' may be used in a replace. I can't say if it is supported. But I guess not, since I found no documentation on this matter.

So I guess replace is not aimed at fault tolerance / rebuilding. It's just a convenient way to, let's say, replace the disks with larger disks to extend your array. A convenience tool, not an emergency tool.

TM

> Thanks,
> Wang
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA
>>> drives), an 8.3ms rotational latency (half a rotation), an average
>>> 64kb write and a 100MB/S streaming write speed, each write comes in
>>> at ~21ms, which gives us ~47 IOPS. With the 64KB write size, this
>>> comes out to ~3MB/S, DISK LIMITED.
>>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he
>>> says he has.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the
> filesystem layer.

If we were talking about a normal raid, yes. But we're talking about btrFS, note the FS for filesystem, so indeed it *IS* the filesystem layer. Now this particular "filesystem" /does/ happen to have raid properties as well, but it's definitely filesystem level...

> It does not matter at all what the average file size is.

... and the filesize /does/ matter.

> Raid rebuild is really only limited by disk i/o speed when performing a
> linear read of the whole spindle using huge i/o sizes,
> or, if you have multiple spindles on the same bus, the bus saturation
> speed.

Makes sense... if you're dealing at the raid level. If we were talking about dmraid or mdraid... and they're both much more mature and optimized, as well, so 50 MiB/sec, per spindle in parallel, would indeed be a reasonable expectation for them.

But (barring bugs, which will and do happen at this stage of development) btrfs both makes far better data validity guarantees, and does a lot more complex processing what with COW and snapshotting, etc, of course in addition to the normal filesystem level stuff AND the raid-level stuff it does.

> Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
> when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/sec total thruput, one spindle at a time, for btrfs.

> That is for the naive rebuild that rebuilds every single stripe. A
> smarter rebuild that knows which stripes are unused can skip the unused
> stripes and thus become even faster than that.
>
> Now, that the rebuild is off by an order of magnitude is by design but
> should be fixed at some stage, but with the current state of btrfs it is
> probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never get to full streaming speed across all spindles at once. But it can and will certainly get much better than it is, once the focus moves to optimization. *AND*, because it /does/ know which areas of the device are actually in use, once btrfs is optimized, it's quite likely that despite the slower raw speed, because it won't have to deal with the unused area, at least with the typically 20-60% unused filesystems most people run, rebuild times will match or be faster than raid-layer-only technologies that must rebuild the entire device, because they do /not/ know which areas are unused.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/21/2014 10:00 PM, TM wrote:
> Wang Shilong <...@cn.fujitsu.com> writes:
>> Just my two cents:
>>
>> Since 'btrfs replace' supports RAID10, I suppose using the replace
>> operation is better than 'device removal and add'.
>>
>> Another question is related to btrfs snapshot-aware balance.
>> How many snapshots did you have in your system?
>>
>> Of course, during balance/resize/device removal operations, you could
>> still snapshot, but fewer snapshots should speed things up!
>>
>> Anyway 'btrfs replace' is implemented more effectively than
>> 'device removal and add'.
>
> Hi Wang,
>
> just one subvolume, no snapshots or anything else.
>
> device replace: to tell you the truth, I have not used it in the past.
> Most of my testing was done 2 years ago, so on this 'kind of production'
> system I did not try it. But if I knew that it was faster, perhaps I
> could have used it. Does anyone have statistics for such a replace and
> the time it takes?

I don't have specific statistics about this. The conclusion comes from implementation differences between replace and 'device removal'.

> Also, can replace be used when one device is missing? Can't find
> documentation.
> eg. btrfs replace start missing /dev/sdXX

The latest btrfs-progs include the man page of btrfs-replace. Actually, you could use it something like:

btrfs replace start <srcdev>|<devid> <targetdev> <path>

You could use 'btrfs file show' to see the missing device id, and then run btrfs replace.

Thanks,
Wang

> TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Jul 21, 2014, at 10:46 AM, ronnie sahlberg wrote:
> On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>>
>>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>>> DISK LIMITED.
>>>
>>> The 5MB/S that TM is seeing is fine, considering the small files he says
>>> he has.
>>
>> Thanks for the additional numbers supporting my point. =:^)
>>
>> I had run some of the numbers but not to the extent you just did, so I
>> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
>> range of expectation for spinning rust, given the current state of
>> optimization... or more accurately the lack thereof, due to the focus
>> still being on features.
>
> That is actually nonsense.
> Raid rebuild operates on the block/stripe layer and not on the filesystem
> layer.

Not on Btrfs. It is on the filesystem layer. However, a rebuild is about replicating metadata (up to 256MB) and data (up to 1GB) chunks. For raid10, those are further broken down into 64KB strips. So the smallest "unit" for replication during a rebuild on Btrfs would be 64KB.

Anyway, 5MB/s seems really low to me, so I'm suspicious something else is going on. I haven't done a rebuild in a couple of months, but my recollection is that it's always been as fast as the write performance of a single device in the btrfs volume. I'd be looking in dmesg for any of the physical drives being reset, or having read or write errors, and I'd do some individual drive testing to see if the problem can be isolated.

And if that's not helpful, well, this is a really tedious and verbose amount of information, but capturing the actual commands going to the physical devices might reveal some issue:
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34886.html

My expectation (i.e. I'm guessing) based on previous testing is that whether raid1 or raid10, the actual read/write commands will each be 256KB in size. Btrfs rebuild is basically designed to be a sequential operation. This could maybe fall apart if there were somehow many minimally full chunks, which is probably unlikely.

Chris Murphy
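To put the unit sizes Chris mentions in one place (the 256 KB per-command figure is, as he says, his own guess from earlier testing):

```python
# Unit sizes mentioned for a Btrfs raid10 rebuild:
CHUNK_DATA = 1 * 2**30      # data chunks: up to 1 GiB
CHUNK_META = 256 * 2**20    # metadata chunks: up to 256 MiB
STRIP      = 64 * 2**10     # raid10 strip: 64 KiB, the smallest replication unit
IO_CMD     = 256 * 2**10    # guessed read/write command size: 256 KiB

print(CHUNK_DATA // STRIP)  # strips in one full data chunk
print(IO_CMD // STRIP)      # strips covered by a single command
```

So even if every chunk were minimally full, each command still spans several strips, which is why a rebuild should stay largely sequential.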
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Sun, Jul 20, 2014 at 7:48 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
>
>> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
>> an 8.3ms rotational latency (half a rotation), an average 64kb write and
>> a 100MB/S streaming write speed, each write comes in at ~21ms, which
>> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
>> DISK LIMITED.
>>
>> The 5MB/S that TM is seeing is fine, considering the small files he says
>> he has.
>
> Thanks for the additional numbers supporting my point. =:^)
>
> I had run some of the numbers but not to the extent you just did, so I
> didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
> range of expectation for spinning rust, given the current state of
> optimization... or more accurately the lack thereof, due to the focus
> still being on features.

That is actually nonsense. Raid rebuild operates on the block/stripe layer and not on the filesystem layer. It does not matter at all what the average file size is.

Raid rebuild is really only limited by disk i/o speed when performing a linear read of the whole spindle using huge i/o sizes, or, if you have multiple spindles on the same bus, the bus saturation speed. Thus it is perfectly reasonable to expect ~50MByte/second, per spindle, when doing a raid rebuild.

That is for the naive rebuild that rebuilds every single stripe. A smarter rebuild that knows which stripes are unused can skip the unused stripes and thus become even faster than that.

Now, that the rebuild is off by an order of magnitude is by design but should be fixed at some stage; with the current state of btrfs it is probably better to focus on other more urgent areas first.
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Wang Shilong <...@cn.fujitsu.com> writes:
> Just my two cents:
>
> Since 'btrfs replace' supports RAID10, I suppose using the replace
> operation is better than 'device removal and add'.
>
> Another question is related to btrfs snapshot-aware balance.
> How many snapshots did you have in your system?
>
> Of course, during balance/resize/device removal operations,
> you could still snapshot, but fewer snapshots should speed things up!
>
> Anyway 'btrfs replace' is implemented more effectively than
> 'device removal and add'.

Hi Wang,

just one subvolume, no snapshots or anything else.

device replace: to tell you the truth, I have not used it in the past. Most of my testing was done 2 years ago, so on this 'kind of production' system I did not try it. But if I knew that it was faster, perhaps I could have used it. Does anyone have statistics for such a replace and the time it takes?

Also, can replace be used when one device is missing? Can't find documentation.
eg. btrfs replace start missing /dev/sdXX

TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

> If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
> an 8.3ms rotational latency (half a rotation), an average 64kb write and
> a 100MB/S streaming write speed, each write comes in at ~21ms, which
> gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S,
> DISK LIMITED.
>
> The 5MB/S that TM is seeing is fine, considering the small files he says
> he has.

Thanks for the additional numbers supporting my point. =:^)

I had run some of the numbers but not to the extent you just did, so I didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the range of expectation for spinning rust, given the current state of optimization... or more accurately the lack thereof, due to the focus still being on features.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Hi,

On 07/20/2014 04:45 PM, TM wrote:
> Hi, I have a raid10 with 4x 3TB disks on a microserver
> http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8Gb RAM.
> Recently one disk started to fail (smart errors), so I replaced it.
> Mounted as degraded, added new disk, removed old. Started yesterday.
> I am monitoring /var/log/messages and it seems it will take a long time.
> Started at about 8010631739392
> And 20 hours later I am at 6910631739392
> btrfs: relocating block group 6910631739392 flags 65
> At this rate it will take a week to complete the raid rebuild!!!

Just my two cents:

Since 'btrfs replace' supports RAID10, I suppose using the replace operation is better than 'device removal and add'.

Another question is related to btrfs snapshot-aware balance. How many snapshots did you have in your system?

Of course, during balance/resize/device removal operations, you could still snapshot, but fewer snapshots should speed things up!

Anyway 'btrfs replace' is implemented more effectively than 'device removal and add'. :-)

Thanks,
Wang

> Furthermore it seems that the operation is getting slower and slower.
> When the rebuild started I had a new message every half a minute, now
> it's getting to one and a half minutes.
> Most files are small files like flac/jpeg.
> One week for a raid10 rebuild of 4x 3TB drives is a very long time.
> Any thoughts? Can you share any statistics from your RAID10 rebuilds?
> If I shut down the system before the rebuild, what is the proper
> procedure to remount it? Again degraded? Or normally?
> Can the process of rebuilding the raid continue after a reboot? Will it
> survive, and continue rebuilding?
> Thanks in advance
> TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/20/2014 02:28 PM, Bob Marley wrote:
> On 20/07/2014 21:36, Roman Mamedov wrote:
>> On Sun, 20 Jul 2014 21:15:31 +0200 Bob Marley wrote:
>>> Hi TM, are you doing other significant filesystem activity during this
>>> rebuild, especially random accesses? This can reduce performance a lot
>>> on HDDs. E.g. if you were doing strenuous multithreaded random writes
>>> in the meanwhile, I could expect even less than 5MB/sec overall...
>>
>> I believe the problem here might be that a Btrfs rebuild *is* a
>> strenuous random read (+ random-ish write) just by itself.
>> Mdadm-based RAID would rebuild the array reading/writing disks in a
>> completely linear manner, and it would finish an order of magnitude
>> faster.
>
> Now this explains a lot!
> So they would just need to be sorted? Sorting the files of a disk from
> lowest to highest block number prior to starting reconstruction seems
> feasible. Maybe not all of them together, because they will be millions,
> but sorting them in chunks of 1000 files would still produce a very
> significant speedup!

As I understand the problem, it has to do with where btrfs is in the overall development process. There are a LOT of opportunities for optimization, but optimization cannot begin until btrfs is feature complete, because any work done beforehand would be wasted effort, in that it would likely have to be repeated after being broken by feature enhancements. So now it is a waiting game for completion of all the major features (like additional RAID levels and possible n-way options, etc.) before optimization efforts can begin. Once that happens we will likely see HUGE gains in efficiency and speed, but until then we are kind of stuck in this position where it "works" but leaves somewhat to be desired.

I think this is one reason developers often caution users not to expect too much from btrfs at this point. It's just not there yet, and it will still be some time before it is.
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 20/07/2014 21:36, Roman Mamedov wrote:
> On Sun, 20 Jul 2014 21:15:31 +0200 Bob Marley wrote:
>> Hi TM, are you doing other significant filesystem activity during this
>> rebuild, especially random accesses? This can reduce performance a lot
>> on HDDs. E.g. if you were doing strenuous multithreaded random writes
>> in the meanwhile, I could expect even less than 5MB/sec overall...
>
> I believe the problem here might be that a Btrfs rebuild *is* a strenuous
> random read (+ random-ish write) just by itself.
> Mdadm-based RAID would rebuild the array reading/writing disks in a
> completely linear manner, and it would finish an order of magnitude
> faster.

Now this explains a lot!

So they would just need to be sorted? Sorting the files of a disk from lowest to highest block number prior to starting reconstruction seems feasible. Maybe not all of them together, because they will be millions, but sorting them in chunks of 1000 files would still produce a very significant speedup!
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
This is the cause of the slow reconstruct:

> I believe the problem here might be that a Btrfs rebuild *is* a strenuous
> random read (+ random-ish write) just by itself.

If you assume a 12ms average seek time (normal for 7200RPM SATA drives), an 8.3ms rotational latency (half a rotation), an average 64kb write and a 100MB/S streaming write speed, each write comes in at ~21ms, which gives us ~47 IOPS. With the 64KB write size, this comes out to ~3MB/S, DISK LIMITED.

The on-disk cache helps a bit during startup, but once the cache is full, it's back to writes at disk speed, with some small gains if the on-disk controller can schedule the writes efficiently. Based on the single-threaded I/O that BTRFS uses during a reconstruct, I expect that the average write size is somewhere around 200KB.

Multi-threading the reconstruct disk I/O (possibly adding look-ahead) would double the reconstruct speed for this array, but that's not a trivial task.

The 5MB/S that TM is seeing is fine, considering the small files he says he has.

Peter Ashford
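Peter's per-write cost model above is easy to reproduce; a small sketch of the same arithmetic (only his stated assumptions, with minor rounding differences from his ~47 IOPS / ~3 MB/s):

```python
# Per-write cost model for a 7200 RPM SATA drive, per Peter Ashford's figures:
seek_ms     = 12.0             # average seek time
rot_ms      = 8.3              # rotational latency (half a rotation)
write_bytes = 64 * 1024        # average 64 KB write
stream_bps  = 100 * 2**20      # 100 MB/s streaming write speed

transfer_ms = write_bytes / stream_bps * 1000   # ≈ 0.6 ms to move the data
total_ms    = seek_ms + rot_ms + transfer_ms    # ≈ 21 ms per write

iops = 1000 / total_ms                          # ≈ 47-48 IOPS
throughput_mbs = iops * write_bytes / 2**20     # ≈ 3 MB/s, disk limited
print(f"{iops:.1f} IOPS, {throughput_mbs:.1f} MB/s")
```

Note the result is dominated by seek plus rotational latency; the 0.6 ms data transfer is almost irrelevant, which is why small random writes are so punishing on spinning disks.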
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Sun, 20 Jul 2014 21:15:31 +0200 Bob Marley wrote:
> Hi TM, are you doing other significant filesystem activity during this
> rebuild, especially random accesses?
> This can reduce performance a lot on HDDs.
> E.g. if you were doing strenuous multithreaded random writes in the
> meanwhile, I could expect even less than 5MB/sec overall...

I believe the problem here might be that a Btrfs rebuild *is* a strenuous random read (+ random-ish write) just by itself.

Mdadm-based RAID would rebuild the array reading/writing disks in a completely linear manner, and it would finish an order of magnitude faster.

--
With respect,
Roman
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 20/07/2014 10:45, TM wrote:
> Hi, I have a raid10 with 4x 3TB disks on a microserver
> http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8Gb RAM.
> Recently one disk started to fail (smart errors), so I replaced it.
> Mounted as degraded, added new disk, removed old. Started yesterday.
> I am monitoring /var/log/messages and it seems it will take a long time.
> Started at about 8010631739392
> And 20 hours later I am at 6910631739392
> btrfs: relocating block group 6910631739392 flags 65
> At this rate it will take a week to complete the raid rebuild!!!
> Furthermore it seems that the operation is getting slower and slower.
> When the rebuild started I had a new message every half a minute, now
> it's getting to one and a half minutes.
> Most files are small files like flac/jpeg.

Hi TM, are you doing other significant filesystem activity during this rebuild, especially random accesses? This can reduce performance a lot on HDDs. E.g. if you were doing strenuous multithreaded random writes in the meanwhile, I could expect even less than 5MB/sec overall...
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
<...@whisperpc.com> writes:
> Finally, TM didn't mention anything about other I/O activity on the array,
> which, regardless of the method of reconstruction, could have a
> significant impact on the speed of a reconstruction.
>
> There are a LOT of parameters here that could impact throughput. Some are
> designed in (the checksum computations), some are "temporary" (the single
> I/O path) and some are end-user issues (slow CPU and other activity on the
> array). I'm sure that there are other parameters, possibly including soft
> read errors on the other disks, that could be impacting the overall
> throughput.
>
> As all the information that could affect performance hasn't been provided
> yet, it is premature to make a blanket statement that the performance of a
> reconstruction is "unreasonable". For the circumstances, it's possible
> that the performance is just fine. We have not yet been provided with
> enough data to verify this, one way or another.
>
> Peter Ashford

Hi Peter,

I am monitoring closely; I do not have any 'soft errors' or anything else. No other users are using the raid10 array, and there are no other compute/disk intensive tasks. So I am providing data - usual and typical data at least. To answer another question: yes, I do have backups.

So this, I guess, is the speed most users get, which is in my opinion very slow. And I guess that's around the speed most users will get with a similar 4x 3TB build with low power CPUs. Long recovery time leads to higher risk of another disk failure or other unexpected failures.

TM
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
Tomasz,

> On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
>> TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:
>>
>>> One week for a raid10 rebuild 4x3TB drives is a very long time.
>>> Any thoughts?
>>> Can you share any statistics from your RAID10 rebuilds?
>>
>> At a week, that's nearly 5 MiB per second, which isn't great, but
>> isn't entirely out of the realm of reason either, given all the
>> processing it's doing. A day would be 33.11+, reasonable thruput for
>> a straight copy, and a raid rebuild is rather more complex than a
>> straight copy, so...
>
> Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is
> order-of-magnitude unreasonable. And "all the processing" shouldn't
> even show as a blip on modern CPUs.
> This "speed" is undefendable.

Maybe, and maybe not. It's already been discussed that BTRFS serializes its disk I/O instead of performing it in parallel. That can have a significant impact on overall throughput.

In addition, depending on how BTRFS does a reconstruct, it's possible that there's another problem. Most low-level RAID implementations (the MD driver and RAID controller cards) reconstruct a segment at a time (or possibly multiple segments), and reconstruct the entire disk, even the unused portions. ZFS is different, in that it reconstructs only the used sections. This behavior reads and writes less data, but it incurs more random I/O. The random I/O can result in more seeks (depending on how busy the system is), which could slow down the reconstruction, especially on an idle system. As ZFS was designed for high-usage environments, where a seek would likely be needed anyway to get the head back to the data being reconstructed, the randomness of the reconstruct has no significant impact there.

The question implied above is which method of reconstruction BTRFS uses. Could someone please inform us?
As BTRFS is verifying the checksum on the read, and possibly generating a new checksum on the write, the CPU time used will be significantly greater than for a "dumb" RAID-10 array. If it's not a sequential segment reconstruction, the CPU time needed would be greater still. WRT CPU performance, we should remember that TM's system is far from a speed demon.

Finally, TM didn't mention anything about other I/O activity on the array, which, regardless of the method of reconstruction, could have a significant impact on the speed of a reconstruction.

There are a LOT of parameters here that could impact throughput. Some are designed in (the checksum computations), some are "temporary" (the single I/O path) and some are end-user issues (slow CPU and other activity on the array). I'm sure that there are other parameters, possibly including soft read errors on the other disks, that could be impacting the overall throughput.

As all the information that could affect performance hasn't been provided yet, it is premature to make a blanket statement that the performance of a reconstruction is "unreasonable". For the circumstances, it's possible that the performance is just fine. We have not yet been provided with enough data to verify this, one way or another.

Peter Ashford
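The whole-disk vs. used-extents trade-off described above can be made concrete with a rough calculation. Every figure below (used fraction, sequential rate, seek penalty) is a hypothetical assumption chosen only to illustrate the arithmetic, not a measurement from TM's system:

```python
# Hypothetical figures only: compare rebuilding every segment of a 3 TB
# disk against rebuilding only the allocated extents, with a penalty for
# the extra seeking that used-extent reconstruction implies.
DISK_BYTES = 3 * 1000**4   # 3 TB drive (vendor, power-of-10 TB)
USED_FRACTION = 0.4        # assume 40% of the disk is allocated
SEQ_RATE = 100 * 2**20     # assume ~100 MiB/s sustained sequential I/O
SEEK_PENALTY = 0.5         # assume seeking halves effective throughput

full_hours = DISK_BYTES / SEQ_RATE / 3600
used_hours = DISK_BYTES * USED_FRACTION / (SEQ_RATE * SEEK_PENALTY) / 3600

print(f"whole-disk rebuild: {full_hours:.1f} h")
print(f"used-only rebuild:  {used_hours:.1f} h")
```

With these particular assumptions the used-only rebuild still wins, but the break-even point moves with the used fraction and the seek penalty, which is exactly why "which method is faster" depends on how full and how busy the array is.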
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/20/2014 10:00 AM, Tomasz Torcz wrote:
> On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
>> TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:
>>
>>> One week for a raid10 rebuild 4x3TB drives is a very long time.
>>> Any thoughts?
>>> Can you share any statistics from your RAID10 rebuilds?
>>
>> At a week, that's nearly 5 MiB per second, which isn't great, but
>> isn't entirely out of the realm of reason either, given all the
>> processing it's doing. A day would be 33.11+, reasonable thruput for
>> a straight copy, and a raid rebuild is rather more complex than a
>> straight copy, so...
>
> Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is
> order-of-magnitude unreasonable. And "all the processing" shouldn't
> even show as a blip on modern CPUs.
> This "speed" is undefendable.

I wholly agree that it's undefendable, but I can tell you why it is so slow. It's not 'all the processing' (which is maybe a few hundred instructions on x86 for each block); it's because BTRFS still serializes writes to devices instead of queuing all of them in parallel (that is, when there are four devices that need to be written to, it writes to each one in sequence, waiting for the previous write to finish before dispatching the next one). Personally, I would love to see this behavior improved, but I really don't have any time to work on it myself.
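The serialization described above is easy to demonstrate in miniature. The sketch below is not btrfs code; it simulates four "device writes" with sleeps to show why dispatching them one at a time costs roughly N times the latency of dispatching them concurrently:

```python
# Illustrative sketch (not btrfs code): serialized vs. parallel dispatch
# of writes across N devices. Each "device write" is a sleep standing in
# for blocking I/O latency.
import time
from concurrent.futures import ThreadPoolExecutor

DEVICES = 4
WRITE_LATENCY = 0.05  # seconds per simulated write

def write_to_device(dev):
    time.sleep(WRITE_LATENCY)  # stand-in for a blocking disk write
    return dev

# Serialized: dispatch the next write only after the previous completes.
t0 = time.monotonic()
for dev in range(DEVICES):
    write_to_device(dev)
serial = time.monotonic() - t0

# Parallel: queue all writes at once; the "devices" work concurrently.
t0 = time.monotonic()
with ThreadPoolExecutor(max_workers=DEVICES) as pool:
    list(pool.map(write_to_device, range(DEVICES)))
parallel = time.monotonic() - t0

print(f"serial:   {serial:.2f}s")
print(f"parallel: {parallel:.2f}s")
```

The ~4x gap matches the intuition in the thread that parallelizing could give a 2-4X speedup on a 4-device raid10, modulo real-world bus and controller contention.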
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
> TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:
>
>> One week for a raid10 rebuild 4x3TB drives is a very long time.
>> Any thoughts?
>> Can you share any statistics from your RAID10 rebuilds?
>
> At a week, that's nearly 5 MiB per second, which isn't great, but
> isn't entirely out of the realm of reason either, given all the
> processing it's doing. A day would be 33.11+, reasonable thruput for a
> straight copy, and a raid rebuild is rather more complex than a
> straight copy, so...

Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is order-of-magnitude unreasonable. And "all the processing" shouldn't even show as a blip on modern CPUs. This "speed" is undefendable.

--
Tomasz Torcz              RIP is irrevelant. Spoofing is futile.
xmpp: zdzich...@chrome.pl Your routes will be aggreggated. -- Alex Yuriev
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:

> One week for a raid10 rebuild 4x3TB drives is a very long time.
> Any thoughts?
> Can you share any statistics from your RAID10 rebuilds?

Well, 3 TB is big and spinning rust is slow. Even using the smaller power-of-10 (1000) figures for TB that the device manufacturers use, instead of the power-of-2 (1024, TiB) figures that are common in computing:

  3 * 1000^4 bytes / 1024^2 (bytes per MiB) / (7 days * 24 hr * 3600 sec) = 4.73+ MiB/sec

At a week, that's nearly 5 MiB per second, which isn't great, but isn't entirely out of the realm of reason either, given all the processing it's doing. A day would be 33.11+, reasonable thruput for a straight copy, and a raid rebuild is rather more complex than a straight copy, so...

Which is one reason a lot of people are using partitioning to break those huge numbers down into something a bit more manageable in reasonable time these days, or switching to much faster, if also much more expensive per GiB (USD 50 cents to $1 per gig vs 5-10 cents per gig), SSDs.

And btrfs is still under development so hasn't been really optimized yet and is thus slower than necessary -- in particular, it often serializes multi-device processing where, given that the bottleneck is normally device IO, an optimized algorithm would parallel-process all devices at once. Just parallelizing the algorithm could give it a 2-4X speed increase on a 4-device raid10. So you're right, it /is/ slow.

> If I shut down the system, before the rebuild, what is the proper
> procedure to remount it? Again degraded? Or normally? Can the process
> of rebuilding the raid continue after a reboot? Will it survive, and
> continue rebuilding?

Raid10 requires four devices for an undegraded layout, but I /think/ once it has a fourth device added back in, you should be able to mount it undegraded, as it can write changes to four devices at that point. Tho I'm not positive about that.
I'd try mounting it undegraded here, and if it worked, great; if not, I'd mount it degraded again.

Regardless of that, however, barring bugs, provided you shut down properly (unmounting the filesystem, etc.), a shutdown and reboot should be fine, and it should continue where it left off after the reboot, as internally it's simply doing a rebalance of existing data to include the new device, and btrfs is designed to gracefully shut down in the middle of a rebalance and restart it on reboot when necessary.

Tho don't expect umount and shutdown to be instantaneous. After you issue the umount command, it shouldn't start any new chunk balances, but it could require a bit to finish the balances in flight at the time you issued the shutdown. If it takes more than a few minutes, however, there's a bug. FWIW, data chunks are a GiB in size, which at the calculated rate of a bit under 5 MiB/sec should be ~205 seconds, or roughly 3.5 minutes. Doubling that and a bit more to be safe, I'd say wait 10 minutes or so. If it hasn't properly unmounted after 10 minutes, you likely have a bug and may have to recover after a reboot.

With btrfs still under heavy development, backups are STRONGLY recommended, so I hope you have them, but at this point anyway, while it's slow going, there's no indication that you'll actually need to use those backups. Just expect the umount to take a bit.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
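Duncan's back-of-the-envelope figures above are easy to reproduce; the short calculation below simply restates his arithmetic (3 TB in vendor power-of-10 bytes, throughput expressed in power-of-2 MiB/s):

```python
# Restating the arithmetic from the post: 3 TB moved in one week vs. in
# one day, expressed in MiB/s.
TB = 1000**4   # vendors count bytes in powers of 10
MiB = 1024**2  # throughput is conventionally quoted in powers of 2

week_rate = 3 * TB / MiB / (7 * 24 * 3600)
day_rate = 3 * TB / MiB / (24 * 3600)

print(f"one week: {week_rate:.2f} MiB/s")  # the 4.73+ figure in the post
print(f"one day:  {day_rate:.2f} MiB/s")   # the 33.11+ figure in the post
```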
1 week to rebuid 4x 3TB raid10 is a long time!
Hi,

I have a raid10 with 4x 3TB disks on a microserver (http://n40l.wikia.com/wiki/Base_Hardware_N54L), 8GB RAM.

Recently one disk started to fail (SMART errors), so I replaced it: mounted as degraded, added the new disk, removed the old one. That started yesterday. I am monitoring /var/log/messages and it seems it will take a long time. It started at about 8010631739392, and 20 hours later I am at 6910631739392:

  btrfs: relocating block group 6910631739392 flags 65

At this rate it will take a week to complete the raid rebuild! Furthermore, the operation seems to be getting slower and slower: when the rebuild started I had a new message every half a minute, now it's up to one and a half minutes. Most files are small files like flac/jpeg.

One week for a raid10 rebuild of 4x 3TB drives is a very long time.
Any thoughts?
Can you share any statistics from your RAID10 rebuilds?

If I shut down the system, before the rebuild, what is the proper procedure to remount it? Again degraded? Or normally? Can the process of rebuilding the raid continue after a reboot? Will it survive, and continue rebuilding?

Thanks in advance
TM
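TM's by-hand extrapolation can be scripted. The sketch below is a hypothetical helper, not an existing tool: it takes the two block-group addresses from the log (which count down toward zero) and the elapsed time, and extrapolates a completion estimate. Note that address-space progress overstates bytes actually written, since skipped unallocated regions advance the address too:

```python
# Hypothetical ETA estimate from the two log samples in the post.
# "btrfs: relocating block group N" addresses count down, so
# progress = start_address - current_address.
start_bg = 8_010_631_739_392    # address when the rebuild began
current_bg = 6_910_631_739_392  # address 20 hours later
elapsed_s = 20 * 3600

covered = start_bg - current_bg       # address space covered so far
rate = covered / elapsed_s            # bytes of address space per second
eta_days = current_bg / rate / 86400  # remaining span at that rate

print(f"progress rate: {rate / 2**20:.1f} MiB/s of address space")
print(f"estimated remaining: {eta_days:.1f} days")
```

With TM's numbers this comes out to roughly five more days, consistent with the week-long estimate in the thread; the address-space rate being higher than the ~5 MiB/s write figure may simply reflect unallocated space being skipped.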