Re: RAID56 status?

2017-01-24 Thread Niccolò Belli

+1

On Tuesday, 24 January 2017 at 00:31:42 CET, Christoph Anton Mitterer wrote:

On Mon, 2017-01-23 at 18:18 -0500, Chris Mason wrote:

We've been focusing on the single-drive use cases internally.  This year
that's changing as we ramp up more users in different places.
Performance/stability work and raid5/6 are the top of my list right now.

+1

It would be nice to get some feedback on what happens behind the scenes...
actually, I think a regular btrfs development blog would generally be a
nice thing :)

Cheers,
Chris.





Re: RAID56 status?

2017-01-23 Thread Christoph Anton Mitterer
On Mon, 2017-01-23 at 18:18 -0500, Chris Mason wrote:
> We've been focusing on the single-drive use cases internally.  This year
> that's changing as we ramp up more users in different places.
> Performance/stability work and raid5/6 are the top of my list right now.
+1

It would be nice to get some feedback on what happens behind the scenes...
actually, I think a regular btrfs development blog would generally be a
nice thing :)

Cheers,
Chris.



Re: RAID56 status?

2017-01-23 Thread Chris Mason

On Mon, Jan 23, 2017 at 06:53:21PM +0100, Christoph Anton Mitterer wrote:

Just wondered... is there any larger known RAID56 deployment? I mean
something with real-world production systems and ideally many different
IO scenarios, failures, pulling disks randomly and perhaps even so
many disks that it's also likely to hit something like silent data
corruption (on the disk level)?

Has CM already migrated all of Facebook's storage to btrfs RAID56?! ;-)
Well, at least facebook.com still seems online ;-P *kidding*

I mean, the good thing about having such a massive production-like
environment - especially when it's not just one homogeneous usage
pattern - is that it would help build up quite some trust in the
code (once the already-known bugs are fixed).


We've been focusing on the single-drive use cases internally.  This year 
that's changing as we ramp up more users in different places.  
Performance/stability work and raid5/6 are the top of my list right now.


-chris


Re: RAID56 status?

2017-01-23 Thread Christoph Anton Mitterer
Just wondered... is there any larger known RAID56 deployment? I mean
something with real-world production systems and ideally many different
 IO scenarios, failures, pulling disks randomly and perhaps even so
many disks that it's also likely to hit something like silent data
corruption (on the disk level)?

Has CM already migrated all of Facebook's storage to btrfs RAID56?! ;-)
Well, at least facebook.com still seems online ;-P *kidding*

I mean, the good thing about having such a massive production-like
environment - especially when it's not just one homogeneous usage
pattern - is that it would help build up quite some trust in the
code (once the already-known bugs are fixed).



Cheers,
Chris.



Re: RAID56 status?

2017-01-23 Thread Janos Toth F.
On Mon, Jan 23, 2017 at 7:57 AM, Brendan Hide  wrote:
>
> raid0 stripes data in 64k chunks (I think this size is tunable) across all
> devices, which is generally far faster in terms of throughput in both
> writing and reading data.

I remember seeing some proposals for a configurable stripe size in the
form of patches (which changed a lot over time), but I don't think the
idea reached a consensus (let alone whether a final patch materialized
and got merged). I think it would be a nice feature, though.


Re: RAID56 status?

2017-01-23 Thread Brendan Hide


Hey, all

Long-time lurker/commenter here. Production-ready RAID5/6 and N-way 
mirroring are the two features I've been anticipating most, so I've 
commented regularly when this sort of thing pops up. :)


I'm only addressing some of the RAID-types queries as Qu already has a 
handle on the rest.


Small-yet-important hint: If you don't have a backup of it, it isn't 
important.


On 01/23/2017 02:25 AM, Jan Vales wrote:

[ snip ]
Correct me if I'm wrong...
* It seems raid1 (btrfs) is actually raid10, as there are never more than 2
copies of data, regardless of the number of devices.


The original "definition" of raid1 is two mirrored devices. The *nix 
industry standard implementation (mdadm) extends this to any number of 
mirrored devices. Thus confusion here is understandable.
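
For illustration, a sketch of the two behaviours (the device names are
hypothetical placeholders):

  # mdadm raid1 across 3 devices: a true 3-way mirror, every device
  # holds a full copy of the data
  mdadm --create /dev/md0 --level=1 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

  # btrfs "raid1" across the same 3 devices: still exactly 2 copies,
  # placed on whichever devices have the most unallocated space
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd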



** Is there a way to duplicate data n-times?


This is a planned feature, especially with an eye toward feature-parity
with mdadm, though the priority isn't particularly high right now. It has
been referred to as "N-way mirroring". The last time I recall discussion
of this, the hope was to start work on it after raid5/6 was stable.



** If there are only 3 devices and the wrong device dies... is it dead?


Qu has the right answers. Generally, if you're using anything other than
dup, raid0, or single, one disk failure is "okay". More than one failure
is closer to "undefined" - except with RAID6, where you need more than
two disk failures before you have lost data.
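
A rough summary of the guaranteed (worst-case) tolerances described
above:

  Profile            | Device losses survived
  -------------------+-----------------------
  single, dup, raid0 | 0
  raid1, raid10      | 1
  raid5              | 1
  raid6              | 2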



* What's the difference between raid1 (btrfs) and raid10 (btrfs)?


Some nice illustrations from Qu there. :)


** After reading like 5 different wiki pages, I understood that there
are differences ... but not what they are and how they affect me :/
* What's the difference between raid0 (btrfs) and "normal" multi-device
operation, which seems like a traditional raid0 to me?


raid0 stripes data in 64k chunks (I think this size is tunable) across 
all devices, which is generally far faster in terms of throughput in 
both writing and reading data.


By '"normal" multi-device' I will assume this means "single" with 
multiple devices. New writes with "single" will use a 1GB chunk on one 
device until the chunk is full, at which point it allocates a new chunk, 
which will usually be put on the disk with the most available free 
space. There is no particular optimisation in place comparable to raid0 
here.
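
A sketch of the two layouts side by side (hypothetical device names
again):

  # raid0: data striped in 64K stripes across all three devices
  mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc /dev/sdd

  # "single": no striping; 1GB data chunks are allocated on one
  # device at a time, usually the one with the most free space
  mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc /dev/sdd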




Maybe rename/alias raid-levels that do not match traditional
raid-levels, so one cannot expect behavior that is not there.



The extreme example is IMHO raid1 (btrfs) vs raid1.
I would expect that if I have 5 btrfs-raid1 devices, 4 may die and btrfs
should be able to fully recover, which, if I understand correctly, is
far from true.
If you named that raid-level, say, "george"... I would need to consult
the docs, and I obviously would not expect any particular behavior. :)


We've discussed this a couple of times. Hugo came up with a notation 
since dubbed "csp" notation: c->Copies, s->Stripes, and p->Parities.


Examples of this would be:
raid1: 2c
3-way mirroring across 3 (or more*) devices: 3c
raid0 (2-or-more-devices): 2s
raid0 (3-or-more): 3s
raid5 (5-or-more): 4s1p
raid16 (12-or-more): 2c4s2p

* Note the "or more": mdadm *cannot* have fewer mirrors or stripes than
devices, whereas there is no particular reason why btrfs won't be able
to do this.


A minor problem with csp notation is that it implies a complete 
implementation of *any* combination of these, whereas the idea was 
simply to create a way to refer to the "raid" levels in a consistent way.


I hope this brings some clarity. :)



regards,
Jan Vales
--
I only read plaintext emails.



--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: RAID56 status?

2017-01-22 Thread Qu Wenruo



At 01/23/2017 12:42 PM, Zane Zakraisek wrote:

Hi Qu,
I've seen a good amount of Raid56 patches come in from you on the
mailing list. Do these catch a large portion of the Raid56 bugs, or are
they only the beginning? :)


Hard to say, it could be just the tip of the iceberg, or the beginning
of the RAID56 doom.

What I can do is just fix bugs reported by users and let the patches
go through xfstests and internal test scripts.
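
For example, a typical xfstests run looks something like this (the
exact test groups and device config vary by setup):

  # from an xfstests checkout, with TEST_DEV/SCRATCH_DEV configured
  # for btrfs in local.config
  ./check -g auto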


So the patches just catch a large portion of the *known* RAID56 bugs; I
don't know how many are still hidden.


Thanks,
Qu



ZZ

On Sun, Jan 22, 2017, 6:34 PM Qu Wenruo wrote:

[ snip ]


Re: RAID56 status?

2017-01-22 Thread Qu Wenruo



At 01/23/2017 08:25 AM, Jan Vales wrote:

On 01/22/2017 11:39 PM, Hugo Mills wrote:

On Sun, Jan 22, 2017 at 11:35:49PM +0100, Christoph Anton Mitterer wrote:

On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote:

Therefore my question: what's the status of raid5/6 in btrfs?
Is it somehow "production"-ready by now?

AFAIK, what's on the - apparently no longer updated -
https://btrfs.wiki.kernel.org/index.php/Status still applies, and
RAID56 is not yet usable for anything near production.


   It's still all valid. Nothing's changed.

   How would you like it to be updated? "Nope, still broken"?

   Hugo.




I'd like to update the wiki to "More and more RAID5/6 bugs are found" :)

OK, no kidding: at least we did expose several new bugs, and reports
have already existed for a while on the mailing list.


Some examples are:

1) RAID5/6 scrub will repair data while corrupting parity
   Quite ironic: repairing just changes one corruption into
   another.

2) RAID5/6 scrub can report false alerts on csum error

3) Cancelling dev-replace can sometimes cause a kernel panic.

And if we find more bugs, I won't be surprised at all.

So, if you really want to use RAID5/6, please use soft RAID, then build
a single-volume btrfs on it.
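
For example, a sketch (hypothetical device names):

  # software RAID5 via md...
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

  # ...then a plain single-device btrfs on top of it
  mkfs.btrfs /dev/md0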


I'm seriously considering re-implementing btrfs RAID5/6 using device
mapper, which is tried and true.




As the changelog stops at 4.7, the wiki seemed a little dead - "still
broken as of $(date)" or something like that would be nice ^.^

Also, some more exact documentation/definition of btrfs' raid-levels
would be cool, as they don't seem to match traditional raid-levels - or
at least I, as an ignorant user, fail to understand them...


man mkfs.btrfs has quite a good table of the btrfs profiles.



Correct me if I'm wrong...
* It seems raid1 (btrfs) is actually raid10, as there are never more than 2
copies of data, regardless of the number of devices.


Somewhat right, although the stripe size of RAID10 is 64K while RAID1's
is the chunk size (normally 1G for data), and the large stripe size for
RAID1 makes it meaningless to call it RAID0.



** Is there a way to duplicate data n-times?


The only supported n-times duplication is 3 copies, which uses RAID6 on
3 devices, and I don't consider it safe compared to RAID1.
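
For concreteness, that layout would be created like this (a sketch with
hypothetical device names; as said, I don't consider it safe):

  # btrfs raid6 on exactly 3 devices: each stripe is 1 data strip
  # plus 2 parity strips, so every device effectively carries the data
  mkfs.btrfs -d raid6 -m raid6 /dev/sdb /dev/sdc /dev/sdd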



** If there are only 3 devices and the wrong device dies... is it dead?


For RAID1/10/5/6, theoretically it's still alive.
RAID5/6 of course has no problem with it.

For RAID1 there are always 2 mirrors, and the mirrors are always located
on different devices, so no matter which mirror dies, btrfs can still
read the data.


But in practice, it's btrfs, you know, right?


* What's the difference between raid1 (btrfs) and raid10 (btrfs)?


RAID1: Pure mirror, no striping

  Disk 1                     |  Disk 2
  Data Data Data Data Data   |  Data Data Data Data Data
  \_________________________/
         one full chunk

Since chunks are always allocated to the device with the most
unallocated space, you can consider it extent-level RAID1 with
chunk-level RAID0.
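
You can watch this chunk-level allocation on a live filesystem with,
for example (assuming it is mounted at /mnt):

  # shows per-device allocation of data/metadata chunks
  btrfs filesystem usage /mnt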


RAID10: RAID1 first, then RAID0
IIRC RAID0 stripe size is 64K

Disk 1 | Data 1 (64K) Data 4 (64K)
Disk 2 | Data 1 (64K) Data 4 (64K)
---
Disk 3 | Data 2 (64K)
Disk 4 | Data 2 (64K)
---
Disk 5 | Data 3 (64K)
Disk 6 | Data 3 (64K)



** After reading like 5 different wiki pages, I understood that there
are differences ... but not what they are and how they affect me :/


Chunk-level striping won't have any obvious performance advantage,
while 64K-level striping does.



* What's the difference between raid0 (btrfs) and "normal" multi-device
operation, which seems like a traditional raid0 to me?


What's "normal" or traditional RAID0?
Doesn't it uses all devices for striping? Or just uses 2?



Btrfs RAID0 always uses a 64K stripe size (and not only RAID0, but also
RAID10/5/6).


Btrfs chunk allocation also provides chunk-size-level striping, which
is 1G for data (assuming your fs is larger than 10G) or 256M for
metadata.


But that striping size won't provide anything useful, so you can just
forget about the chunk-level thing.

Apart from that, btrfs RAID should match normal RAID quite well.

Thanks,
Qu



Maybe rename/alias raid-levels that do not match traditional
raid-levels, so one cannot expect behavior that is not there.
The extreme example is IMHO raid1 (btrfs) vs raid1.
I would expect that if I have 5 btrfs-raid1 devices, 4 may die and btrfs
should be able to fully recover, which, if I understand correctly, is
far from true.
If you named that raid-level, say, "george"... I would need to consult
the docs, and I obviously would not expect any particular behavior. :)

regards,
Jan Vales
--
I only read plaintext emails.






Re: RAID56 status?

2017-01-22 Thread Jan Vales
On 01/22/2017 11:39 PM, Hugo Mills wrote:
> On Sun, Jan 22, 2017 at 11:35:49PM +0100, Christoph Anton Mitterer wrote:
>> On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote:
>>> Therefore my question: what's the status of raid5/6 in btrfs?
>>> Is it somehow "production"-ready by now?
>> AFAIK, what's on the - apparently no longer updated -
>> https://btrfs.wiki.kernel.org/index.php/Status still applies, and
>> RAID56 is not yet usable for anything near production.
> 
>It's still all valid. Nothing's changed.
> 
>How would you like it to be updated? "Nope, still broken"?
> 
>Hugo.
> 
> 

As the changelog stops at 4.7, the wiki seemed a little dead - "still
broken as of $(date)" or something like that would be nice ^.^

Also, some more exact documentation/definition of btrfs' raid-levels
would be cool, as they don't seem to match traditional raid-levels - or
at least I, as an ignorant user, fail to understand them...

Correct me if I'm wrong...
* It seems raid1 (btrfs) is actually raid10, as there are never more than 2
copies of data, regardless of the number of devices.
** Is there a way to duplicate data n-times?
** If there are only 3 devices and the wrong device dies... is it dead?
* What's the difference between raid1 (btrfs) and raid10 (btrfs)?
** After reading like 5 different wiki pages, I understood that there
are differences ... but not what they are and how they affect me :/
* What's the difference between raid0 (btrfs) and "normal" multi-device
operation, which seems like a traditional raid0 to me?

Maybe rename/alias raid-levels that do not match traditional
raid-levels, so one cannot expect behavior that is not there.
The extreme example is IMHO raid1 (btrfs) vs raid1.
I would expect that if I have 5 btrfs-raid1 devices, 4 may die and btrfs
should be able to fully recover, which, if I understand correctly, is
far from true.
If you named that raid-level, say, "george"... I would need to consult
the docs, and I obviously would not expect any particular behavior. :)

regards,
Jan Vales
--
I only read plaintext emails.





Re: RAID56 status?

2017-01-22 Thread Waxhead

Hugo Mills wrote:


On Sun, Jan 22, 2017 at 11:35:49PM +0100, Christoph Anton Mitterer wrote:

On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote:

Therefore my question: what's the status of raid5/6 in btrfs?
Is it somehow "production"-ready by now?

AFAIK, what's on the - apparently no longer updated -
https://btrfs.wiki.kernel.org/index.php/Status still applies, and
RAID56 is not yet usable for anything near production.

It's still all valid. Nothing's changed.

How would you like it to be updated? "Nope, still broken"?

Hugo.

I risked updating the wiki to show kernel version 4.9 instead of 4.7
then...



Re: RAID56 status?

2017-01-22 Thread Christoph Anton Mitterer
On Sun, 2017-01-22 at 22:39 +, Hugo Mills wrote:
>    It's still all valid. Nothing's changed.
> 
>    How would you like it to be updated? "Nope, still broken"?

The kernel version mentioned there is 4.7... so no one (at least no
end users) really knows whether it's just no longer maintained or still
up-to-date with nothing changed... :(


Cheers,
Chris.



Re: RAID56 status?

2017-01-22 Thread Hugo Mills

On Sun, Jan 22, 2017 at 11:35:49PM +0100, Christoph Anton Mitterer wrote:
> On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote:
> > Therefore my question: what's the status of raid5/6 in btrfs?
> > Is it somehow "production"-ready by now?
> AFAIK, what's on the - apparently no longer updated -
> https://btrfs.wiki.kernel.org/index.php/Status still applies, and
> RAID56 is not yet usable for anything near production.

   It's still all valid. Nothing's changed.

   How would you like it to be updated? "Nope, still broken"?

   Hugo.

--
Hugo Mills | I went to a fight once, and an ice hockey match
hugo@... carfax.org.uk | broke out.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: RAID56 status?

2017-01-22 Thread Christoph Anton Mitterer
On Sun, 2017-01-22 at 22:22 +0100, Jan Vales wrote:
> Therefore my question: what's the status of raid5/6 in btrfs?
> Is it somehow "production"-ready by now?
AFAIK, what's on the - apparently no longer updated -
https://btrfs.wiki.kernel.org/index.php/Status still applies, and
RAID56 is not yet usable for anything near production.

Cheers,
Chris.
