Re: [patch 01/99] btrfs: Add btrfs_panic()

2011-11-23 Thread David Brown

On Wed, Nov 23, 2011 at 07:35:34PM -0500, Jeff Mahoney wrote:

As part of the effort to eliminate BUG_ON as an error handling
technique, we need to determine which errors are actual logic errors,
which are on-disk corruption, and which are normal runtime errors
e.g. -ENOMEM.

Annotating these error cases is helpful to understand and report them.

This patch adds a btrfs_panic() routine that will either panic
or BUG depending on the new -ofatal_errors={panic,bug} mount option.
Since there are still so many BUG_ONs, it defaults to BUG for now but I
expect that to change once the error handling effort has made
significant progress.


Any reason all of the commit text is indented in this series?

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 01/99] btrfs: Add btrfs_panic()

2011-11-23 Thread David Brown

On Wed, Nov 23, 2011 at 09:22:06PM -0500, Jeff Mahoney wrote:

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/23/2011 09:05 PM, David Brown wrote:

On Wed, Nov 23, 2011 at 07:35:34PM -0500, Jeff Mahoney wrote:

As part of the effort to eliminate BUG_ON as an error handling
technique, we need to determine which errors are actual logic
errors, which are on-disk corruption, and which are normal
runtime errors e.g. -ENOMEM.

Annotating these error cases is helpful to understand and report
them.

This patch adds a btrfs_panic() routine that will either panic or
BUG depending on the new -ofatal_errors={panic,bug} mount
option. Since there are still so many BUG_ONs, it defaults to BUG
for now but I expect that to change once the error handling
effort has made significant progress.


Any reason all of the commit text is indented in this series?


Our internal patches have a bunch of RFC822-style headers associated
with them. For me, indenting the body is a style thing. I like having
the body appear separate from the headers.


Probably best not to; it makes them inconsistent with the rest of the
kernel's history when imported into git.  The body becomes the commit
text directly.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 01/99] btrfs: Add btrfs_panic()

2011-11-24 Thread David Brown

On Thu, Nov 24, 2011 at 09:36:55PM -0500, Jeff Mahoney wrote:


Probably best not to; it makes them inconsistent with the rest of
the kernel's history when imported into git.  The body becomes the
commit text directly.


I'll change them to do this since you're obviously correct. You're the
first person in 10+ years to notice (or at least comment on) it,
though. ;)


Some maintainers (and Linus as well) appear to have just fixed them
up when applying the patches.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfsprogs source code

2012-01-04 Thread David Brown

On Tue, Jan 03, 2012 at 01:05:07PM -0500, Calvin Walton wrote:


The best way to get the btrfs-progs source is probably via git; Chris
Mason's repository for it can be found at
http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git


Chris,

The wiki at
https://btrfs.wiki.kernel.org/articles/b/t/r/Btrfs_source_repositories.html
still refers to a btrfs-progs-unstable.git repository, which is not
present at git.kernel.org.  Should we update this wiki, or do you have
plans to push an unstable repository again?

Thanks,
David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Odd behavior of subvolume find-new

2012-01-09 Thread David Brown

I've been creating some time-based snapshots, e.g.

 # btrfs subvolume snapshot @root 2012-01-09-@root

After some changes, I wanted to see what had changed, so I tried:

 # btrfs subvolume find-new @root 2012-01-09-@root
 transid marker was 37

which doesn't print anything out.  Curiously, if I make a snapshot of
the snapshot, then I get output from the delta:

 # btrfs subvolume snapshot 2012-01-09-@root tmp
 # btrfs subvolume find-new @root tmp
 . lots of output .

I haven't seen this behavior on other filesystems or subvolumes.

My intent was to filter through the small script below to compute the
size of the delta.

Thanks,
David

#! /usr/bin/perl

# Process the output of btrfs subvolume find-new and print out the
# size used by the new data.  Doesn't show delta in metadata, only the
# data itself.
use strict;

my $bytes = 0;
while (<>) {
    if (/ len (\d+) /) {
        $bytes += $1;
    }
}
printf "%d bytes\n", $bytes;
printf "%.1f MByte\n", $bytes / 1024.0 / 1024.0;
printf "%.1f GByte\n", $bytes / 1024.0 / 1024.0 / 1024.0;

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] Btrfs: silence a compiler warning

2012-02-22 Thread David Brown

On Wed, Feb 22, 2012 at 10:30:55AM +0300, Dan Carpenter wrote:

Gcc warns that "ret" can be used uninitialized.  It can't actually be
used uninitialized because btrfs_num_copies() always returns 1 or more.

Signed-off-by: Dan Carpenter 

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 064b29b..c053e90 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -643,7 +643,7 @@ static struct btrfsic_dev_state *btrfsic_dev_state_hashtable_lookup(
static int btrfsic_process_superblock(struct btrfsic_state *state,
  struct btrfs_fs_devices *fs_devices)
{
-   int ret;
+   int ret = 0;


Does

int uninitialized_var(ret);

work?  The assignment to zero actually generates additional
(unnecessary) code.
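
For reference, a minimal sketch of how the kernel's uninitialized_var()
annotation worked for gcc builds at the time (my recollection of
include/linux/compiler-gcc.h, not a copy of it) - the self-assignment
silences the "may be used uninitialized" warning without emitting the
extra store that "= 0" does:

/* Sketch only - not the actual kernel header or the btrfs function. */
#define uninitialized_var(x) x = x

static int num_copies(void)             /* stand-in for btrfs_num_copies() */
{
        return 1;                       /* always returns 1 or more */
}

int process_superblock_sketch(void)
{
        int uninitialized_var(ret);     /* expands to: int ret = ret; */
        int i;

        for (i = 0; i < num_copies(); i++)
                ret = i;                /* assigned at least once before use */
        return ret;
}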

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: converting one-disk btrfs into RAID-1?

2010-10-12 Thread David Brown

On 11/10/2010 19:06, Chris Ball wrote:

Hi,

>  Is it possible to turn a 1-disk (partition) btrfs filesystem into
>  RAID-1?

Not yet, but I'm pretty sure it's on the roadmap.

- Chris.


Is it possible to view the raid levels of data and metadata for an 
existing btrfs filesystem?  It's easy to pick them when creating the 
filesystem, but I couldn't find any way to view them afterwards.



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: converting one-disk btrfs into RAID-1?

2010-10-12 Thread David Brown

On 12/10/2010 11:34, Tomasz Torcz wrote:

On Tue, Oct 12, 2010 at 11:32:07AM +0200, David Brown wrote:

On 11/10/2010 19:06, Chris Ball wrote:

Hi,

>   Is it possible to turn a 1-disk (partition) btrfs filesystem into
>   RAID-1?

Not yet, but I'm pretty sure it's on the roadmap.

- Chris.


Is it possible to view the raid levels of data and metadata for an
existing btrfs filesystem?  It's easy to pick them when creating the
filesystem, but I couldn't find any way to view them afterwards.


   "btrfs f df" will show them, except for few kernel releases when the ioctl()
was broken.



I guess it's time to update my System Rescue CD to a version with the 
btrfs command rather than just btrfsctl etc.  Thanks!



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Replacing the top-level root

2010-10-26 Thread David Brown

On Mon, Oct 25, 2010 at 03:20:58PM -0500, C Anthony Risinger wrote:


For example, right now extlinux support booting btrfs, but _only_ from
the top-level root.  if i just had a way to "swap" the top-level root
with a different subvol, i could overcome several problems i have with
users all at once:

*) users install their system to the top-level root, which means it is
no longer manageable by snapshot scripts [currently]
*) if the top-level root could be swapped, extlinux could then boot my
snapshot? (i'm probably wrong here)


I don't think this is a solution to the extlinux problem, but I've
moved roots into new subvolumes, basically something like this.

Root is mounted as /; I've also mounted the volume on /mounted in this
example.

  # btrfs subvolume snapshot /mounted /mounted/newrootname

Now reboot, adding the subvol option to use the newrootname.  


Go into /mounted and make sure files touched there don't show up in '/'
(we really are mounting the submount).

Then just use rm -rf to remove everything that isn't a subvol.  I
don't know of an easy way to do that, and be careful.

This doesn't really change the default root, but by making a snapshot
of it, you can move all of the data elsewhere.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-19 Thread David Brown
On 19/11/13 00:25, H. Peter Anvin wrote:
> On 11/18/2013 02:35 PM, Andrea Mazzoleni wrote:
>> Hi Peter,
>>
>> The Cauchy matrix has the mathematical property that it and all of
>> its submatrices are always non-singular. So, we are sure that we can
>> always solve the equations to recover the data disks.
>>
>> Besides the mathematical proof, I've also inverted all the
>> 377,342,351,231 possible submatrices for up to 6 parities and 251 data
>> disks, and got an experimental confirmation of this.
>>
> 
> Nice.
> 
>>
>> The only limit comes from GF(2^8). You have a maximum number of
>> disks = 2^8 + 1 - number_of_parities. For example, with 6 parities,
>> you can have no more than 251 data disks. Over this limit it's not
>> possible to build a Cauchy matrix.
>>
> 
> 251?  Not 255?
> 
>> Note that with a Vandermonde matrix you instead don't have the
>> guarantee that all the submatrices are always non-singular. This is
>> the reason why, using power coefficients, sooner or later you end up
>> with unsolvable equations.
>>
>> You can find the code that generates the Cauchy matrix, with some
>> explanation in the comments, at (see the set_cauchy() function):
>>
>> http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c
> 
> OK, need to read up on the theoretical aspects of this, but it sounds
> promising.
> 
>   -hpa
> 

Hi all,

A while back I worked through the maths for a method of extending raid
to multiple parities, though I never got as far as implementing it in
code (other than some simple Python test code to confirm the maths).  It
is also missing the maths for simplified ways to recover data.  I've
posted a couple of times with this on the linux-raid mailing list (as
linked in this thread) - there has certainly been some interest, but
it's not easy to turn interest into hard work!

I used an obvious expansion on the existing RAID5 and RAID6 algorithms,
with parity P_n being generated from powers of 2^n.  This means that the
triple-parity version can be implemented by simply applying the RAID6
operations twice.  For a triple parity, this works well - the matrices
involved are all invertible up to 255 data disks.  Beyond that, however,
things drop off rapidly - quad parity implemented in the same way only
supports 21 data disks, and for five parity disks you need to use 0x20
(skipping 0x10) to get even 8 data disks.

This means that my method would be fine for triple parity, and would
also be efficient in implementation.
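
To make that concrete, here is a rough sketch (plain C, no SIMD, my own
illustration rather than the md code) of how P/Q/R syndromes fall out of
Horner's rule when each extra parity just adds a multiply-by-2^n step in
GF(2^8):

/* Illustration only, not the md implementation: generate P (plain XOR),
 * Q (powers of 2) and R (powers of 4) syndromes over ndisks data blocks
 * of len bytes, using Horner's rule.  gf_mul2() is the usual bytewise
 * GF(2^8) multiply-by-2 with the RAID6 polynomial 0x11d; applying it
 * twice gives the multiply-by-4 step for the third parity.
 */
#include <stddef.h>
#include <stdint.h>

static uint8_t gf_mul2(uint8_t x)
{
        return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0x00));
}

static void gen_pqr(size_t ndisks, size_t len, const uint8_t **data,
                    uint8_t *p, uint8_t *q, uint8_t *r)
{
        for (size_t i = 0; i < len; i++) {
                uint8_t vp = 0, vq = 0, vr = 0;

                /* Horner's rule: walk the disks from last to first,
                 * multiplying the running Q by 2 and R by 4 each step. */
                for (size_t d = ndisks; d-- > 0; ) {
                        vq = gf_mul2(vq);
                        vr = gf_mul2(gf_mul2(vr));
                        vp ^= data[d][i];
                        vq ^= data[d][i];
                        vr ^= data[d][i];
                }
                p[i] = vp;
                q[i] = vq;
                r[i] = vr;
        }
}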

Beyond triple parity, the simple method has size limits for four parity
and is no use on anything bigger.  The Cauchy matrix method lets us go
beyond that (I haven't yet studied your code and your maths - I will do
so as soon as I have the chance, but I doubt if that will be before the
weekend).

Would it be possible to use the simple parity system for the first three
parities, and Cauchy beyond that?  That would give the best of both worlds.



The important thing to think about here is what would actually be useful
in the real world.  It is always nice to have a system that can make an
array with 251 data disks and 6 parities (and I certainly think the
maths involved is fun), but would anyone use such a beast?

Triple parity has clear use cases.  As people have moved up from raid5
to raid6, "raid7" or "raid6-3p" would be an obvious next step.  I also
see it as being useful for maintenance on raid6 arrays - if you want to
replace disks on a raid6 array you could first add a third parity disk
with an asymmetric layout, then you could replace the main disks while
keeping two disk redundancy at all times.

Quad parity is unlikely, I think - you would need a very wide array and
unusual requirements to make quad parity a better choice than a layered
system of raid10 or raid15.  At most, I think it would find use as a
temporary safeguard while maintaining a triple-parity array.  Remember also
that such an array would be painfully slow if it ever needed to rebuild
data with four missing disks - and if it is then too slow to be usable,
then quad parity is not a useful solution.


(Obviously anyone with /real/ experience with large arrays can give
better ideas here - I like the maths of multi-parity raid, but I will
not use it for my small arrays.)



Of course I will enjoy studying your maths here, and I'll try to give
some feedback on it.  But I think for implementation purposes, the
simple "powers of 4" generation of triple parity would be better than
using the Cauchy matrix - it is a clear step from the existing raid6,
and it can work fast on a wide variety of processors (people use ARMs
and other "small" cpus on raids, not just x86 with SSE3).  I believe
that would mean simpler code and fewer changes, which is always popular
with the kernel folk.

However, if it is not possible to use Cauchy matrices to get four and
more parity while keeping the same first three parities, then the
balance changes and a decision needs to be made - do we (the Linux
kernel developers, the btrfs deve

Re: Triple parity and beyond

2013-11-20 Thread David Brown
, yes.  It is only the parity generation that is
different - multiplying by powers of 2 means each step is a fast
multiply-by-two, with Horner's rule to avoid any other multiplication.
With parity 3 generated as powers of 4 or 2^-1, you have the same system
with only a slightly slower multiply-by-4 step.  With the Cauchy matrix,
you need general multiplication with different coefficients for each
disk block.  This is significantly more complex - but if it can be done
fast enough on at least a reasonable selection of processors, it's okay
to be complex.

> 
> Anyway, I cannot tell what is the best option for Linux RAID and Btrfs.
> There are for sure better qualified people in this list to say that.

I think H. Peter Anvin is the best qualified for such decisions - I
believe he has the most experience and understanding in this area.

For what it is worth, you have convinced /me/ that your Cauchy matrices
are the way to go.  I will want to study your code a bit more, and try
it out for myself, but it looks like you have a way to overcome the
limitations of the power sequence method without too great a runtime cost -
and that is exactly what we need.

> I can just say that systems using multiple parity levels do exist, and
> maybe also the Linux Kernel could benefit to have such kind of support.

I certainly think so.  I think 3 parities is definitely useful, and
sometimes four would be nice.  Beyond that, I suspect "coolness" and
bragging rights (btrfs can support more parities than ZFS...) will
outweigh real-life implementations, so it is important that the
implementation does not sacrifice anything on the triple parity in order
to get 5+ parity support.  It's fine for /us/ to look at fun solutions,
but it needs to be practical too if it is going to be accepted in the
kernel.

mvh.,

David

> 
> Here some examples:
> 
> Oracle/Sun, Dell/Compellent ZFS: 3 parity drives
> NEC HydraStor: 3 parity drives
> EMC/Isilon: 4 parity drives
> Amplidata: 4 parity drives
> CleverSafe: 6 parity drives
> StreamScale/BigParity: 7 parity drives
> 
> And Btrfs with six parities would be surely cool :)
> 
> Ciao,
> Andrea
> 
> On Tue, Nov 19, 2013 at 11:16 AM, David Brown  
> wrote:
>> On 19/11/13 00:25, H. Peter Anvin wrote:
>>> On 11/18/2013 02:35 PM, Andrea Mazzoleni wrote:
>>>> Hi Peter,
>>>>
>>>> The Cauchy matrix has the mathematical property that it and all of
>>>> its submatrices are always non-singular. So, we are sure that we can
>>>> always solve the equations to recover the data disks.
>>>>
>>>> Besides the mathematical proof, I've also inverted all the
>>>> 377,342,351,231 possible submatrices for up to 6 parities and 251 data
>>>> disks, and got an experimental confirmation of this.
>>>>
>>>
>>> Nice.
>>>
>>>>
>>>> The only limit comes from GF(2^8). You have a maximum number of
>>>> disks = 2^8 + 1 - number_of_parities. For example, with 6 parities,
>>>> you can have no more than 251 data disks. Over this limit it's not
>>>> possible to build a Cauchy matrix.
>>>>
>>>
>>> 251?  Not 255?
>>>
>>>> Note that with a Vandermonde matrix you instead don't have the
>>>> guarantee that all the submatrices are always non-singular. This is
>>>> the reason why, using power coefficients, sooner or later you end up
>>>> with unsolvable equations.
>>>>
>>>> You can find the code that generates the Cauchy matrix, with some
>>>> explanation in the comments, at (see the set_cauchy() function):
>>>>
>>>> http://sourceforge.net/p/snapraid/code/ci/master/tree/mktables.c
>>>
>>> OK, need to read up on the theoretical aspects of this, but it sounds
>>> promising.
>>>
>>>   -hpa
>>>
>>
>> Hi all,
>>
>> A while back I worked through the maths for a method of extending raid
>> to multiple parities, though I never got as far as implementing it in
>> code (other than some simple Python test code to confirm the maths).  It
>> is also missing the maths for simplified ways to recover data.  I've
>> posted a couple of times with this on the linux-raid mailing list (as
>> linked in this thread) - there has certainly been some interest, but
>> it's not easy to turn interest into hard work!
>>
>> I used an obvious expansion on the existing RAID5 and RAID6 algorithms,
>> with parity P_n being generated from powers of 2^n.  This means that the
>> triple-parity version can be implemented by simply applying the RAID6
>> operation

Re: Triple parity and beyond

2013-11-20 Thread David Brown
On 20/11/13 02:23, John Williams wrote:
> On Tue, Nov 19, 2013 at 4:54 PM, Chris Murphy 
> wrote:
>> If anything, I'd like to see two implementations of RAID 6 dual
>> parity. The existing implementation in the md driver and btrfs could
>> remain the default, but users could opt into Cauchy matrix based dual
>> parity which would then enable them an easy (and live) migration path
>> to triple parity and beyond.

Andrea's Cauchy matrix is compatible with the existing Raid6, so there
is no problem there.

I believe it would be a terrible idea to have an incompatible extension
- that would mean you could not have temporary extra parity drives with
asymmetrical layouts, which is something I see as a very useful feature.

> 
> Actually, my understanding is that Andrea's Cauchy matrix technique
> (call it C) is compatible with existing md RAID5 and RAID6 (call these
> A). It is only the non-SSSE3 triple-parity algorithm 2^-1 (call it B)
> that is incompatible with his Cauchy matrix technique.
> 
> So, you can have:
> 
> 1) A+B
> 
> or
> 
> 2) A+C
> 
> But you cannot have A+B+C

Yes, that's right.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-20 Thread David Brown
On 19/11/13 19:12, Piergiorgio Sartor wrote:
> On Mon, Nov 18, 2013 at 11:08:59PM +0100, Andrea Mazzoleni wrote:



> 
> Hi Andrea,
> 
> great job, this was exactly what I was looking for.
> 
> Do you know if there is a "fast" way not to correct
> errors, but to find them?
> 
> In RAID-6 (as per raid6check) there is an easy way
> to verify where an HDD has incorrect data.
> 

I think the way to do that is just to generate the parity blocks from
the data blocks, and compare them to the existing parity blocks.

> I suspect, for each 2 parity block it should be
> possible to find 1 error (and if this is true, then
> quad parity is more attractive than triple one).
> 
> Furthermore, my second (of first) target would
> be something like: http://www.symform.com/blog/tag/raid-96/
> 
> Which uses 32 parities (out of 96 "disks").

I believe Andrea's matrix is extensible as long as you have no more than
257 disks in total.  A mere 32 parities should not be a problem :-)

mvh.,

David


> 
> Keep going!!!
> 
> bye,
> 
> pg
> 
>>
>> Ciao,
>> Andrea
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 20/11/13 19:09, John Williams wrote:
> On Wed, Nov 20, 2013 at 2:31 AM, David Brown  wrote:
>> That's certainly a reasonable way to look at it.  We should not limit
>> the possibilities for high-end systems because of the limitations of
>> low-end systems that are unlikely to use 3+ parity anyway.  I've also
>> looked up a list of the processors that support SSE3 and PSHUFB - a lot
>> of modern "low-end" x86 cpus support it.  And of course it is possible
>> to implement general G(2^8) multiplication without PSHUFB, using a
>> lookup table - it is important that this can all work with any CPU, even
>> if it is slow.
> 
> Unfortunately, it is SSSE3 that is required for PSHUFB. The SSE3 set
> with only two-esses does not suffice. I made that same mistake when I
> first heard about Andrea's 6-parity work. SSSE3 vs. SSE3, confusing
> notation!
> 
> SSSE3 is significantly less widely supported than SSE3. Particularly
> on AMD, only the very latest CPUs seem to support SSSE3. Intel support
> for SSSE3 goes back much further than AMD support.
> 
> Maybe it is not such a big problem, since it may be possible to
> support two "roads". Both roads would include the current md RAID-5
> and RAID-6. But one road, which those lacking CPUs supporting SSSE3
> might choose, would continue on to the non-SSSE3 triple-parity 2^-1
> technique, and then dead-end. The other road would continue with the
> Cauchy matrix technique through 3-parity all the way to 6-parity.
> 
> It might even be feasible to allow someone stuck at the end of the
> non-SSSE3 road to convert to the Cauchy road. You would have to go
> through all the 2^-1 triple-parity and convert it to Cauchy
> triple-parity. But then you would be safely on the Cauchy road.
> 

I would not like to see two alternative triple-parity solutions - I
think that would lead to confusion, and a non-Cauchy triple parity would
not be extendible without a rebuild (I've talked before about the idea
of temporarily adding an extra parity drive with an asymmetric layout.
I really like the idea, so I keep pushing for it!).

I think it is better to accept that 3+ parity will be slow on processors
that don't support PSHUFB.  We should try to find the best alternative
SIMD for other realistic processors (such as on AMD chips without
PSHUFB, ARMs with NEON, PPC with Altivec, etc.) - but a simple table
lookup will always work as a fallback.  Other than that I think it is
fair to say that if you want /fast/ 3+ parity, you need a reasonably
modern non-budget-class cpu.
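
As an illustration of that fallback (my own sketch, not taken from any
existing implementation), general GF(2^8) multiplication can be done with
log/exp tables built once at start-up - two lookups and an addition per
byte, slow next to PSHUFB but workable on any CPU:

/* Table-lookup fallback for general GF(2^8) multiplication, using the
 * RAID6 polynomial 0x11d and 0x02 as generator.  Illustration only.
 */
#include <stdint.h>

static uint8_t gf_log[256];
static uint8_t gf_exp[512];

static void gf_init(void)
{
        uint16_t x = 1;

        for (int i = 0; i < 255; i++) {
                gf_exp[i] = (uint8_t)x;
                gf_exp[i + 255] = (uint8_t)x;   /* saves a modulo in gf_mul() */
                gf_log[x] = (uint8_t)i;
                x <<= 1;
                if (x & 0x100)
                        x ^= 0x11d;             /* reduce by the field polynomial */
        }
}

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
        if (a == 0 || b == 0)
                return 0;
        return gf_exp[gf_log[a] + gf_log[b]];
}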


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 20/11/13 19:34, Andrea Mazzoleni wrote:
> Hi David,
> 
>>> The choice of ZFS to use powers of 4 was likely not optimal,
>>> because to multiply by 4, it has to do two multiplications by 2.
>> I can agree with that.  I didn't copy ZFS's choice here
> David, it was not my intention to suggest that you copied from ZFS.
> Sorry to have expressed myself badly. I just mentioned ZFS because it's
> an implementation that I know uses powers of 4 to generate triple
> parity, and I saw in the code that it's implemented with two multiplication
> by 2.
> 

Andrea, I didn't take your comment as an accusation of any kind - there
is no need for any kind of apology!  It was merely a statement of
fact - I picked powers of 4 as an obvious extension of the powers of 2
in raid6, and found it worked well.

And of course, in the open source world, copying of code and ideas is a
good thing - there is no point in re-inventing the wheel unless we can
invent a better one.  Really, I /should/ have read the ZFS
implementation and copied it!

mvh.,

David


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 21/11/13 02:28, Stan Hoeppner wrote:
> On 11/20/2013 10:16 AM, James Plank wrote:
>> Hi all -- no real comments, except as I mentioned to Ric, my tutorial
>> in FAST last February presents Reed-Solomon coding with Cauchy
>> matrices, and then makes special note of the common pitfall of
>> assuming that you can append a Vandermonde matrix to an identity
>> matrix.  Please see
>> http://web.eecs.utk.edu/~plank/plank/papers/2013-02-11-FAST-Tutorial.pdf,
>> slides 48-52.
>>
>> Andrea, does the matrix that you included in an earlier mail (the one
>> that has Linux RAID-6 in the first two rows) have a general form, or
>> did you develop it in an ad hoc manner so that it would include Linux
>> RAID-6 in the first two rows?
> 
> Hello Jim,
> 
> It's always perilous to follow a Ph.D., so I guess I'm feeling suicidal
> today. ;)
> 
> I'm not attempting to marginalize Andrea's work here, but I can't help
> but ponder what the real value of triple parity RAID is, or quad, or
> beyond.  Some time ago parity RAID's primary mission ceased to be
> surviving single drive failure, or a 2nd failure during rebuild, and
> became mitigating UREs during a drive rebuild.  So we're now talking
> about dedicating 3 drives of capacity to avoiding disaster due to
> platter defects and secondary drive failure.  For small arrays this is
> approaching half the array capacity.  So here parity RAID has lost the
> battle with RAID10's capacity disadvantage, yet it still suffers the
> vastly inferior performance in normal read/write IO, not to mention
> rebuild times that are 3-10x longer.
> 
> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
> to mirror a drive at full streaming bandwidth, assuming 300MB/s
> average--and that is probably being kind to the drive makers.  With 6 or
> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
> minimum 72 hours or more, probably over 100, and probably more yet for
> 3P.  And with larger drive count arrays the rebuild times approach a
> week.  Whose users can go a week with degraded performance?  This is
> simply unreasonable, at best.  I say it's completely unacceptable.
> 
> With these gargantuan drives coming soon, the probability of multiple
> UREs during rebuild are pretty high.  Continuing to use ever more
> complex parity RAID schemes simply increases rebuild time further.  The
> longer the rebuild, the more likely a subsequent drive failure due to
> heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
> one failure mode we're increasing the probability of another.  TANSTAFL.
>  Worse yet, RAID10 isn't going to survive because UREs on a single drive
> are increasingly likely with these larger drives, and one URE during
> rebuild destroys the array.
> 

I don't think the chances of hitting an URE during rebuild are dependent
on the rebuild time - merely on the amount of data read during rebuild.
URE rates are "per byte read" rather than "per unit time", are they not?

I think you are overestimating the rebuild times a bit, but there is no
arguing that rebuild on parity raids is a lot more work (for the cpu,
the IO system, and the disks) than for mirror raids.

> I think people are going to have to come to grips with using more and
> more drives simply to brace the legs holding up their arrays; comes to
> grips with these insane rebuild times; or bite the bullet they so
> steadfastly avoided with RAID10.  Lots more spindles solves problems,
> but at a greater cost--again, no free lunch.
> 
> What I envision is an array type, something similar to RAID 51, i.e.
> striped parity over mirror pairs.  In the case of Linux, this would need
> to be a new distinct md/RAID level, as both the RAID5 and RAID1 code
> would need enhancement before being meshed together into this new level[1].

Shouldn't we be talking about RAID 15 here, rather than RAID 51?  I
interpret "RAID 15" to be like "RAID 10" - a raid5 set of raid1 mirrors,
while "RAID 51" would be a raid1 mirror of raid5 sets.  I am certain
that you mean a raid5 set of raid1 pairs - I just think you've got the
name wrong.

> 
> Potential Advantages:
> 
> 1.  Only +1 disk capacity overhead vs RAID 10, regardless of drive count

+2 disks (the raid5 parity "disk" is a raid1 pair)

> 2.  Rebuild time is the same as RAID 10, unless a mirror pair is lost
> 3.  Parity is only used during rebuild if/when a URE occurs, unless ^
> 4.  Single drive failure doesn't degrade the parity array, multiple
> failures in different mirrors doesn't degrade the parity array
> 5.  Can sustain a minimum of 3 simultaneous drive failures--both drives
> in one mirror and one drive in another mirror
> 6.  Can lose a maximum of 1/2 of the drives plus 1 drive--one more than
> RAID 10.  Can lose half the drives and still not degrade parity,
> if no two comprise one mirror
> 7.  Similar or possibly better read throughput vs triple parity RAID
> 8.  Superior write performance with drives down
> 9.  Vastly 

Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 20/11/13 22:59, Piergiorgio Sartor wrote:
> On Wed, Nov 20, 2013 at 11:44:39AM +0100, David Brown wrote:
> [...]
>>> In RAID-6 (as per raid6check) there is an easy way
>>> to verify where an HDD has incorrect data.
>>>
>>
>> I think the way to do that is just to generate the parity blocks from
>> the data blocks, and compare them to the existing parity blocks.
> 
> Uhm, the generic RS decoder should try all
> the possible combinations of erasures and so
> detect the error.
> This is unfeasible already with 3 parities,
> so there are faster algorithms, I believe:
> 
> Peterson–Gorenstein–Zierler algorithm
> Berlekamp–Massey algorithm
> 
> Nevertheless, I do not know too much about
> those, so I cannot state if they apply to
> the Cauchy matrix as explained here.
> 
> bye,
> 

Ah, you are trying to find which disk has incorrect data so that you can
change just that one disk?  There are dangers with that...

<http://neil.brown.name/blog/20100211050355>

If you disagree with this blog post (and I urge you to read it in full
first), then this is how I would do a "smart" stripe recovery:


First calculate the parities from the data blocks, and compare these
with the existing parity blocks.

If they all match, the stripe is consistent.

Normal (detectable) disk errors and unrecoverable read errors get
flagged by the disk and the IO system, and you /know/ there is a problem
with that block.  Whether it is a data block or a parity block, you
re-generate the correct data and store it - that's what your raid is for.

If you have no detected read errors, and there is one parity
inconsistency, then /probably/ that block has had an undetected read
error, or it simply has not been written completely before a crash.
Either way, just re-write the correct parity.

If there are two or more parity inconsistencies, but not all parities
are in error, then you either have multiple disk or block failures, or
you have a partly-written stripe.  Any attempts at "smart" correction
will almost certainly be worse than just re-writing the new parities and
hoping that the filesystem's journal works.

If all the parities are inconsistent, then the "smart" thing is to look
for a single incorrect disk block.  Just step through the blocks one by
one - assume that block is wrong and replace it (in temporary memory,
not on disk!) with a recovered version from the other data blocks and
the parities (only the first parity is needed).  Re-calculate the other
parities and compare.  If the other parities now match, then you have
found a single inconsistent data block.  It /may/ be a good idea to
re-write this - or maybe not (see the blog post linked above).

If you don't find any single data blocks that can be "corrected" in this
way, then re-writing the parity blocks to match the disk data is
probably the least harmful fix.
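
Sketched in code (a two-parity P/Q illustration with in-memory buffers and
hypothetical naming, not a general multi-parity implementation), the
single-bad-block probe described above might look roughly like this:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint8_t gf_mul2(uint8_t x)
{
        return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1d : 0x00));
}

/* RAID6-style P (plain XOR) and Q (powers of 2) over ndata blocks. */
static void calc_pq(size_t ndata, size_t len, uint8_t **data,
                    uint8_t *p, uint8_t *q)
{
        for (size_t i = 0; i < len; i++) {
                uint8_t vp = 0, vq = 0;

                for (size_t d = ndata; d-- > 0; ) {
                        vq = gf_mul2(vq);
                        vp ^= data[d][i];
                        vq ^= data[d][i];
                }
                p[i] = vp;
                q[i] = vq;
        }
}

/* Returns -1 if the stripe is consistent, -2 if the safest action is just
 * to rewrite the mismatching parity block(s), or the index of the single
 * data block whose "correction" would make the whole stripe consistent.
 * tmp, tp and tq are caller-supplied scratch buffers of len bytes each.
 */
static int check_stripe(size_t ndata, size_t len, uint8_t **data,
                        const uint8_t *p, const uint8_t *q,
                        uint8_t *tmp, uint8_t *tp, uint8_t *tq)
{
        calc_pq(ndata, len, data, tp, tq);
        bool p_ok = memcmp(tp, p, len) == 0;
        bool q_ok = memcmp(tq, q, len) == 0;

        if (p_ok && q_ok)
                return -1;      /* stripe is consistent */
        if (p_ok != q_ok)
                return -2;      /* one parity mismatch: just rewrite it */

        /* All parities mismatch: probe for a single bad data block by
         * rebuilding each block in turn from P, then checking against Q. */
        for (size_t d = 0; d < ndata; d++) {
                memcpy(tmp, data[d], len);              /* save the original */
                memcpy(data[d], p, len);
                for (size_t o = 0; o < ndata; o++)
                        if (o != d)
                                for (size_t i = 0; i < len; i++)
                                        data[d][i] ^= data[o][i];
                calc_pq(ndata, len, data, tp, tq);
                bool hit = memcmp(tq, q, len) == 0;
                memcpy(data[d], tmp, len);              /* put it back */
                if (hit)
                        return (int)d;  /* likely single incorrect block */
        }
        return -2;      /* no single-block explanation: rewrite parities */
}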


Remember, this is not a general error detection and correction scheme -
it is a system targeted for a particular type of use, with particular
patterns of failure and failure causes, and particular mechanisms on top
(journalled file systems) to consider.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 21/11/13 10:54, Adam Goryachev wrote:
> On 21/11/13 20:07, David Brown wrote:
>> I can see plenty of reasons why raid15 might be a good idea, and even
>> raid16 for 5 disk redundancy, compared to multi-parity sets.  However,
>> it costs a lot in disk space.  For example, with 20 disks at 1 TB each,
>> you can have:
>>
>> raid5 = 19TB, 1 disk redundancy
>> raid6 = 18TB, 2 disk redundancy
>> raid6.3 = 17TB, 3 disk redundancy
>> raid6.4 = 16TB, 4 disk redundancy
>> raid6.5 = 15TB, 5 disk redundancy
>>
>> raid10 = 10TB, 1 disk redundancy
>> raid15 = 8TB, 3 disk redundancy
>> raid16 = 6TB, 5 disk redundancy
>>
>>
>> That's a very significant difference.
>>
>> Implementing 3+ parity does not stop people using raid15, or similar
>> schemes - it just adds more choice to let people optimise according to
>> their needs.
> BTW, as far as strange RAID type options to try and get around problems
> with failed disks, before I learned about timeout mismatches, I was
> pretty worried when my 5 disk RAID5 kept falling apart and losing a
> random member, then adding the failed disk back would work perfectly. To
> help me feel better about this, I used 5 x 500GB drives in RAID5 and
> then used the RAID5 + 1 x 2TB drive in RAID1, meaning I could afford to
> lose any two disks without losing data. Of course, now I know RAID6
> might have been a better choice, or even simply 2 x 2TB drives in RAID1 :)
> 
> In any case, I'm not sure I understand the concern with RAID 7.X (as it
> is being called, where X > 2). Certainly you will need to make 1
> computation for each stripe being written, for each value of X, so RAID
> 7.5 with 5 disk redundancy means 5 calculations for each stripe being
> written. However, given that drives are getting bigger every year, did
> we forget that we are also getting faster CPU and also more cores in a
> single "CPU package"?
> 

This is all true.  And md code is getting better at using more cores
under more circumstances, making the parity calculations more efficient.

The speed concern (which was Stan's, rather than mine) is more about
recovery and rebuild.  If you have a layered raid with raid1 pairs at
the bottom level, then recovery and rebuild (from a single failure) is
just a straight copy from one disk to another - you don't get faster
than that.  If you have a 20 + 3 parity raid, then rebuilding requires
reading a stripe from 20 disks and writing to 1 disk - that's far more
effort and is likely to take more time unless your IO system can handle
full bandwidth of all the disks simultaneously.

Similarly, performance of the array while rebuilding or degraded is much
worse for parity raids than for raids on top of raid1 pairs.

How that matters to you, and how it balances with the space costs, is up
to you and your application.

> On a pure storage server, the CPU would normally have nothing to do,
> except a little interrupt handling, it is just shuffling bytes around.
> Of course, if you need RAID7.5 then you probably have a dedicated
> storage server, so I don't see the problem with using the CPU to do all
> the calculations.
> 
> Of course, if you are asking about carbon emissions, and cooling costs
> in the data center, this could (on a global scale) have a significant
> impact, so maybe it is a bad idea after all :)
> 
> Regards,
> Adam
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 21/11/13 21:05, Piergiorgio Sartor wrote:
> On Thu, Nov 21, 2013 at 11:13:29AM +0100, David Brown wrote:
> [...]
>> Ah, you are trying to find which disk has incorrect data so that you can
>> change just that one disk?  There are dangers with that...
> 
> Hi David,
> 
>> <http://neil.brown.name/blog/20100211050355>
> 
> I think we already did the exercise, here :-)
> 
>> If you disagree with this blog post (and I urge you to read it in full
> 
> We discussed the topic (with Neil) and, if I
> recall correctly, he is against having an
> _automatic_ error detection and correction _in_
> kernel.
> I fully agree with that: user space is better
> and it should not be automatic, but it should
> do things under user control.
> 

OK.

> The current "check" operetion is pretty poor.
> It just reports how many mismatches, it does
> not even report where in the array.
> The first step, independent from how many
> parities one has, would be to tell the user
> where the mismatches occurred, so it would
> be possible to check the FS at that position.

Certainly it would be good to give the user more information.  If you
can tell the user where the errors are, and what the likely failed block
is, then that would be very useful.  If you can tell where it is in the
filesystem (such as which file, if any, owns the blocks in question)
then that would be even better.

> Having a multi parity RAID allows to check
> even which disk.
> This would provide the user with a more
> comprehensive (I forgot the spelling)
> information.
> 
> Of course, since we are there, we can
> also give the option to fix it.
> This would be much likely a "fsck".

If this can all be done to give the user an informed choice, then it
sounds good.

One issue here is whether the check should be done with the filesystem
mounted and in use, or only off-line.  If it is off-line then it will
mean a long down-time while the array is checked - but if it is online,
then there is the risk of confusing the filesystem and caches by
changing the data.

> 
>> first), then this is how I would do a "smart" stripe recovery:
>>
>> First calculate the parities from the data blocks, and compare these
>> with the existing parity blocks.
>>
>> If they all match, the stripe is consistent.
>>
>> Normal (detectable) disk errors and unrecoverable read errors get
>> flagged by the disk and the IO system, and you /know/ there is a problem
>> with that block.  Whether it is a data block or a parity block, you
>> re-generate the correct data and store it - that's what your raid is for.
> 
> That's not always the case, otherwise
> having the mismatch count would be useless.
> The issue is that errors appear, whatever
> the reason, without being reported by the
> underlying hardware.
>  

(I know you know how this works, so I am not trying to be patronising
with this explanation - I just think we have slightly misunderstood what
the other is saying, so spelling it out will hopefully make it clearer.)

Most disk errors /are/ detectable, and are reported by the underlying
hardware - small surface errors are corrected by the disk's own error
checking and correcting mechanisms, and larger errors are usually
detected.  It is (or should be!) very rare that a read error goes
undetected without there being a major problem with the disk controller.
 And if the error is detected, then the normal raid processing kicks in
as there is no doubt about which block has problems.

>> If you have no detected read errors, and there is one parity
>> inconsistency, then /probably/ that block has had an undetected read
>> error, or it simply has not been written completely before a crash.
>> Either way, just re-write the correct parity.
> 
> Why re-write the parity if I can get
> the correct data there?
> If I can be sure that one data block is
> incorrect and I can re-create it properly,
> that's the thing to do.

If you can be /sure/ about which data block is incorrect, then I agree -
but you can't be /entirely/ sure.  But I agree that you can make a good
enough guess to recommend a fix to the user - as long as it is not
automatic.

>  
>> Remember, this is not a general error detection and correction scheme -
> 
> It is not, but it could be. For free.
> 

For most ECC schemes, you know that all your blocks are set
synchronously - so any block that does not fit in is an error.  With
raid, it could also be that a stripe is only partly written - you can
have two different valid sets of data mixed to give an inconsistent
stripe, without any good way of telling what consistent data is the best
choice.

Perhaps a checking tool can take advantage of a write-intent bitmap (if
there is one) so that it knows if an inconsistent stripe is partly
updated or the result of a disk error.

mvh.,

David


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 21/11/13 21:52, Piergiorgio Sartor wrote:
> Hi David,
> 
> On Thu, Nov 21, 2013 at 09:31:46PM +0100, David Brown wrote:
> [...]
>> If this can all be done to give the user an informed choice, then it
>> sounds good.
> 
> that would be my target.
> To _offer_ more options to the (advanced) user.
> It _must_ always be under user control.
> 
>> One issue here is whether the check should be done with the filesystem
>> mounted and in use, or only off-line.  If it is off-line then it will
>> mean a long down-time while the array is checked - but if it is online,
>> then there is the risk of confusing the filesystem and caches by
>> changing the data.
> 
> Currently, "raid6check" can work with FS mounted.
> I got the suggestion from Neil (of course).
> It is possible to lock one stripe and check it.
> This should be, at any given time, consistent
> (that is, the parity should always match the data).
> If an error is found, it is reported.
> Again, the user can decide to fix it or not,
> considering all the FS consequences and so on.
> 

If you can lock stripes, and make sure any old data from that stripe is
flushed from the caches (if you change it while locked), then that
sounds ideal.

>> Most disk errors /are/ detectable, and are reported by the underlying
>> hardware - small surface errors are corrected by the disk's own error
>> checking and correcting mechanisms, and larger errors are usually
>> detected.  It is (or should be!) very rare that a read error goes
>> undetected without there being a major problem with the disk controller.
>>  And if the error is detected, then the normal raid processing kicks in
>> as there is no doubt about which block has problems.
> 
> That's clear. That case is an "erasure" (I think)
> and it is perfectly in line with the usual operation.
> I'm not trying to replace this mechanism.
>  
>> If you can be /sure/ about which data block is incorrect, then I agree -
>> but you can't be /entirely/ sure.  But I agree that you can make a good
>> enough guess to recommend a fix to the user - as long as it is not
>> automatic.
> 
> One typical case is when many errors are
> found, belonging to the same disk.
> This case clearly shows the disk is to be
> replaced or the interface checked...
> But, again, the user is the master, not the
> machine... :-)

I don't know what sort of interface you have for the user, but I guess
that means you'll have to collect a number of failures before showing
them so that the user can see the correlation on disk number.

>  
>> For most ECC schemes, you know that all your blocks are set
>> synchronously - so any block that does not fit in, is an error.  With
>> raid, it could also be that a stripe is only partly written - you can
> 
> Could it be?
> I would consider this an error.

It could occur as the result of a failure of some sort (kernel crash,
power failure, temporary disk problem, etc.).  More generally, md raid
doesn't have to be on local physical disks - maybe one of the "disks" is
an iSCSI drive or something else over a network that could have failures
or delays.  I haven't thought through all cases here - I am just
throwing them out as possibilities that might cause trouble.

> The stripe must always be consistent, there
> should be a transactional mechanism to make
> sure that, if read back, the data is always
> matching the parity.
> When I write "read back" I mean from whatever
> the data is: physical disk or cache.
> Otherwise, the check must run exclusively on
> the array (no mounted FS, no other things
> running on it).
> 
>> have two different valid sets of data mixed to give an inconsistent
>> stripe, without any good way of telling what consistent data is the best
>> choice.
>>  
>> Perhaps a checking tool can take advantage of a write-intent bitmap (if
>> there is one) so that it knows if an inconsistent stripe is partly
>> updated or the result of a disk error.
> 
> Of course, this is an option, which should be
> taken into consideration.
> 
> Any improvement idea is welcome!!!
> 
> bye,
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-21 Thread David Brown
On 22/11/13 01:30, Stan Hoeppner wrote:

> I don't like it either.  It's a compromise.  But as RAID1/10 will soon
> be unusable due to URE probability during rebuild, I think it's a
> relatively good compromise for some users, some workloads.

An alternative is to move to 3-way raid1 mirrors rather than 2-way
mirrors.  Obviously you take another hit in disk space efficiency, but
reads will be faster and you have extra redundancy.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-22 Thread David Brown
On 22/11/13 09:13, Stan Hoeppner wrote:
> Hi David,
> 
> On 11/21/2013 3:07 AM, David Brown wrote:
>> On 21/11/13 02:28, Stan Hoeppner wrote:
> ...
>>> WRT rebuild times, once drives hit 20TB we're looking at 18 hours just
>>> to mirror a drive at full streaming bandwidth, assuming 300MB/s
>>> average--and that is probably being kind to the drive makers.  With 6 or
>>> 8 of these drives, I'd guess a typical md/RAID6 rebuild will take at
>>> minimum 72 hours or more, probably over 100, and probably more yet for
>>> 3P.  And with larger drive count arrays the rebuild times approach a
>>> week.  Whose users can go a week with degraded performance?  This is
>>> simply unreasonable, at best.  I say it's completely unacceptable.
>>>
>>> With these gargantuan drives coming soon, the probability of multiple
>>> UREs during rebuild are pretty high.  Continuing to use ever more
>>> complex parity RAID schemes simply increases rebuild time further.  The
>>> longer the rebuild, the more likely a subsequent drive failure due to
>>> heat buildup, vibration, etc.  Thus, in our maniacal efforts to mitigate
>>> one failure mode we're increasing the probability of another.  TANSTAFL.
>>>  Worse yet, RAID10 isn't going to survive because UREs on a single drive
>>> are increasingly likely with these larger drives, and one URE during
>>> rebuild destroys the array.
> 
> 
>> I don't think the chances of hitting an URE during rebuild are dependent
>> on the rebuild time - merely on the amount of data read during rebuild.
> 
> Please read the above paragraph again, as you misread it the first time.

Yes, I thought you were saying that URE's were more likely during a
parity raid rebuild than during a mirror raid rebuild, because parity
rebuilds take longer.  They will be slightly more likely (due to more
mechanical stress on the drives), but only slightly.

> 
>>  URE rates are "per byte read" rather than "per unit time", are they not?
> 
> These are specified by the drive manufacturer, and they are per *bits*
> read, not "per byte read".  Current consumer drives are typically rated
> at 1 URE in 10^14 bits read, enterprise are 1 in 10^15.

"Per bit" or "per byte" makes no difference to the principle.

Just to get some numbers here, if we have a 20 TB drive (which doesn't yet
exist, AFAIK - 6 TB is the highest I have heard of) with a URE rate of 1 in
10^14, that means an average of 1.6 errors per read of the whole disk.

Assuming bit errors are independent (an unwarranted assumption, I know -
but it makes the maths easier!), an URE rate of 1 in 10^14 gives a chance of
3.3 * 10^-10 of an error in a 4 KB sector - and an 83% chance of getting
at least one incorrect sector read out of 20 TB.  Even if enterprise
disks have lower URE rates, I think it is reasonable to worry about
URE's during a raid1 rebuild!
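
For anyone who wants to reproduce the arithmetic, it is easy enough to
script (my sketch; it assumes independent bit errors, and whether you read
"20 TB" as decimal or binary terabytes shifts the final figure between
roughly 80% and 83%):

/* Back-of-envelope URE numbers for a full read of a large drive,
 * assuming independent bit errors.  Compile with -lm.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
        double ure_rate    = 1e-14;            /* errors per bit read      */
        double drive_bits  = 20e12 * 8.0;      /* 20 TB (decimal) in bits  */
        double sector_bits = 4096.0 * 8.0;     /* 4 KB sector in bits      */

        double expected  = drive_bits * ure_rate;
        double p_sector  = sector_bits * ure_rate;
        double n_sectors = drive_bits / sector_bits;
        double p_any     = 1.0 - pow(1.0 - p_sector, n_sectors);

        printf("expected UREs per full-drive read: %.2f\n", expected);
        printf("P(URE in one 4 KB sector):         %.2e\n", p_sector);
        printf("P(>=1 bad sector in a full read):  %.0f%%\n", 100.0 * p_any);
        return 0;
}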

The probability of hitting URE's on two disks at the same spot is, of
course, tiny (given that you've got one URE, the chance of a URE in the
same sector on another disk is 3.3 * 10^-10) - so two-disk redundancy
lets you have a disk failure and an URE safely.

In theory, mirror raids are safer here because you only need to worry
about a matching URE on /one/ disk.  If you have a parity array with 60
disks, the chance of a matching URE on one of the other disks is about 2 *
10^-8 - higher than for mirror raids, but still not a big concern.  (Of
course, you have more chance of a complete disk failure provoked by
stresses during rebuilds, but that's another failure mode.)


What does all this mean?  Single disk redundancy, like 2-way raid1
mirrors, is not going to be good enough for bigger disks unless the
manufacturers can get their URE rates significantly lower.  You will
need extra redundancy to be safe.  That means raid6 is a minimum, or
3-way mirrors, or stacked raids like raid15.  And if you want to cope
with a disk failure, a second disk failure due to the stresses of
rebuilding, /and/ an URE, then triple parity or raid15 is needed.


> 
>> I think you are overestimating the rebuild times a bit, but there is no
> 
> Which part?  A 20TB drive mirror taking 18 hours, or parity arrays
> taking many times longer than 18 hours?

The 18 hours for a 20 TB mirror sounds right - but that it takes 9 times
as long for a rebuild with a parity array sounds too much.  But I don't
have any figures as evidence.  And of course it varies depending on what
else you are doing with the array at the time - parity array rebuilds
will be affected much more by concurrent access to the array than
mirrored arrays.  It's all a balance - if you want cheaper space but
have less IO's and can tolerate slower

Re: Triple parity and beyond

2013-11-22 Thread David Brown
On 22/11/13 09:38, Stan Hoeppner wrote:
> On 11/21/2013 3:07 AM, David Brown wrote:
> 
>> For example, with 20 disks at 1 TB each, you can have:
> 
> All correct, and these are maximum redundancies.
> 
> Maximum:
> 
>> raid5 = 19TB, 1 disk redundancy
>> raid6 = 18TB, 2 disk redundancy
>> raid6.3 = 17TB, 3 disk redundancy
>> raid6.4 = 16TB, 4 disk redundancy
>> raid6.5 = 15TB, 5 disk redundancy
> 
> 
> These are not fully correct, because only the minimums are stated.  With
> any mirror based array one can lose half the disks as long as no two are
> in one mirror.  The probability of a pair failing together is very low,
> and this probability decreases even further as the number of drives in
> the array increases.  This is one of the many reasons RAID 10 has been
> so popular for so many years.
> 
> Minimum:
> 
>> raid10 = 10TB, 1 disk redundancy
>> raid15 = 8TB, 3 disk redundancy
>> raid16 = 6TB, 5 disk redundancy
> 
> Maximum:
> 
> RAID 10 = 10 disk redundancy
> RAID 15 = 11 disk redundancy

12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
mirrors of parity).

> RAID 16 = 12 disk redundancy

14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
mirrors of parity).


> 
> Range:
> 
> RAID 10 = 1-10 disk redundancy
> RAID 15 = 3-11 disk redundancy
> RAID 16 = 5-12 disk redundancy
> 
> 

Yes, I know these are the minimum redundancies.  But that's a vital
figure for reliability (even if the range is important for statistical
averages).  When one disk in a raid10 array fails, your main concern is
about failures or URE's in the other half of the pair - it doesn't help
to know that another nine disks can "safely" fail too.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-23 Thread David Brown

On 22/11/13 23:59, NeilBrown wrote:

On Fri, 22 Nov 2013 10:07:09 -0600 Stan Hoeppner 
wrote:




In the event of a double drive failure in one mirror, the RAID 1 code
will need to be modified in such a way as to allow the RAID 5 code to
rebuild the first replacement disk, because the RAID 1 device is still
in a failed state.  Once this rebuild is complete, the RAID 1 code will
need to switch the state to degraded, and then do its standard rebuild
routine for the 2nd replacement drive.

Or, with some (likely major) hacking it should be possible to rebuild
both drives simultaneously for no loss of throughput or additional
elapsed time on the RAID 5 rebuild.


Nah, that would be minor hacking.  Just recreate the RAID1 in a state that is
not-insync, but with automatic-resync disabled.
Then as continuous writes arrive, move the "recovery_cp" variable forward
towards the end of the array.  When it reaches the end we can safely mark the
whole array as 'in-sync' and forget about disabling auto-resync.

NeilBrown



Those were my thoughts here.  I don't know what state the planned "bitmap 
of non-sync regions" feature is in, but if and when it is implemented, 
you would just create the replacement raid1 pair without any 
synchronisation.  Any writes to the pair (such as during a raid5 
rebuild) would get written to both disks at the same time.


David


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Triple parity and beyond

2013-11-25 Thread David Brown
On 24/11/13 22:13, Stan Hoeppner wrote:
> On 11/23/2013 11:14 PM, John Williams wrote:
>> On Sat, Nov 23, 2013 at 8:03 PM, Stan Hoeppner  
>> wrote:
>>
>>> Parity array rebuilds are read-modify-write operations.  The main
>>> difference from normal operation RMWs is that the write is always to the
>>> same disk.  As long as the stripe reads and chunk reconstruction outrun
>>> the write throughput then the rebuild speed should be as fast as a
>>> mirror rebuild.  But this doesn't appear to be what people are
>>> experiencing.  Parity rebuilds would seem to take much longer.
>>
>> "This" doesn't appear to be what SOME people, who have reported
>> issues, are experiencing. Their issues must be examined on a case by
>> case basis.
> 
> Given what you state below this may very well be the case.
> 
>> But I, and a number of other people I have talked to or corresponded
>> with, have had mdadm RAID 5 or RAID 6 rebuilds of one drive run at
>> approximately the optimal sequential write speed of the replacement
>> drive. It is not unusual on a reasonably configured system.
> 
> I freely admit I may have drawn an incorrect conclusion about md parity
> rebuild performance based on incomplete data.  I simply don't recall
> anyone stating here in ~3 years that their parity rebuilds were speedy,
> but quite the opposite.  I guess it's possible that each one of those
> cases was due to another factor, such as user load, slow CPU, bus
> bottleneck, wonky disk firmware, backplane issues, etc.
> 

Maybe this is just reporting bias - people are quick to post about
problems such as slow rebuilds, but very seldom send a message saying
everything worked perfectly!

There /are/ reasons why parity raid rebuilds are going to be slower than
mirror rebuilds - delays on any one disk's reads are one issue, and I expect
that simultaneous use of the array for normal work will have more impact
on parity raid rebuild times than on a mirror array (certainly compared
to raid10 with multiple pairs).  I just don't think it is quite as bad
as you think.



Re: Triple parity and beyond

2013-11-25 Thread David Brown
On 25/11/13 03:14, Russell Coker wrote:
> On Mon, 25 Nov 2013, Stan Hoeppner  wrote:
>>> If that is the problem then the solution would be to just enable
>>> read-ahead. Don't we already have that in both the OS and the disk
>>> hardware?  The hard- drive read-ahead buffer should at least cover the
>>> case where a seek completes but the desired sector isn't under the
>>> heads.
>>
>> I'm not sure if read-ahead would solve such a problem, if indeed this is
>> a possible problem.  AFAIK the RAID5/6 drivers process stripes serially,
>> not asynchronously, so I'd think the rebuild may still stall for ms at a
>> time in such a situation.
> 
> For a RAID block device (such as Linux software RAID) read-ahead should work
> well.  For a RAID type configuration managed by the filesystem where you might
> have different RAID levels in the same filesystem it might not be possible.
> 
> It would be a nice feature to have RAID-0 for unimportant files and RAID-1 or 
> RAID-6 for important files on the same filesystem.  But that type of thing 
> would really complicate RAID rebuild.
> 

I think btrfs is planning to have such features - different files can
have different raid types.  It certainly supports different raid levels
for metadata and file data.  But it is definitely a feature you want on
the filesystem level, rather than the raid block device level.
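
(For example, a minimal sketch with made-up device names - the metadata
and data profiles are chosen at mkfs time:)

  # raid1 for metadata, raid0 for file data, across three devices
  mkfs.btrfs -m raid1 -d raid0 /dev/sda /dev/sdb /dev/sdc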



Re: Triple parity and beyond

2013-11-28 Thread David Brown
On 28/11/13 08:16, Stan Hoeppner wrote:
> Late reply.  This one got lost in the flurry of activity...
> 
> On 11/22/2013 7:24 AM, David Brown wrote:
>> On 22/11/13 09:38, Stan Hoeppner wrote:
>>> On 11/21/2013 3:07 AM, David Brown wrote:
>>>
>>>> For example, with 20 disks at 1 TB each, you can have:
>>>
> ...
>>> Maximum:
>>>
>>> RAID 10 = 10 disk redundancy
>>> RAID 15 = 11 disk redundancy
>>
>> 12 disks maximum (you have 8 with data, the rest are mirrors, parity, or
>> mirrors of parity).
>>
>>> RAID 16 = 12 disk redundancy
>>
>> 14 disks maximum (you have 6 with data, the rest are mirrors, parity, or
>> mirrors of parity).
> 
> We must follow different definitions of "redundancy".  I view redundancy
> as the number of drives that can fail without taking down the array.  In
> the case of the above 20 drive RAID15 that maximum is clearly 11
> drives-- one of every mirror and both of one mirror can fail.  The 12th
> drive failure kills the array.
> 

No, we have the same definitions of redundancy - just different
definitions of basic arithmetic.  Your definition is a bit more common!

My error was actually in an earlier email, when I listed the usable
capacities of different layouts for 20 x 1TB drive.  I wrote:

> raid10 = 10TB, 1 disk redundancy
> raid15 = 8TB, 3 disk redundancy
> raid16 = 6TB, 5 disk redundancy

Of course, it should be:

raid10 = 10TB, 1 disk redundancy
raid15 = 9TB, 3 disk redundancy
raid16 = 8TB, 5 disk redundancy
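
Spelling the arithmetic out, on the assumption that "raid15" means raid5
built over raid1 mirror pairs and "raid16" means raid6 over the pairs:

  20 disks -> 10 mirror pairs of 1TB each
  raid15: raid5 over 10 pairs -> (10 - 1) x 1TB = 9TB usable
  raid16: raid6 over 10 pairs -> (10 - 2) x 1TB = 8TB usable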


So it is your fault for not spotting my earlier mistake :-)



>>> Range:
>>>
>>> RAID 10 = 1-10 disk redundancy
>>> RAID 15 = 3-11 disk redundancy
>>> RAID 16 = 5-12 disk redundancy
>>
>> Yes, I know these are the minimum redundancies.  But that's a vital
>> figure for reliability (even if the range is important for statistical
>> averages).  When one disk in a raid10 array fails, your main concern is
>> about failures or URE's in the other half of the pair - it doesn't help
>> to know that another nine disks can "safely" fail too.
> 
> Knowing this is often critical from an architectural standpoint David.
> It is quite common to create the mirrors of a RAID10 across two HBAs and
> two JBOD chassis.  Some call this "duplexing".  With RAID10 you know you
> can lose one HBA, one cable, one JBOD (PSU, expander, etc) and not skip
> a beat.  "RAID15" would work the same in this scenario.
> 

That is absolutely true, and I agree that it is very important when
setting up big arrays.  You have to make decisions like where you split
your raid1 pairs - putting them on different controllers/chassis means
you can survive the loss of a whole half of the system.  On the other
hand, putting them on the same controller could mean hardware raid1 is
more efficient and you don't need to duplicate the traffic over the
higher level interfaces.

But here we are looking at one specific class of failures - hard disk
failures (including complete disk failure and URE's).  For that, the
redundancy is the number of disks that can fail without data loss,
assuming the worst possible combination of failures.  And given the
extra stress on the disks during degraded access or rebuilds, "bad"
combinations are more likely than "good" combinations.

So I think it is of little help to say that a 20 disk raid 15 can
survive up to 11 disk failures.  It is far more interesting to say that
it can survive any 3 random disk failures, and (if connected as you
describe with two controllers and chassis) it can also survive the
complete failure of a chassis or controller while still retaining a one
disk redundancy.


As a side issue here, I wonder if a write intent bitmap can be used for
a chassis failure so that when the chassis is fixed (the controller card
replaced, the cable re-connected, etc.) the disks inside can be brought
up to sync again without a full rebuild.
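
For a single member, md can already do something along these lines with
a write-intent bitmap - a rough sketch, with made-up device names:

  mdadm --grow /dev/md0 --bitmap=internal   # add a write-intent bitmap
  # ...chassis repaired, its member reappears...
  mdadm /dev/md0 --re-add /dev/sdq1         # only blocks dirtied while
                                            # it was missing get resynced

Whether that scales neatly to a whole chassis of members at once is
exactly the open question.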

> This architecture is impossible with RAID5/6.  Any of the mentioned
> failures will kill the array.
> 

Yes.



Re: Can moving data to a subvolume not take as long as a fully copy?

2013-01-14 Thread David Brown
Marc MERLIN  writes:

> I made a mistake and copied data in the root of a new btrfs filesystem.
> I created a subvolume, and used mv to put everything in there.
> Something like:
> cd /mnt
> btrfs subvolume create dir
> mv * dir
>
> Except it's been running for over a day now (ok, it's 5TB of data)
>
> Looks like mv is really copying all the data as if it were an entirely
> different filesystem.
>
> Is there not a way to short circuit this and only update the metadata?

Why not make a snapshot of the root volume, and then delete the files
you want to move from the original root, and delete the rest of root
from the snapshot?
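
Something along these lines - just a sketch, assuming everything at the
top level should end up inside the new subvolume:

  cd /mnt
  btrfs subvolume snapshot . dir    # shares all extents, so it is quick
  # the data now exists in both trees; drop the originals from the top
  # level, leaving only the new subvolume
  find . -mindepth 1 -maxdepth 1 -not -name dir -exec rm -rf {} +

No file data gets copied that way - only the metadata is touched.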

David


Re: 3.14.0rc3: did not find backref in send_root

2014-05-05 Thread David Brown

On Mon, Feb 24, 2014 at 10:36:52PM -0800, Marc MERLIN wrote:

I got this during a btrfs send:
BTRFS error (device dm-2): did not find backref in send_root. inode=22672, 
offset=524288, disk_byte=1490517954560 found extent=1490517954560

I'll try a scrub when I've finished my backup, but is there anything I
can run on the file I've found from the inode?

gargamel:/mnt/dshelf1/Sound# btrfs inspect-internal inode-resolve  -v 22672 
file.mp3
ioctl ret=0, bytes_left=3998, bytes_missing=0, cnt=1, missed=0
file.mp3


I've just seen this error:

  BTRFS error (device sda4): did not find backref in send_root. inode=411890, 
offset=307200, disk_byte=48100618240 found extent=48100618240

during a send between two snapshots I have.

after moving to 3.14.2.  I've seen it on two filesystems now since
moving to 3.14.  I have the two readonly snapshots if there is
anything helpful I can figure out from them.

Scrub reports no errors, but I don't seem to be able to back up
anything now.

David


Re: 3.14.0rc3: did not find backref in send_root

2014-05-10 Thread David Brown

On Mon, May 05, 2014 at 11:10:54PM -0700, David Brown wrote:

On Mon, Feb 24, 2014 at 10:36:52PM -0800, Marc MERLIN wrote:

I got this during a btrfs send:
BTRFS error (device dm-2): did not find backref in send_root. inode=22672, 
offset=524288, disk_byte=1490517954560 found extent=1490517954560

I'll try a scrub when I've finished my backup, but is there anything I
can run on the file I've found from the inode?

gargamel:/mnt/dshelf1/Sound# btrfs inspect-internal inode-resolve  -v 22672 
file.mp3
ioctl ret=0, bytes_left=3998, bytes_missing=0, cnt=1, missed=0
file.mp3


I've just seen this error:

 BTRFS error (device sda4): did not find backref in send_root. inode=411890, 
offset=307200, disk_byte=48100618240 found extent=48100618240

during a send between two snapshots I have.

after moving to 3.14.2.  I've seen it on two filesystems now since
moving to 3.14.  I have the two readonly snapshots if there is
anything helpful I can figure out from them.


After bisecting it seems to be caused by

   commit 7ef81ac86c8a44ab9f4e6e04e1f4c9ea53615b8a
   Author: Josef Bacik 
   Date:   Fri Jan 24 14:05:42 2014 -0500

   Btrfs: only process as many file extents as there are refs

    The backref walking code will search down to the key it is looking for
    and then proceed to walk _all_ of the extents on the file until it hits
    the end.  This is suboptimal with large files, we only need to look for
    as many extents as we have references for that inode.  I have a testcase
    that creates a randomly written 4 gig file and before this patch it took
    6min 30sec to do the initial send, with this patch it takes 2min 30sec
    to do the initial send.  Thanks,

   Signed-off-by: Josef Bacik 
   Signed-off-by: Chris Mason 

This appears to be fixed in 3.15-rc5, but I wonder if either the extra
fixes, or a revert of the above should be applied to 3.14 stable?
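
For reference, the bisect is just the usual git routine - roughly the
following, with the failing send between the two snapshots as the test
at each step:

  git bisect start v3.14 v3.13   # v3.14 shows the error, v3.13 did not
  # build and boot the kernel git checks out, try the send, then mark it:
  git bisect good                # or: git bisect bad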

David


Re: 3.15-rc5 btrfs send/receive corruption errors? Does scrub

2014-05-10 Thread David Brown

On Sat, May 10, 2014 at 04:57:18PM -0700, Marc MERLIN wrote:


On Fri, May 09, 2014 at 11:39:13AM -0700, Anacron wrote:

/etc/cron.daily/btrfs-scrub:
scrub device /dev/mapper/cryptroot (id 1) done
scrub started at Fri May  9 06:09:14 2014 and finished after 19153 
seconds
total bytes scrubbed: 646.15GiB with 0 errors


So, does scrub actually make sure everything on my filesystem is sane,
or can it miss some kinds of corruptions?


Does scrub make sure _anything_ on the filesystem is sane?  I guess it
would detect some failures because the tree is incorrect, but I
thought scrub was mostly about making sure the data match the
checksums.

Just curious, the future "online filesystem check", will that be part
of scrub, or another command.  It seems like it would be common to
want a faster integrity check that doesn't have to read all of the
data as well.

David


Re: lsetxattr error when doing send/receive

2014-05-13 Thread David Brown

On Tue, May 13, 2014 at 08:44:44PM -0300, Bernardo Donadio wrote:

Hi!

I'm trying to do a send/receive of a snapshot between two disks on 
Fedora 20 with Linux 3.15-rc5 (and also tried with 3.14 and 3.11) and 
SELinux disabled, and then I'm receiving the following error:


[root@darwin /]# btrfs subvolume snapshot -r / @.$(date +%Y-%m-%d-%H%M%S)
Create a readonly snapshot of '/' in './@.2014-05-13-203532'

[root@darwin /]# btrfs send @.2014-05-13-203532 | btrfs receive /mnt/cold/
At subvol @.2014-05-13-203532
At subvol @.2014-05-13-203532
ERROR: lsetxattr bin security.selinux=system_u:object_r:bin_t:s0 failed. Operation not supported


I'm missing something? Is this a bug?


Is selinux 'disabled' or just non-enforcing?  If it is enabled, but
even non-enforcing, it still won't allow the security attributes to be
set.

  $ selinuxenabled; echo $?

should give '1' if it is truly disabled.  I believe you have to
disable it at startup time, so if you've changed the config file, you
might need to reboot.
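
For reference, "truly disabled" on Fedora normally means something like
this (assuming the default config location), plus a reboot:

  # /etc/selinux/config
  SELINUX=disabled      # or boot once with selinux=0 on the kernel
                        # command line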

David


Re: lsetxattr error when doing send/receive

2014-05-14 Thread David Brown

On Wed, May 14, 2014 at 12:52:50AM -0600, Chris Murphy wrote:


On May 13, 2014, at 7:57 PM, David Brown  wrote:


On Tue, May 13, 2014 at 08:44:44PM -0300, Bernardo Donadio wrote:

Hi!

I'm trying to do a send/receive of a snapshot between two disks on Fedora 20 
with Linux 3.15-rc5 (and also tried with 3.14 and 3.11) and SELinux disabled, 
and then I'm receiving the following error:

[root@darwin /]# btrfs subvolume snapshot -r / @.$(date +%Y-%m-%d-%H%M%S)
Create a readonly snapshot of '/' in './@.2014-05-13-203532'
[root@darwin /]# btrfs send @.2014-05-13-203532 | btrfs receive /mnt/cold/
At subvol @.2014-05-13-203532
At subvol @.2014-05-13-203532
ERROR: lsetxattr bin security.selinux=system_u:object_r:bin_t:s0 failed. Operation not supported

I'm missing something? Is this a bug?


Is selinux 'disabled' or just non-enforcing?  If it is enabled, but
even non-enforcing, it still won't allow the security attributes to be
set.


Reverse that. If selinux is disabled, labels can't be set. If not
enforcing, you won't get AVC denials for the vast majority of events,
but labels can be set and e.g. restorecon will still work.


  $ selinuxenabled ; echo $?
  0
  $ touch /var/tmp/foo
  $ sudo setfattr -n security.selinux -v system_u:object_r:bin_t:s0 /var/tmp/foo
  $ ls -lZ /var/tmp/foo
  -rw-rw-r--. davidb davidb system_u:object_r:bin_t:s0  /var/tmp/foo

and on a machine with selinux disabled:

  $ selinuxenabled ; echo $?
  1
  $ touch /var/tmp/foo
  $ sudo setfattr -n security.selinux -v system_u:object_r:bin_t:s0 /var/tmp/foo
  $ ls -lZ /var/tmp/foo
  -rw-rw-r--. davidb davidb system_u:object_r:bin_t:s0  /var/tmp/foo

so it doesn't actually seem to matter.  At this point, I'm suspecting
this was actually a bug in a kernel I was running at some point, and I
just haven't bothered trying to enable selinux since then.  I
definitely have received errors in the past from rsync that look like
the above error that I could fix by booting with selinux disabled.

David


Re: btrfs send/receive still gets out of sync in 3.14.0

2014-04-22 Thread David Brown

On Sat, Mar 22, 2014 at 02:04:56PM -0700, Marc MERLIN wrote:

After deleting a huge directory tree in my /home subvolume, syncing
snapshots now fails with:

ERROR: rmdir o1952777-157-0 failed. No such file or directory
Error line 156 with status 1

DIE: Code dump:
  153   if [[ -n "$init" ]]; then
  154   btrfs send "$src_newsnap" | $ssh btrfs receive "$dest_pool/"
  155   else
  156   btrfs send -p "$src_snap" "$src_newsnap" | $ssh btrfs receive "$dest_pool/"
  157   fi
  158   
  159   # We make a read-write snapshot in case you want to use it for a chroot


Is there anything useful I can provide before killing my snapshot and doing
a full sync again?


I have been able to work around this by hacking up btrfs receive to
ignore the rmdir.  As far as I can tell (tree comparison) the
resulting tree is correct.

David

diff --git a/cmds-receive.c b/cmds-receive.c
index d6cd3da..5bd4161 100644
--- a/cmds-receive.c
+++ b/cmds-receive.c
@@ -492,6 +492,9 @@ static int process_rmdir(const char *path, void *user)
 		fprintf(stderr, "ERROR: rmdir %s failed. %s\n", path,
 				strerror(-ret));
 	}
+	// Ugly hack to work around kernel problem of sending
+	// redundant rmdirs.
+	ret = 0;
 
 	free(full_path);
 	return ret;


Content based storage

2010-03-16 Thread David Brown

Hi,

I was wondering if there has been any thought or progress in 
content-based storage for btrfs beyond the suggestion in the "Project 
ideas" wiki page?


The basic idea, as I understand it, is that a longer data extent
checksum is used (long enough to make collisions unrealistic), and data
extents with the same checksums are merged.  The result is that "cp foo bar"
will have pretty much the same effect as "cp --reflink foo bar" - the 
two copies will share COW data extents - as long as they remain the 
same, they will share the disk space.  But you can still access each 
file independently, unlike with a traditional hard link.


I can see at least three cases where this could be a big win - I'm sure 
there are more.


Developers often have multiple copies of source code trees as branches, 
snapshots, etc.  For larger projects (I have multiple "buildroot" trees 
for one project) this can take a lot of space.  Content-based storage 
would give the space efficiency of hard links with the independence of 
straight copies.  Using "cp --reflink" would help for the initial 
snapshot or branch, of course, but it could not help after the copy.


On servers using lightweight virtual servers such as OpenVZ, you have 
multiple "root" file systems each with their own copy of "/usr", etc. 
With OpenVZ, all the virtual roots are part of the host's file system 
(i.e., not hidden within virtual disks), so content-based storage could 
merge these, making them very much more efficient.  Because each of 
these virtual roots can be updated independently, it is not possible to 
use "cp --reflink" to keep them merged.


For backup systems, you will often have multiple copies of the same 
files.  A common scheme is to use rsync and "cp -al" to make hard-linked 
(and therefore space-efficient) snapshots of the trees.  But sometimes 
these things get out of synchronisation - perhaps your remote rsync dies 
halfway, and you end up with multiple independent copies of the same 
files.  Content-based storage can then re-merge these files.
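
The scheme referred to is roughly the following - a sketch, with example
paths:

  # hard-link the previous snapshot, then let rsync replace changed files
  cp -al /backup/latest /backup/snap-$(date +%F)
  rsync -a --delete remote:/data/ /backup/latest/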



I would imagine that content-based storage will sometimes be a 
performance win, sometimes a loss.  It would be a win when merging 
results in better use of the file system cache - OpenVZ virtual serving 
would be an example where you would be using multiple copies of the same 
file at the same time.  For other uses, such as backups, there would be 
no performance gain since you seldom (hopefully!) read the backup files. 
 But in that situation, speed is not a major issue.



mvh.,

David



Re: Content based storage

2010-03-17 Thread David Brown

On 16/03/2010 23:45, Fabio wrote:

Some years ago I was searching for that kind of functionality and found
an experimental ext3 patch to allow the so-called COW-links:
http://lwn.net/Articles/76616/



I'd read about the COW patches for ext3 before.  While there is 
certainly some similarity here, there are a fair number of differences. 
 One is that those patches were aimed only at copying - there was no 
way to merge files later.  Another is that it was (as far as I can see) 
just an experimental hack to try out the concept.  Since it didn't take 
off, I think it is worth learning from, but not building on.



There was a discussion later on LWN http://lwn.net/Articles/77972/
an approach like COW-links would break POSIX standards.



I think a lot of the problems here were concerning inode numbers.  As 
far as I understand it, when you made an ext3-cow copy, the copy and the 
original had different inode numbers.  That meant the userspace programs 
saw them as different files, and you could have different owners, 
attributes, etc., while keeping the data linked.  But that broke a 
common optimisation when doing large diff's - thus some people wanted to 
have the same inode for each file and that /definitely/ broke posix.


With btrfs, the file copies would each have their own inode - it would, 
I think, be posix compliant as it is transparent to user programs.  The 
diff optimisation discussed in the articles you cited would not work - 
but if btrfs becomes the standard Linux file system, then user 
applications like diff can be extended with btrfs-specific optimisations 
if necessary.



I am not very technical and don't know if it's feasible in btrfs.


Nor am I very knowledgeable in this area (most of my programming is on 
8-bit processors), but I believe btrfs is already designed to support 
larger checksums (32-bit CRCs are not enough to say that data is 
identical), and the "cp --reflink" shows how the underlying link is made.



I think most likely you'll have to run an userspace tool to find and
merge identical files based on checksums (which already sounds good to me).


This sounds right to me.  In fact, it would be possible to do today, 
entirely from within user space - but files would need to be compared 
long-hand before merging.  With larger checksums, the userspace daemon 
would be much more efficient.
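
As a crude illustration of the userspace approach, something like this
already finds merge candidates - only a sketch, and each group would
still need a byte-for-byte comparison before actually merging:

  # group files by SHA-256; identical digests are candidate duplicates
  find /path -type f -print0 | xargs -0 sha256sum \
        | sort | uniq -w64 --all-repeated=separate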



The only thing we can ask the developers at the moment is if something
like that would be possible without changes to the on-disk format.



I guess that's partly why I made these posts!



PS. Another great scenario is shared hosting web/file servers: ten of
thousand website with mostly the same tiny PHP Joomla files.
If you can get the benefits of: compression + "content based"/cowlinks +
FS Cache... That would really make Btrfs FLY on Hard Disk and make SSD
devices possible for storage (because of the space efficiency).



That's a good point.

People often think that hard disk space is cheap these days - but being 
space efficient means you can use an SSD instead of a hard disk.  And 
for on-disk backups, it means you can use a small number of disks even 
though the users think "I've got a huge hard disk, I can make lots of 
copies of these files" !




Re: Content based storage

2010-03-17 Thread David Brown

On 17/03/2010 01:45, Hubert Kario wrote:

On Tuesday 16 March 2010 10:21:43 David Brown wrote:

Hi,

I was wondering if there has been any thought or progress in
content-based storage for btrfs beyond the suggestion in the "Project
ideas" wiki page?

The basic idea, as I understand it, is that a longer data extent
checksum is used (long enough to make collisions unrealistic), and merge
data extents with the same checksums.  The result is that "cp foo bar"
will have pretty much the same effect as "cp --reflink foo bar" - the
two copies will share COW data extents - as long as they remain the
same, they will share the disk space.  But you can still access each
file independently, unlike with a traditional hard link.

I can see at least three cases where this could be a big win - I'm sure
there are more.

Developers often have multiple copies of source code trees as branches,
snapshots, etc.  For larger projects (I have multiple "buildroot" trees
for one project) this can take a lot of space.  Content-based storage
would give the space efficiency of hard links with the independence of
straight copies.  Using "cp --reflink" would help for the initial
snapshot or branch, of course, but it could not help after the copy.

On servers using lightweight virtual servers such as OpenVZ, you have
multiple "root" file systems each with their own copy of "/usr", etc.
With OpenVZ, all the virtual roots are part of the host's file system
(i.e., not hidden within virtual disks), so content-based storage could
merge these, making them very much more efficient.  Because each of
these virtual roots can be updated independently, it is not possible to
use "cp --reflink" to keep them merged.

For backup systems, you will often have multiple copies of the same
files.  A common scheme is to use rsync and "cp -al" to make hard-linked
(and therefore space-efficient) snapshots of the trees.  But sometimes
these things get out of synchronisation - perhaps your remote rsync dies
halfway, and you end up with multiple independent copies of the same
files.  Content-based storage can then re-merge these files.


I would imagine that content-based storage will sometimes be a
performance win, sometimes a loss.  It would be a win when merging
results in better use of the file system cache - OpenVZ virtual serving
would be an example where you would be using multiple copies of the same
file at the same time.  For other uses, such as backups, there would be
no performance gain since you seldom (hopefully!) read the backup files.
   But in that situation, speed is not a major issue.


mvh.,

David


 From what I could read, content based storage is supposed to be in-line
deduplication, there are already plans to do (probably) a userland daemon
traversing the FS and merging indentical extents -- giving you post-process
deduplication.

For a rather heavy used host (such as a VM host) you'd probably want to use
post-process dedup -- as the daemon can be easly stopped or be given lower
priority. In line dedup is quite CPU intensive.

In line dedup is very nice for backup though -- you don't need the temporary
storage before the (mostly unchanged) data is deduplicated.


I think post-process deduplication is the way to go here, using a 
userspace daemon.  It's the most flexible solution.  As you say, inline 
dedup could be nice in some cases, such as for backups, since the cpu 
time cost is not an issue there.  However, in a typical backup 
situation, the new files are often written fairly slowly (for remote 
backups).  Even for local backups, there is generally not that much 
/new/ data, since you normally use some sort of incremental backup 
scheme (such as rsync, combined with cp -al or cp --reflink).  Thus it 
should be fine to copy over the data, then de-dup it later or in the 
background.




UUID of subvolumes

2010-04-08 Thread David Brown

I am developing backup software (<http://github.com/d3zd3z/jpool> for
the curious), and have been doing some testing with btrfs.

Jpool currently uses the blkid database to map between device numbers
(st_rdev) and the uuid of a particular filesystem.  I originally
created this because LVM device numbers sometimes changed.  Jpool
uses the uuid to track files within a tree.

The subvolumes on btrfs seem to be getting ephemeral device numbers,
which aren't listed in the blkid output.  The program falls back to
using the mountpoint, but that breaks if the mountpoint changes.

  - Do subvolumes in btrfs even have separate uuids, and should they?

  - Is there any way for me to map a particular st_rdev value to a
particular filesystem/subvolume?

  - btrfs seems to allow me to rename subvolumes, so this doesn't seem
like a particularly good value to use as a key.  The device number
can change depending on what else might be used.

  - Any other ideas on a unique key I could use for a given subvolume
to identify the files on that volume, even if it moves around?

Thanks,
David Brown


Re: [PATCH] fs/btrfs: Return EPERM for rmdir on subvolumes and snapshots

2010-04-08 Thread David Brown

On Thu, Apr 08, 2010 at 01:35:31PM -0700, Harshavardhana wrote:



if (inode->i_size > BTRFS_EMPTY_DIR_SIZE ||
inode->i_ino == BTRFS_FIRST_FREE_OBJECTID)
-   return -ENOTEMPTY;
+   return -EPERM;


Don't you want to still return ENOTEMPTY for the size check, and only
the EPERM on the root of subvolume?

David


Re: Confused about resizing

2010-05-26 Thread David Brown

On 26/05/2010 10:41, David Pottage wrote:


On Wed, May 26, 2010 3:46 am, Charlie Brune wrote:

I think I'm not understanding something fundamental about btrfs: what am I
able to resize?  Resizing would be nice, given that it's so hard to do
with ext3 (or even LVM).

I created a btrfs filesystem on my 32G thumbdrive (/dev/sdb):

[snip]


BUT, what's the point of resizing the filesystem with something like:

  btrfsctl -r 15g /mnt/btrfs

???

After I do it, I'm assuming that there's roughly 17G in /dev/sdb1 that I'm
not using, but I don't know how to get to it.  Can I make *another*
filesystem on /dev/sdb1 and then mount it to somewhere like /mnt/btrfs2.


After shrinking the filesystem on /dev/sdb1 to 15G, you could then run a
disk partiton tool on your thumbdrive so that the /dev/sdb1 partition is
also 15G. After that you could create other partition(s) in the remaining
space, and put other filing systems there.

Thumb drives are a fairly poor example, because most people use them as
single volumes, and if they find that their thumb drive is the wrong size,
they just buy another.

A better example would be a file server. Suppose you are administering a
linux file server for an engineering company. There is a 300G disc split
between the design and the marketing departments. Currently the designers
have 200G for their CAD designs, and the marketing people have 100G for
sales brochures.

The designers complete a product design, and archive most of their old CAD
data to backup tapes, and now the marketing people need more disc space to
put together more brochures to sell it, so you need to shrink the Design
volume to 100G, and increase the marketing volume up to 200G.

With other Linux file-systems it was possible to resize volumes, but only
if the volume is offline, so for the resize described above, you would
need to go to the office at the weekend. With btrfs the resizing can be
done while the system is online, so there would be no need for you to give
up your weekend.



Some other file systems, including reiserfs3 and ext3/4, can be 
increased in size while online.  But they must be taken offline for a 
shrink, which is a very slow operation.  If btrfs can shrink online, 
that's a very nice feature.



A slight fly in the ointment is that currently btrfs only supports
extending or shrinking a filing system from the end so in order to do the
resize above the logical partitions hosting the volumes would have to be
under an LVM, so that the physical blocks could be stored on the disc out
of order.



If you are expecting to change file system sizes, LVM makes things very 
much easier - it is far easier to create, delete or resize logical disks 
than to edit your partitions.


If you are shrinking a file system (regardless of the method), it is 
best to first shrink it to a bit less than your target size.  Then 
resize the partition, then grow the file system to fit the partition. 
That way you avoid accidentally chopping off the end of your file system 
due to rounding errors (things like block size, rounding to the nearest 
cylinder, mix-ups between GB as 2^30 bytes or 10^9 bytes, etc.).
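
As a sketch, using the thumbdrive example from the original post (the
sizes are only illustrative; newer tools spell it "btrfs filesystem
resize" instead of btrfsctl):

  btrfsctl -r 14g /mnt/btrfs    # shrink a little below the 15G target
  # shrink /dev/sdb1 to 15G with fdisk or parted, then grow to fit:
  btrfsctl -r max /mnt/btrfs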




Re: default subvolume abilities/restrictions

2010-06-12 Thread David Brown

On Sat, Jun 12, 2010 at 06:06:23PM -0500, C Anthony Risinger wrote:


# btrfs subvolume create new_root
# mv . new_root/old_root



can i at least get confirmation that the above is possible?


I've had no problem with

  # btrfs subvolume snapshot . new_root
  # mkdir old_root
  # mv * old_root
  # rm -rf old_root

Make sure the 'mv' fails to move new_root, and I'd look at the
new_root before removing everything.

David


Re: Confused by performance

2010-06-16 Thread David Brown

On 16/06/2010 21:35, Freddie Cash wrote:



That's all well and good, but you missed the part where he said ext2
on a 5-way LVM stripeset is many times faster than btrfs on a 5-way
btrfs stripeset.

IOW, same 5-way stripeset, different filesystems and volume managers,
and very different performance.

And he's wondering why the btrfs method used for striping is so much
slower than the lvm method used for striping.



This could easily be explained by Roberto's theory and maths - if the 
lvm stripe set used large stripe sizes so that the random reads were 
mostly read from a single disk, it would be fast.  If the btrfs stripes 
were small, then it would be slow due to all the extra seeks.


Do we know anything about the stripe sizes used?
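
If the original poster can check, the LVM side at least is easy to
inspect - a sketch, assuming a normal LVM2 setup:

  lvdisplay -m /dev/vg0/lv0    # the segment listing shows Stripes and
                               # Stripe size for striped LVs
  dmsetup table vg0-lv0        # the 'striped' target line gives the
                               # chunk size in 512-byte sectors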




Re: Hardlinks-per-directory limit?

2010-08-01 Thread David Brown
On Wednesday 28 July 2010, Ken D'Ambrosio said:

> Hello, all.  I'm thinking of rolling out a BackupPC server, and --
> based on the strength of the recent Phoronix benchmarks
> (http://benchmarkreviews.com/index.php?option=com_content&task=view&id=11156&Itemid=23)
> -- had been strongly considering btrfs.  But I do seem to recall
> that there was some sort of hardlinks-per-directory limitation, and
> BackupPC *loves* hardlinks.  Would someone care to either remind me
> what the issue was, or reassure me that it's been rectified?

btrfs has a limit on the number of hardlinks to a single file that can
exist in the same directory.  I don't believe that BackupPC will create any more
hardlinks in a given directory than are already in the filesystem you
are backing up.  It uses hardlinks between directories for files that
haven't changed.

David


Re: BTRFS && SSD

2010-09-30 Thread David Brown

On 29/09/2010 23:31, Yuehai Xu wrote:

On Wed, Sep 29, 2010 at 3:59 PM, Sean Bartell  wrote:

On Wed, Sep 29, 2010 at 02:45:29PM -0400, Yuehai Xu wrote:

On Wed, Sep 29, 2010 at 1:08 PM, Sean Bartell  wrote:

On Wed, Sep 29, 2010 at 11:30:14AM -0400, Yuehai Xu wrote:

I know BTRFS is a kind of Log-structured File System, which doesn't do
overwrite. Here is my question, suppose file A is overwritten by A',
instead of writing A' to the original place of A, a new place is
selected to store it. However, we know that the address of a file
should be recorded in its inode. In such case, the corresponding part
in inode of A should update from the original place A to the new place
A', is this a kind of overwrite actually? I think no matter what
design it is for Log-Structured FS, a mapping table is always needed,
such as inode map, DAT, etc. When a update operation happens for this
mapping table, is it actually a kind of over-write? If it is, is it a
bottleneck for the performance of write for SSD?


In btrfs, this is solved by doing the same thing for the inode--a new
place for the leaf holding the inode is chosen. Then the parent of the
leaf must point to the new position of the leaf, so the parent is moved,
and the parent's parent, etc. This goes all the way up to the
superblocks, which are actually overwritten one at a time.


You mean that there is no over-write for inode too, once the inode
need to be updated, this inode is actually written to a new place
while the only thing to do is to change the point of its parent to
this new place. However, for the last parent, or the superblock, does
it need to be overwritten?


Yes. The idea of copy-on-write, as used by btrfs, is that whenever
*anything* is changed, it is simply written to a new location. This
applies to data, inodes, and all of the B-trees used by the filesystem.
However, it's necessary to have *something* in a fixed place on disk
pointing to everything else. So the superblocks can't move, and they are
overwritten instead.



So, is it a bottleneck in the case of SSD since the cost for over
write is very high? For every write, I think the superblocks should be
overwritten, it might be much more frequent than other common blocks
in SSD, even though SSD will do wear leveling inside by its FTL.



SSDs already do copy-on-write.  They can't change small parts of the 
data in a block, but have to re-write the block.  While that could be 
done by reading the whole erase block to a ram buffer, changing the 
data, erasing the flash block, then re-writing, this is not what happens 
in practice.  To make efficient use of write blocks that are smaller 
than erase blocks, and to provide wear levelling, the flash disk will 
implement a small change to a block by writing a new copy of the 
modified block to a different part of the flash, then updating its block 
indirection tables.


BTRFS just makes this process a bit more explicit (except for superblock 
writes).



What I current know is that for Intel x25-V SSD, the write throughput
of BTRFS is almost 80% less than the one of EXT3 in the case of
PostMark. This really confuses me.



Different file systems have different strengths and weaknesses.  I 
haven't actually tested BTRFS much, but my understanding is that it will 
be significantly slower than EXT in certain cases, such as small 
modifications to large files (since copy-on-write means a lot of extra 
disk activity in such cases).  But for other things it is faster.  Also 
remember that BTRFS is under development - optimising for raw speed 
comes at a lower priority than correctness and safety of data, and 
implementation of BTRFS features.  Once everyone is happy with the 
stability of the file system and its functionality and tools, you can 
expect the speed to improve somewhat over time.

