Re: Problems with incremental send/receive

2014-01-09 Thread Felix Blanke
Hi Wang,

here is the version information:

server log # btrfs version
Btrfs v3.12-dirty
server log # uname -a
Linux server.home 3.12.6-hardened-r3 #1 SMP Thu Jan 2 13:16:48 CET
2014 x86_64 Intel(R) Celeron(R) CPU G1610 @ 2.60GHz GenuineIntel
GNU/Linux

This should work if I understood you correctly?

Regards,
Felix

On Thu, Jan 9, 2014 at 12:36 PM, Felix Blanke  wrote:
> Hi Wang,
>
> thank you for your answer.
>
> I am using the latest btrfs-progs with the 3.12 kernel. I don't have
> access to the machine right now (it looks like it crashed :/) but I
> can send the exact versions when I'm home.
>
> Regards,
> Felix
>
> On Thu, Jan 9, 2014 at 3:10 AM, Wang Shilong  
> wrote:
>> Hi Felix,
>>
>> It seems someone reported this problem before. The problem in your case
>> below is that you are using the latest btrfs-progs (v3.12?), which needs
>> a kernel update; kernel 3.12 is OK.
>>
>> However, I think btrfs-progs should keep compatibility; I will send a
>> patch to make things more friendly.
>>
>> Thanks,
>> Wang
>>
>> On 01/09/2014 06:04 AM, Felix Blanke wrote:
>>>
>>> Hi List,
>>>
>>> My backup stopped working and I can't figure out why. I'm using
>>> send/receive with the "-p" switch for incremental backups using the
>>> last snapshot as a parent snapshot for sending only the changed data.
>>>
>>> The problem occurs using my own backup script. After I discovered the
>>> problem I did a quick test using the exact commands from the wiki with
>>> the same result: It doesn't work. Here is the output:
>>>
>>>
>>> server ~ # ./test_snapshot.sh
>>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/
>>> /mnt/root1/snapshots/test
>>> Create a readonly snapshot of '/mnt/root1/@root_home/' in
>>> '/mnt/root1/snapshots/test'
>>> ++ sync
>>> ++ btrfs send /mnt/root1/snapshots/test
>>> ++ btrfs receive /mnt/backup1/
>>> At subvol /mnt/root1/snapshots/test
>>> At subvol test
>>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/
>>> /mnt/root1/snapshots/test_new
>>> Create a readonly snapshot of '/mnt/root1/@root_home/' in
>>> '/mnt/root1/snapshots/test_new'
>>> ++ sync
>>> ++ btrfs send -p /mnt/root1/snapshots/test /mnt/root1/snapshots/test_new
>>> ++ btrfs receive /mnt/backup1/
>>> At subvol /mnt/root1/snapshots/test_new
>>> At snapshot test_new
>>> ERROR: open @/test failed. No such file or directory
>>>
>>> I don't get where the "@/" in front of the snapshot name comes from.
>>> It could be that I once had a subvolume named @, but it doesn't exist
>>> anymore, and I don't understand why it would matter for
>>> send/receive.
>>>
>>> Some more details about the fs:
>>>
>>> server ~ # btrfs subvol list /mnt/root1/
>>> ID 259 gen 568053 top level 5 path @root
>>> ID 261 gen 568053 top level 5 path @var
>>> ID 263 gen 568049 top level 5 path @home
>>> ID 302 gen 568053 top level 5 path @owncloud_chroot
>>> ID 421 gen 568038 top level 5 path @root_home
>>> ID 30560 gen 563661 top level 5 path snapshots/home_2014-01-06-19:33_d
>>> ID 30561 gen 563665 top level 5 path
>>> snapshots/owncloud_chroot_2014-01-06-19:34_d
>>> ID 30562 gen 563674 top level 5 path
>>> snapshots/root_home_2014-01-06-19:38_d
>>> ID 30563 gen 563675 top level 5 path snapshots/var_2014-01-06-19:39_d
>>> ID 30564 gen 563697 top level 5 path snapshots/root_2014-01-06-19:50_d
>>>
>>> server ~ # btrfs subvol get-default /mnt/root1/
>>> ID 5 (FS_TREE)
>>>
>>> server ~ # ls -l /mnt/root1/
>>> total 0
>>> drwxr-xr-x. 1 root root  30 May 10  2013 @home
>>> drwxr-xr-x. 1 root root 134 Jan  5 19:27 @owncloud_chroot
>>> drwxr-xr-x. 1 root root 204 Nov 24 18:16 @root
>>> drwx--. 1 root root 468 Jan  8 22:47 @root_home
>>> drwxr-xr-x. 1 root root 114 Oct  7 17:39 @var
>>> drwx--. 1 root root 420 Jan  8 22:50 snapshots
>>>
>>>
>>> Any ideas? Thanks in advance.
>>>
>>>
>>> Regards,
>>> Felix


backpointer mismatch

2014-01-09 Thread Peter van Hoof

Hi,

I am using btrfs for my backup RAID. This had been running well for 
about a year. Recently I decided to upgrade the backup server to 
openSUSE 13.1. I checked all filesystems before the upgrade and 
everything was clean. I had several attempts at upgrading the system, 
but all failed (the installation of some rpm would hang indefinitely). 
So I aborted the installation and reverted the system back to openSUSE 
12.3 (with a custom-installed 3.9.7 kernel). Unfortunately, after this 
the backup RAID reported lots of errors.


When I run btrfsck on the filesystem, I get around 1.3M of these messages:

Extent back ref already exists for 1116254208 parent 11145490432 root 0

and around 1.2M of these:

ref mismatch on [90670907392 4096] extent item 11, found 12
Incorrect global backref count on 90670907392 found 11 wanted 12
backpointer mismatch on [90670907392 4096]

Filtering these out, this is the remaining output:

checking extents
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/md2
UUID: 0b6a9d0d-e501-4a23-9d09-259b1f5b5652
found 2213988384746 bytes used err is 0
total csum bytes: 3185850148
total tree bytes: 42770862080
total fs tree bytes: 36787625984
total extent tree bytes: 1643925504
btree space waste bytes: 12475940633
file data blocks allocated: 5269432860672
 referenced 5254870626304
Btrfs v3.12+20131125

(this version of btrfsck comes from openSUSE Factory).

I also ran btrfs scrub on the file system. This uncovered 4 checksum 
errors which I could repair manually. I do not know if that is related 
to the problem above. At least it didn't solve it...


The btrfs file system is installed on top of an mdadm RAID5.

How worried should I be about the reported errors? What confuses me is 
that in the end btrfsck reports an error count of 0.


Should I try to repair this? I have had bad experiences in the past with 
"btrfsck --repair", but that was with a much older version...


I can of course recreate the backups, but this would take a long time 
and I would lose my entire snapshot history, which I would rather avoid...



Cheers,

Peter.

--
Peter van Hoof
Royal Observatory of Belgium
Ringlaan 3
1180 Brussel
Belgium
http://homepage.oma.be/pvh


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Mitchell

On 01/09/2014 05:06 PM, Jim Salter wrote:

On Jan 9, 2014 7:46 PM, George Mitchell  wrote:

I would prefer that the drive, even flash media type, would
catch and resolve write failures.  If it doesn't happen at the hardware
layer, according to how I understand Hugo's answer, btrfs, at least for
now, is not capable of it.

Not sure what you mean by this. If a bit flips on a btrfs-raid1 block, btrfs 
will detect it. Then it checks the mirror's copy of that block. It returns the 
good copy, then immediately writes the good copy over the bad copy.

I know this because I tested it directly just last week by flipping a bit in an 
offline btrfs filesystem manually. When I brought the volume back online and 
read the file containing the bit I flipped, it operated exactly as described, 
and logged its actions in kern.log. :-)

Jim, my point was that IF the drive does not successfully resolve the 
bad block issue and btrfs takes a write failure every time it attempts 
to overwrite the bad data, it is not going to remap that data, but 
rather it is going to fail the drive.  In other words, if the drive has 
a bad sector which it has not done anything about at the drive level, 
btrfs will not remap the sector.  It will, rather, fail the drive. Is 
that not correct?



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Mitchell

Hello Clemens,

On 01/09/2014 04:08 PM, Clemens Eisserer wrote:

Hi George,


I really suspect a lot of bad block issues can be avoided by monitoring
SMART data.  SMART is working very well for me with btrfs formatted drives.
SMART will detect when sectors silently fail and as those failures
accumulate, SMART will warn in an obvious way that the drive in question is
at end of life.  So I think the whole bad block issue should ideally be
handled at a lower level than filesystem with modern hard drives.

At least my original request was about cheap flash media, where you
don't have the luxury that you can "trust" the hardware behaving
properly. In fact, it might be beneficial for an SD card to not report
ECC errors - most likely the user won't notice a small glitch playing
back music - but he definitely will when the smartphone reports read
errors and stops playback, which will cause that card to be RMA'd.

Also, wouldn't your argument also be valid for checksums - why
checksum in software, when in theory the drive + controllers should do
it anyway ;)

Regards, Clemens


It would certainly be a vast improvement if flash media had some of the 
sanity checking capability that conventional media has, but to say that 
these sorts of problems with flash media are legendary would almost be 
an understatement.


As for checksums, I view them more as a tool to detect data decay as 
opposed to checking for failed writes.  Of course that data decay might 
well result in failed writes when btrfs scrub tries to correct it.  At 
that point I would prefer that the drive, even flash media type, would 
catch and resolve write failures.  If it doesn't happen at the hardware 
layer, according to how I understand Hugo's answer, btrfs, at least for 
now, is not capable of it.  I believe it is true that filesystems have 
historically done bad-block handling, but I do think it is moving now to 
the hardware layer, which is probably the best place for it to be, and 
the flash drive industry needs to solve this problem at the 
hardware/firmware level.  That is my opinion anyway.



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Clemens Eisserer
Hi George,

> I really suspect a lot of bad block issues can be avoided by monitoring
> SMART data.  SMART is working very well for me with btrfs formatted drives.
> SMART will detect when sectors silently fail and as those failures
> accumulate, SMART will warn in an obvious way that the drive in question is
> at end of life.  So I think the whole bad block issue should ideally be
> handled at a lower level than filesystem with modern hard drives.

At least my original request was about cheap flash media, where you
don't have the luxury that you can "trust" the hardware behaving
properly. In fact, it might be beneficial for an SD card to not report
ECC errors - most likely the user won't notice a small glitch playing
back music - but he definitely will when the smartphone reports read
errors and stops playback, which will cause that card to be RMA'd.

Also, wouldn't your argument also be valid for checksums - why
checksum in software, when in theory the drive + controllers should do
it anyway ;)

Regards, Clemens


BTRFS_SEARCH_ARGS_BUFSIZE too small

2014-01-09 Thread Gerhard Heift
Hello,

I'm playing around with BTRFS_IOC_TREE_SEARCH to extract the csums
of the physical blocks. During the tests some item headers had len = 0,
which indicates the buffer was too small to hold the item. I added a
printk into the kernel to get the original size of the item and it was
around 6600 bytes. Is there another way to get the item? Otherwise I
would suggest creating an ioctl that is a little more flexible,
something like

struct btrfs_ioctl_search_args2 {
	struct btrfs_ioctl_search_key key;
	__u64 buf_len;
	char buf[0];
};
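
For reference, here is a minimal sketch of how the existing fixed-buffer
interface is typically driven, and where the len == 0 truncation described
above shows up. This is illustrative user-space code, not the proposed
ioctl; the tree id (7, assumed to be the csum tree) and the field names
follow my reading of the 3.12-era linux/btrfs.h and should be treated as
assumptions:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Walk one batch of csum-tree items on the filesystem behind fd. */
static int dump_csum_items(int fd)
{
	struct btrfs_ioctl_search_args args;
	struct btrfs_ioctl_search_header *sh;
	unsigned long off = 0;
	unsigned int i;

	memset(&args, 0, sizeof(args));
	args.key.tree_id = 7;			/* csum tree (assumed id) */
	args.key.max_objectid = (__u64)-1;	/* whole key range */
	args.key.max_type = (__u32)-1;
	args.key.max_offset = (__u64)-1;
	args.key.max_transid = (__u64)-1;
	args.key.nr_items = 4096;		/* upper bound per call */

	if (ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args) < 0)
		return -1;

	for (i = 0; i < args.key.nr_items; i++) {
		sh = (struct btrfs_ioctl_search_header *)(args.buf + off);
		if (sh->len == 0) {
			/* the item was larger than the fixed buffer:
			 * exactly the truncation described above */
			fprintf(stderr, "truncated item, type %u\n", sh->type);
			return -1;
		}
		printf("item objectid %llu len %u\n",
		       (unsigned long long)sh->objectid, sh->len);
		off += sizeof(*sh) + sh->len;
	}
	return 0;
}

A variable-length buffer as proposed would let the caller retry with
buf_len sized to the reported item instead of losing it.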

Gerhard


Re: FILE_EXTENT_SAME changes mtime and ctime

2014-01-09 Thread Gerhard Heift
2014/1/6 David Sterba :
> On Mon, Jan 06, 2014 at 12:02:51AM +0100, Gerhard Heift wrote:
>> I am currently playing with snapshots and manual deduplication of
>> files. During these tests I noticed the change of ctime and mtime in
>> the snapshot after the deduplication with FILE_EXTENT_SAME. Does this
>> happen on purpose? Otherwise I would like to have ctime and mtime
>> left unmodified, because on a read only snapshot I cannot change them
>> back after the ioctl call.
>
> I'm not sure what the correct behaviour is wrt timestamps and extent
> cloning. The inode metadata is modified in some way, but the stat data
> and actual contents are left unchanged, so the timestamps do not reflect
> that something changed according to their definition (stat(2)).
>
> On the other hand, the differences can be seen in the extent listing,
> the physical offset of the blocks will change. I'm not aware of any
> tools that would become broken by breaking this assumption. Also, the
> (partial) cloning functionality is not implemented anywhere so we could
> have a look and try to stay consistent with that.
>
> My opinion is to drop the mtime/iversion updates completely.

In my opinion, we should never update them if we dedup the content of files
with FILE_EXTENT_SAME.

If we clone with CLONE(_RANGE), the mtime should be updated, because
it's like a write operation.

The semantics of ctime are not completely clear to me. It should change
if the "visible" metadata of a file changes. In that case, should it
be updated on a write, because mtime changes, or only if the size
of the file changes?

> david

Gerhard
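
For context, here is a minimal sketch of the dedup call being discussed.
The struct layout and the BTRFS_IOC_FILE_EXTENT_SAME name follow my reading
of the 3.12-era linux/btrfs.h; the zero offsets and single destination are
purely illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Ask btrfs to share len bytes at offset 0 of src_fd with dst_fd. */
static int dedup_one(int src_fd, int dst_fd, __u64 len)
{
	struct btrfs_ioctl_same_args *args;
	int ret;

	args = calloc(1, sizeof(*args) + sizeof(args->info[0]));
	if (!args)
		return -1;
	args->logical_offset = 0;		/* range in the source file */
	args->length = len;
	args->dest_count = 1;
	args->info[0].fd = dst_fd;		/* range in the destination */
	args->info[0].logical_offset = 0;

	ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
	if (ret == 0 && args->info[0].status == 0)
		printf("deduped %llu bytes\n",
		       (unsigned long long)args->info[0].bytes_deduped);
	free(args);
	return ret;
}

Whether a successful call like this should touch ctime/mtime on the
destination inode is the open question in this thread.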


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Mitchell
I really suspect a lot of bad block issues can be avoided by monitoring 
SMART data.  SMART is working very well for me with btrfs formatted 
drives.  SMART will detect when sectors silently fail and as those 
failures accumulate, SMART will warn in an obvious way that the drive in 
question is at end of life.  So I think the whole bad block issue should 
ideally be handled at a lower level than filesystem with modern hard drives.



Help repairing corrupt btrfs -- btrfsck --repair doesn't change anything

2014-01-09 Thread Zack Weinberg
I have a btrfs partition with what *sounds* like minor damage; btrfsck
--repair prints

| enabling repair mode
| Checking filesystem on /dev/sda
| UUID: ec93d2c2-7937-40f8-aaa6-c20c9775d93a
| checking extents
| checking free space cache
| cache and super generation don't match, space cache will be invalidated
| checking fs roots
| root 258 inode 4493802 errors 400, nbytes wrong
| root 258 inode 4509858 errors 400, nbytes wrong
| root 258 inode 4510014 errors 400, nbytes wrong
| root 258 inode 4838894 errors 400, nbytes wrong
| root 258 inode 4838895 errors 400, nbytes wrong
| found 41852229430 bytes used err is 1
| total csum bytes: 619630328
| total tree bytes: 3216027648
| total fs tree bytes: 2342981632
| total extent tree bytes: 135536640
| btree space waste bytes: 767795634
| file data blocks allocated: 1744289230848
|  referenced 631766474752
| Btrfs v3.12

The trouble is, these errors do not go away if I run btrfsck --repair
a second time, which implies that the problems have not actually been
corrected.  As the corruption has already caused one kernel oops (see
https://bugzilla.kernel.org/show_bug.cgi?id=68411 ) I am reluctant to
remount the file system until I am sure no corruption remains.  I did
try mounting it (and immediately unmounting it again) and that did
seem to do some sort of additional check ("checking UUID tree") but a
subsequent btrfsck run still prints the same errors.

Any advice on how to fix this so it stays fixed would be appreciated.

zw


[PATCH] btrfs-progs: skip non-regular files while defragmenting

2014-01-09 Thread Pascal VITOUX
Skip non-regular files to avoid ioctl errors while defragmenting.

They are silently ignored in recursive mode but reported as errors when
used as command-line arguments.

Signed-off-by: Pascal VITOUX 
---
 cmds-filesystem.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 1c1926b..54fba10 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -646,7 +646,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
int e = 0;
int fd = 0;
 
-   if (typeflag == FTW_F) {
+   if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
fd = open(fpath, O_RDWR);
@@ -748,6 +748,7 @@ static int cmd_defrag(int argc, char **argv)
defrag_global_range.flags |= BTRFS_DEFRAG_RANGE_START_IO;
 
for (i = optind; i < argc; i++) {
+   struct stat st;
dirstream = NULL;
fd = open_file_or_dir(argv[i], &dirstream);
if (fd < 0) {
@@ -757,16 +758,21 @@ static int cmd_defrag(int argc, char **argv)
close_file_or_dir(fd, dirstream);
continue;
}
+   if (fstat(fd, &st)) {
+   fprintf(stderr, "ERROR: failed to stat %s - %s\n",
+   argv[i], strerror(errno));
+   defrag_global_errors++;
+   close_file_or_dir(fd, dirstream);
+   continue;
+   }
+   if (!(S_ISDIR(st.st_mode) || S_ISREG(st.st_mode))) {
+   fprintf(stderr, "ERROR: %s is not a directory or a "
+   "regular file.\n", argv[i]);
+   defrag_global_errors++;
+   close_file_or_dir(fd, dirstream);
+   continue;
+   }
if (recursive) {
-   struct stat st;
-
-   if (fstat(fd, &st)) {
-   fprintf(stderr, "ERROR: failed to stat %s - %s\n",
-   argv[i], strerror(errno));
-   defrag_global_errors++;
-   close_file_or_dir(fd, dirstream);
-   continue;
-   }
if (S_ISDIR(st.st_mode)) {
ret = nftw(argv[i], defrag_callback, 10,
FTW_MOUNT | FTW_PHYS);
-- 
1.8.5.2



[PATCH] Btrfs: setup inode location during btrfs_init_inode_locked

2014-01-09 Thread Chris Mason
We have a race during inode init because the BTRFS_I(inode)->location is set up
after the inode hash table lock is dropped.  btrfs_find_actor uses the location
field, so our search might not find an existing inode in the hash table if we
race with the inode init code.

This commit changes things to set up the location field sooner.  Also, the find actor now
uses only the location objectid to match inodes.  For inode hashing, we just
need a unique and stable test, it doesn't have to reflect the inode numbers we
show to userland.

Signed-off-by: Chris Mason 

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9eaa1c8..8010b49 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,7 +60,7 @@
 #include "hash.h"
 
 struct btrfs_iget_args {
-   u64 ino;
+   struct btrfs_key *location;
struct btrfs_root *root;
 };
 
@@ -4932,7 +4932,9 @@ again:
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
struct btrfs_iget_args *args = p;
-   inode->i_ino = args->ino;
+   inode->i_ino = args->location->objectid;
+   memcpy(&BTRFS_I(inode)->location, args->location,
+  sizeof(*args->location));
BTRFS_I(inode)->root = args->root;
return 0;
 }
@@ -4940,19 +4942,19 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
 static int btrfs_find_actor(struct inode *inode, void *opaque)
 {
struct btrfs_iget_args *args = opaque;
-   return args->ino == btrfs_ino(inode) &&
+   return args->location->objectid == BTRFS_I(inode)->location.objectid &&
args->root == BTRFS_I(inode)->root;
 }
 
 static struct inode *btrfs_iget_locked(struct super_block *s,
-  u64 objectid,
+  struct btrfs_key *location,
   struct btrfs_root *root)
 {
struct inode *inode;
struct btrfs_iget_args args;
-   unsigned long hashval = btrfs_inode_hash(objectid, root);
+   unsigned long hashval = btrfs_inode_hash(location->objectid, root);
 
-   args.ino = objectid;
+   args.location = location;
args.root = root;
 
inode = iget5_locked(s, hashval, btrfs_find_actor,
@@ -4969,13 +4971,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 {
struct inode *inode;
 
-   inode = btrfs_iget_locked(s, location->objectid, root);
+   inode = btrfs_iget_locked(s, location, root);
if (!inode)
return ERR_PTR(-ENOMEM);
 
if (inode->i_state & I_NEW) {
-   BTRFS_I(inode)->root = root;
-   memcpy(&BTRFS_I(inode)->location, location, sizeof(*location));
btrfs_read_locked_inode(inode);
if (!is_bad_inode(inode)) {
inode_tree_add(inode);



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Murphy

On Jan 9, 2014, at 12:13 PM, Kyle Gates  wrote:

> On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>> 
>> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>> 
>>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
 Hi,
 
 I am running write-intensive (well sort of, one write every 10s)
 workloads on cheap flash media which proved to be horribly unreliable.
 A 32GB microSDHC card reported bad blocks after 4 days, while a usb
 pen drive returns bogus data without any warning at all.
 
 So I wonder, how would btrfs behave in raid1 on two such devices?
 Would it simply mark bad blocks as "bad" and continue to be
 operational, or will it bail out when some block can not be
 read/written anymore on one of the two devices?
>>> 
>>> If a block is read and fails its checksum, then the other copy (in
>>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>>> use the good data.
>>> 
>>> If the block is bad such that writing to it won't fix it, then
>>> there's probably two cases: the device returns an IO error, in which
>>> case I suspect (but can't be sure) that the FS will go read-only. Or
>>> the device silently fails the write and claims success, in which case
>>> you're back to the situation above of the block failing its checksum.
>> 
>> In a normally operating drive, when the drive firmware locates a physical 
>> sector with persistent write failures, it's dereferenced. So the LBA points 
>> to a reserve physical sector, and the original can't be accessed by LBA. If 
>> all of the reserve sectors get used up, the next persistent write failure 
>> will result in a write error reported to libata and this will appear in 
>> dmesg, and should be treated as the drive being no longer in normal 
>> operation. It's a drive useful for storage developers, but not for 
>> production usage.
>> 
>>> There's no marking of bad blocks right now, and I don't know of
>>> anyone working on the feature, so the FS will probably keep going back
>>> to the bad blocks as it makes CoW copies for modification.
>> 
>> This is maybe relevant:
>> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>> 
>> "READ and WRITE commands report CHS or LBA of the first failed sector but 
>> ATA/ATAPI standard specifies that the amount of transferred data on error 
>> completion is indeterminate, so we cannot assume that sectors preceding the 
>> failed sector have been transferred and thus cannot complete those sectors 
>> successfully as SCSI does."
>> 
>> If I understand that correctly, Btrfs really ought to either punt the 
>> device, or make the whole volume read-only. For production use, going 
>> read-only very well could mean data loss, even while preserving the state of 
>> the file system. Eventually I'd rather see the offending device ejected from 
>> the volume, and for the volume to remain rw,degraded.
> 
> I would like to see btrfs hold onto the device in a read-only state like is 
> done during a device replace operation. New writes would maintain the raid 
> level but go out to the remaining devices and only go full filesystem 
> read-only if the minimum number of writable devices is not met. Once a new 
> device is added in, the replace operation could commence and drop the bad 
> device when complete.   

Sure, that's a fine optimization for a bad device to be read-only while the 
volume is still rw, if that's possible.

Chris Murphy



RE: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Kyle Gates
On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>
> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>
>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I am running write-intensive (well sort of, one write every 10s)
>>> workloads on cheap flash media which proved to be horribly unreliable.
>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>>> pen drive returns bogus data without any warning at all.
>>>
>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>> Would it simply mark bad blocks as "bad" and continue to be
>>> operational, or will it bail out when some block can not be
>>> read/written anymore on one of the two devices?
>>
>> If a block is read and fails its checksum, then the other copy (in
>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>> use the good data.
>>
>> If the block is bad such that writing to it won't fix it, then
>> there's probably two cases: the device returns an IO error, in which
>> case I suspect (but can't be sure) that the FS will go read-only. Or
>> the device silently fails the write and claims success, in which case
>> you're back to the situation above of the block failing its checksum.
>
> In a normally operating drive, when the drive firmware locates a physical 
> sector with persistent write failures, it's dereferenced. So the LBA points 
> to a reserve physical sector, and the original can't be accessed by LBA. If all 
> of the reserve sectors get used up, the next persistent write failure will 
> result in a write error reported to libata and this will appear in dmesg, and 
> should be treated as the drive being no longer in normal operation. It's a 
> drive useful for storage developers, but not for production usage.
>
>> There's no marking of bad blocks right now, and I don't know of
>> anyone working on the feature, so the FS will probably keep going back
>> to the bad blocks as it makes CoW copies for modification.
>
> This is maybe relevant:
> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>
> "READ and WRITE commands report CHS or LBA of the first failed sector but 
> ATA/ATAPI standard specifies that the amount of transferred data on error 
> completion is indeterminate, so we cannot assume that sectors preceding the 
> failed sector have been transferred and thus cannot complete those sectors 
> successfully as SCSI does."
>
> If I understand that correctly, Btrfs really ought to either punt the device, 
> or make the whole volume read-only. For production use, going read-only very 
> well could mean data loss, even while preserving the state of the file 
> system. Eventually I'd rather see the offending device ejected from the 
> volume, and for the volume to remain rw,degraded.

I would like to see btrfs hold onto the device in a read-only state like is 
done during a device replace operation. New writes would maintain the raid 
level but go out to the remaining devices and only go full filesystem read-only 
if the minimum number of writable devices is not met. Once a new device is 
added in, the replace operation could commence and drop the bad device when 
complete.


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Murphy

On Jan 9, 2014, at 11:22 AM, Austin S Hemmelgarn  wrote:

> On 2014-01-09 13:08, Chris Murphy wrote:
>> 
>> On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Having checksumming is good, and a second 
>>> copy in case one fails the checksum is nice, but what if they BOTH do?
>>> I'd love to have the choice of (at least) three-way-mirroring, as for me 
>>> that seems the best practical hassle/cost vs. risk balance I could get, 
>>> but it's not yet possible. =:^(
>> 
>> I'm on the fence on n-way. 
>> 
>> HDDs get bigger at a faster rate than their performance improves, so rebuild 
>> times keep getting higher. For cases where the data is really important, 
>> backup-restore doesn't provide the necessary uptime, and minimum single 
>> drive performance is needed, it can make sense to want three copies.
>> 
>> But what's the probability of both drives in a mirrored raid set dying, 
>> compared to something else in the storage stack dying? I think at 3 copies, 
>> you've got other risks that the 3rd copy doesn't manage, like a power 
>> supply, controller card, or logic board dying.
>> 
> The risk isn't as much both drives dying at the same time as one dying
> during a rebuild of the array, which is more and more likely as drives
> get bigger and bigger.

Understood. I'm considering a 2nd drive dying during rebuild (from a 1st drive 
dying) as essentially simultaneous failures. And in the case of raid10, the 
likelihood of the 2nd failed drive being the lonesome drive in a mirrored set is 
statistically very low. The next drive to fail is going to be some other 
drive in the array, which still has a mirror.

I'm not saying there's no value in n-way. I'm just saying adding more 
redundancy only addresses one particular failure vector, which is still probably 
less likely than losing a power supply or a controller, or even user-induced 
data loss that ends up affecting all three copies anyway.

And yes, it's easier to just add drives and make 3 copies than it is to set up 
a cluster. But that's the trade-off when using such high-density drives that 
the rebuild times prompt consideration of adding even more high-density drives 
to solve the problem. 




Chris Murphy


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Eleftheriou
Thanks Hugo,

Since:

-- I keep daily backups
-- all 4 devices are of the same size

I think I can test it (as soon as I have some time to spend in the
transition to BTRFS) and verify your assumptions (...and get my wish)



>If you have an even number of devices and all the devices are the
> same size, then:
>
>  * the block group allocator will use all the devices each time
>  * the amount of space on each device will always be the same
>
> If the sort in the allocator is stable and resolves ties in free space
> by using the device ID number, the above properties should guarantee
> that the allocation is stable, so each new block group will have the
> same functional chunk on the same device, and you get your wish.
>
>It's been a few months since I looked at that code, but I don't
> recall seeing anything directly contradictory to the above
> assumptions.
>
>Of course, if you have an odd number of devices, the allocator will
> omit a different device on each block group, and you lose the ability
> to survive (some) two-device failures. I suspect that the odds of
> surviving a two-device failure are still non-zero, but less than if
> you had an even number of devices. I'm not about to attempt an
> ab-initio computation of the probabilities, but it shouldn't be too
> hard to do either a monte-carlo simulation or a simple brute-force
> enumeration of the possibilities for a given configuration.
>
>Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
>   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
>---  My code is never released,  it escapes from the ---
>   git repo and kills a few beta testers on the way out


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Murphy

On Jan 9, 2014, at 3:42 AM, Hugo Mills  wrote:

> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>> Hi,
>> 
>> I am running write-intensive (well sort of, one write every 10s)
>> workloads on cheap flash media which proved to be horribly unreliable.
>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>> pen drive returns bogus data without any warning at all.
>> 
>> So I wonder, how would btrfs behave in raid1 on two such devices?
>> Would it simply mark bad blocks as "bad" and continue to be
>> operational, or will it bail out when some block can not be
>> read/written anymore on one of the two devices?
> 
>   If a block is read and fails its checksum, then the other copy (in
> RAID-1) is checked and used if it's good. The bad copy is rewritten to
> use the good data.
> 
>   If the block is bad such that writing to it won't fix it, then
> there's probably two cases: the device returns an IO error, in which
> case I suspect (but can't be sure) that the FS will go read-only. Or
> the device silently fails the write and claims success, in which case
> you're back to the situation above of the block failing its checksum.

In a normally operating drive, when the drive firmware locates a physical 
sector with persistent write failures, it's dereferenced. So the LBA points to 
a reserve physical sector, and the original can't be accessed by LBA. If all of 
the reserve sectors get used up, the next persistent write failure will result 
in a write error reported to libata and this will appear in dmesg, and should 
be treated as the drive being no longer in normal operation. It's a drive 
useful for storage developers, but not for production usage.

>   There's no marking of bad blocks right now, and I don't know of
> anyone working on the feature, so the FS will probably keep going back
> to the bad blocks as it makes CoW copies for modification.

This is maybe relevant:
https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html

"READ and WRITE commands report CHS or LBA of the first failed sector but 
ATA/ATAPI standard specifies that the amount of transferred data on error 
completion is indeterminate, so we cannot assume that sectors preceding the 
failed sector have been transferred and thus cannot complete those sectors 
successfully as SCSI does."

If I understand that correctly, Btrfs really ought to either punt the device, 
or make the whole volume read-only. For production use, going read-only very 
well could mean data loss, even while preserving the state of the file system. 
Eventually I'd rather see the offending device ejected from the volume, and for 
the volume to remain rw,degraded.


Chris Murphy



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Austin S Hemmelgarn
On 2014-01-09 13:08, Chris Murphy wrote:
> 
> On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Having checksumming is good, and a second 
>> copy in case one fails the checksum is nice, but what if they BOTH do?
>> I'd love to have the choice of (at least) three-way-mirroring, as for me 
>> that seems the best practical hassle/cost vs. risk balance I could get, 
>> but it's not yet possible. =:^(
> 
> I'm on the fence on n-way. 
> 
> HDDs get bigger at a faster rate than their performance improves, so rebuild 
> times keep getting higher. For cases where the data is really important, 
> backup-restore doesn't provide the necessary uptime, and minimum single drive 
> performance is needed, it can make sense to want three copies.
> 
> But what's the probability of both drives in a mirrored raid set dying, 
> compared to something else in the storage stack dying? I think at 3 copies, 
> you've got other risks that the 3rd copy doesn't manage, like a power supply, 
> controller card, or logic board dying.
> 
The risk isn't as much both drives dying at the same time as one dying
during a rebuild of the array, which is more and more likely as drives
get bigger and bigger.



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Austin S Hemmelgarn
On 2014-01-09 12:31, Chris Murphy wrote:
> 
> On Jan 9, 2014, at 5:52 AM, Austin S Hemmelgarn
>  wrote:
>> Just a thought, you might consider running btrfs on top of LVM in
>> the interim, it isn't quite as efficient as btrfs by itself, but
>> it does allow N-way mirroring (and the efficiency is much better
>> now that they have switched to RAID1 as the default mirroring
>> backend)
> 
> The problem is that in case of mismatches, it's ambiguous which copies
> are correct.
> 
At the moment that is correct, I've been planning for some time now to
write a patch so that the RAID1 implementation on more than 2 devices
checks what the majority of other devices say about the block, and
then updates all of them with the majority.  Barring a manufacturing
defect or firmware bug, any group of three or more disks is
statistically very unlikely to have a read error at the same place on
each disk until they have accumulated enough bad sectors that they are
totally unusable, so this would allow recovery in a non-degraded RAID1
array in most cases.
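
A conceptual sketch of the vote described above, reduced to plain C over
in-memory copies. The real change would live in the kernel's RAID1 read
path; this only illustrates the decision rule:

#include <string.h>

/*
 * Given n copies of a block, return the index of a copy that a strict
 * majority of the copies agree with, or -1 if there is no majority.
 */
static int majority_copy(const unsigned char *copy[], int n, size_t blocksize)
{
	int i, j, votes;

	for (i = 0; i < n; i++) {
		votes = 0;
		for (j = 0; j < n; j++)
			if (memcmp(copy[i], copy[j], blocksize) == 0)
				votes++;	/* copy i also votes for itself */
		if (2 * votes > n)
			return i;
	}
	return -1;
}

With three or more devices, a single corrupted copy loses the vote and can
be rewritten from the agreeing majority, matching the recovery described
above.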



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Murphy

On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Having checksumming is good, and a second 
> copy in case one fails the checksum is nice, but what if they BOTH do?
> I'd love to have the choice of (at least) three-way-mirroring, as for me 
> that seems the best practical hassle/cost vs. risk balance I could get, 
> but it's not yet possible. =:^(

I'm on the fence on n-way. 

HDDs get bigger at a faster rate than their performance improves, so rebuild 
times keep getting higher. For cases where the data is really important, 
backup-restore doesn't provide the necessary uptime, and minimum single drive 
performance is needed, it can make sense to want three copies.

But what's the probability of both drives in a mirrored raid set dying, 
compared to something else in the storage stack dying? I think at 3 copies, 
you've got other risks that the 3rd copy doesn't manage, like a power supply, 
controller card, or logic board dying.


Chris Murphy



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Eleftheriou
> How is a resilient 2 disk failure possible with four disk raid10?

     _______________   ___ RAID0
   __|__         __|__ ___ RAID1
  |     |       |     |
  A     B       C     D

Losing A+C / A+D / B+C / B+D is survivable.
Losing A+B or C+D is catastrophic.

Sorry, it's my fault. In my urge to praise Duncan's promotion of n-way
mirroring I created a misunderstanding.


re: Btrfs: convert printk to btrfs_ and fix BTRFS prefix

2014-01-09 Thread Dan Carpenter
Hello Frank Holton,

This is a semi-automatic email about new static checker warnings.

The patch f2ee0bf65a1c: "Btrfs: convert printk to btrfs_ and fix 
BTRFS prefix" from Dec 20, 2013, leads to the following Smatch 
complaint:

fs/btrfs/super.c:298 __btrfs_panic()
 error: we previously assumed 'fs_info' could be null (see line 294)

fs/btrfs/super.c
   293  errstr = btrfs_decode_error(errno);
   294          if (fs_info && (fs_info->mount_opt & BTRFS_MOUNT_PANIC_ON_FATAL_ERROR))
^^^
Existing check.

   295                  panic(KERN_CRIT "BTRFS panic (device %s) in %s:%d: %pV (errno=%d %s)\n",
   296                          s_id, function, line, &vaf, errno, errstr);
   297  
   298  btrfs_crit(fs_info, "panic in %s:%d: %pV (errno=%d %s)",
   ^^^
Patch introduces new unchecked dereference inside btrfs_crit().

   299                  function, line, &vaf, errno, errstr);
   300  va_end(args);

regards,
dan carpenter


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Hugo Mills
On Thu, Jan 09, 2014 at 06:34:23PM +0100, George Eleftheriou wrote:
> > claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive
> > an arbitrary 2-device failure is incorrect.
> 
> Yes, you are right.  I didn't mean "any 2 devices". I should have
> added "from different mirrors" :)

   If you have an even number of devices and all the devices are the
same size, then:

 * the block group allocator will use all the devices each time
 * the amount of space on each device will always be the same

If the sort in the allocator is stable and resolves ties in free space
by using the device ID number, the above properties should guarantee
that the allocation is stable, so each new block group will have the
same functional chunk on the same device, and you get your wish.

   It's been a few months since I looked at that code, but I don't
recall seeing anything directly contradictory to the above
assumptions.

   Of course, if you have an odd number of devices, the allocator will
omit a different device on each block group, and you lose the ability
to survive (some) two-device failures. I suspect that the odds of
surviving a two-device failure are still non-zero, but less than if
you had an even number of devices. I'm not about to attempt an
ab-initio computation of the probabilities, but it shouldn't be too
hard to do either a monte-carlo simulation or a simple brute-force
enumeration of the possibilities for a given configuration.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   ---  My code is never released,  it escapes from the ---   
  git repo and kills a few beta testers on the way out   
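
As a rough cut at the brute-force enumeration suggested above, here is a toy
model. The allocator in it is a deliberate simplification (each block group
omits one device in rotation and pairs the remaining devices in order);
btrfs's real free-space-driven allocation may behave differently:

#include <stdio.h>

#define NDEV 5			/* odd device count, per the discussion */
#define NPAIR ((NDEV - 1) / 2)	/* mirror pairs per block group */

int main(void)
{
	int pair[NDEV][NPAIR][2];	/* one block group per omitted device */
	int g, p, i, j, d, survived = 0, total = 0;

	/* Toy allocator: group g omits device g, pairs the rest in order. */
	for (g = 0; g < NDEV; g++) {
		int devs[NDEV - 1], n = 0;

		for (d = 0; d < NDEV; d++)
			if (d != g)
				devs[n++] = d;
		for (p = 0; p < NPAIR; p++) {
			pair[g][p][0] = devs[2 * p];
			pair[g][p][1] = devs[2 * p + 1];
		}
	}

	/* A two-device failure is fatal iff some block group mirrors a
	 * chunk on exactly those two devices. */
	for (i = 0; i < NDEV; i++) {
		for (j = i + 1; j < NDEV; j++) {
			int fatal = 0;

			for (g = 0; g < NDEV; g++)
				for (p = 0; p < NPAIR; p++)
					if ((pair[g][p][0] == i && pair[g][p][1] == j) ||
					    (pair[g][p][0] == j && pair[g][p][1] == i))
						fatal = 1;
			total++;
			survived += !fatal;
		}
	}
	printf("%d of %d two-device failures survivable\n", survived, total);
	return 0;
}

For NDEV = 5 this prints 4 of 10 survivable, giving a concrete feel for the
non-zero-but-reduced odds described above for odd device counts.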




Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Eleftheriou
> claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive
> an arbitrary 2-device failure is incorrect.

Yes, you are right.  I didn't mean "any 2 devices". I should have
added "from different mirrors" :)


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Murphy

On Jan 9, 2014, at 5:52 AM, Austin S Hemmelgarn  wrote:
> Just a thought, you might consider running btrfs on top of LVM in the
> interim, it isn't quite as efficient as btrfs by itself, but it does
> allow N-way mirroring (and the efficiency is much better now that they
> have switched to RAID1 as the default mirroring backend)

The problem is that in case of mismatches, it's ambiguous which copies are correct.


Chris Murphy



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Murphy

On Jan 9, 2014, at 9:49 AM, George Eleftheriou  wrote:
> 
> I'm really looking forward to the day that typing:
> 
> mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]
> 
> will do exactly what it is expected to do. A true RAID10, resilient to a
> 2-disk failure. Simple and beautiful.


How is resilience to a 2-disk failure possible with a four-disk raid10?

Chris Murphy



Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Hugo Mills
On Thu, Jan 09, 2014 at 05:49:48PM +0100, George Eleftheriou wrote:
> Duncan,
> 
> As a silent reader of this list (for almost a year)...
> As an anonymous supporter of the BAARF (Battle Against Any RAID
> Four/Five/Six/ Z etc...) initiative...
> 
> I can only break my silence and applaud your frequent interventions
> referring to N-Way mirroring (searching the list for the string
> "n-way" brings up almost exclusively your posts, at least in recent
> times).
> 
> Because that's what I'm also eager to see implemented in BTRFS and
> somehow felt disappointed that it wasn't given priority over the
> parity solutions...
> 
> I currently use ZFS on Linux in a 4-disk RAID10 (performing pretty
> good by the way) being stuck with the 3.11 kernel because of DKMS
> issues and not being able to share by SMB or NFS because of some bugs.
> 
> I'm really looking forward to the day that typing:
> 
> mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]
> 
> will do exactly what it is expected to do. A true RAID10, resilient to a
> 2-disk failure. Simple and beautiful.

   RAID-10 isn't guaranteed to be robust against two devices failing.
Not just the btrfs implementation -- any RAID-10 will die if the wrong
two devices fail. In the simplest case:

A }
B } Mirrored }
             }
C }          }
D } Mirrored } Striped
             }
E }          }
F } Mirrored }

   If A and B both die, then you're stuffed. (For the four-disk case,
just remove E and F from the diagram).

   If you want to talk odds, then that's OK, I'll admit that btrfs
doesn't necessarily do as well there(*) as the scheme above. But
claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive
an arbitrary 2-device failure is incorrect.

   Hugo.

(*) Actually, I suspect that with even numbers of equal-sized disks,
it'll do just as well, but I'm not willing to guarantee that behaviour
without hacking up the allocator a bit to add the capability.

> We're almost there...
> 
> Best regards to all BTRFS developers/contributors

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- But people have always eaten people,  / what else is there to ---  
 eat?  / If the Juju had meant us not to eat people / he 
 wouldn't have made us of meat.  




Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread George Eleftheriou
Duncan,

As a silent reader of this list (for almost a year)...
As an anonymous supporter of the BAARF (Battle Against Any RAID
Four/Five/Six/ Z etc...) initiative...

I can only break my silence and applaud your frequent interventions
referring to N-Way mirroring (searching the list for the string
"n-way" brings up almost exclusively your posts, at least in recent
times).

Because that's what I'm also eager to see implemented in BTRFS and
somehow felt disappointed that it wasn't given priority over the
parity solutions...

I currently use ZFS on Linux in a 4-disk RAID10 (performing pretty
good by the way) being stuck with the 3.11 kernel because of DKMS
issues and not being able to share by SMB or NFS because of some bugs.

I'm really looking forward to the day that typing:

mkfs.btrfs -d raid10 -m raid10  /dev/sd[abcd]

will do exactly what it is expected to do. A true RAID10, resilient to a
2-disk failure. Simple and beautiful.

We're almost there...

Best regards to all BTRFS developers/contributors


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Duncan
Austin S Hemmelgarn posted on Thu, 09 Jan 2014 07:52:44 -0500 as
excerpted:

> On 2014-01-09 07:41, Duncan wrote:
>> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted:
>> 
>>> If a [btrfs ]block is read and fails its checksum, then the other
>>> copy (in RAID-1) is checked and used if it's good. The bad copy is
>>> rewritten to use the good data.
>> 
>> This is why I'm so looking forward to the planned N-way-mirroring,
>> aka true-raid-1, feature, as opposed to btrfs' current 2-way-only
>> mirroring.  Having checksumming is good, and a second copy in case
>> one fails the checksum is nice, but what if they BOTH do? I'd love
>> to have the choice of (at least) three-way-mirroring, as for me
>> that seems the best practical hassle/cost vs. risk balance I
>> could get, but it's not yet possible. =:^(
>> 
> Just a thought, you might consider running btrfs on top of LVM in the
> interim, it isn't quite as efficient as btrfs by itself, but it does
> allow N-way mirroring (and the efficiency is much better now that they
> have switched to RAID1 as the default mirroring backend)

Except... AFAIK LVM is like mdraid in that regard -- no checksums, 
leaving the software entirely at the mercy of the hardware's ability to 
detect and properly report failure.

In fact, it's exactly as bad as that, since while both lvm and mdraid 
offer N-way-mirroring, they generally only fetch a single unchecksummed 
copy from whatever mirror they happen to choose to request it from, and 
use whatever they get without even a comparison againt the other copies 
to see if they match or majority vote on which is the valid copy if 
something doesn't match.  The ONLY way they know there's an error (unless 
the hardware reports one) at all is if a deliberate scrub is done.

And the raid5/6 parity-checking isn't any better, as while those parities 
are written, they're never checked or otherwise actually used except in 
recovery.  Normal read operation is just like raid0; only the device(s) 
containing the data itself is(are) read, no parity/checksum checking at 
all, even tho the trouble was taken to calculate and write it out.  When 
I had mdraid6 deployed and realized that, I switched back to raid1 (which 
would have been raid10 on a larger system), because while I considered 
the raid6 performance costs worth it for parity checking, they most 
definitely weren't once I realized all those calculations and writes were 
for nothing unless an actual device died, and raid1 gave me THAT level of 
protection at far better performance.

Which means neither lvm nor mdraid solve the problem at all.  Even btrfs 
on top of them won't solve the problem, while adding all sorts of 
complexity, because btrfs still has only the two-way check, and if one 
device gets corrupted in the underlying mirrors but another actually 
returns the data, btrfs will be entirely oblivious.

What one /could/ in theory do at the moment, altho it's hardly worth it 
due to the complexity[1] and the fact that btrfs itself is still a 
relatively immature filesystem under heavy development, and thus not 
suited to being part of such extreme solutions yet, is layered raid1 
btrfs on loopback over raid1 btrfs, say four devices, separate on-the-
hardware-device raid1 btrfs on two pairs, with a single huge loopback-
file on each lower-level btrfs, with raid1 btrfs layered on top of the 
loopback devices, too, manually creating an effective 4-device btrfs 
raid11.  Or use btrf raid10 at one or the other level and make it an 8-
device btrfs raid101 or raid110.  Tho as I said btrfs maturity level in 
general is a mismatch for such extreme measures, at present.  But in 
theory...

Zfs is arguably a more practically viable solution as it's mature and 
ready for deployment today, tho there's legal/license issues with the 
Linux kernel module and the usual userspace performance issues (tho the 
btrfs-on-loopback-on-btrfs solution above wouldn't be performance issue 
free either) with the fuse alternative.

I'm sure that's why a lot of folks needing multi-mirror checksum-verified 
reliability remain on Solaris/OpenIndiana/ZFS-on-BSD, as Linux simply 
doesn't /have/ a solution for that yet.  Btrfs /will/ have it, but as I 
explained, it's taking awhile.

---
[1] Complexity: Complexity can be the PRIMARY failure factor when an 
admin must understand enough about the layout to reliably manage recovery 
when they're already under the extreme pressure of a disaster recovery 
situation.  If complexity in even an otherwise 100% reliable solution is 
high enough that an admin isn't confident of his ability to manage it, then 
the admin themself becomes the weak link in the reliability chain!!  
That's the reason I tried and ultimately dropped lvm over mdraid here, 
since I couldn't be confident in my ability to understand both well 
enough to recover from disaster without admin error.  Thus, higher 
complexity really *IS* a SERIOUS negative in this sort of discussion, 
since it can be *T

Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Chris Mason
On Thu, 2014-01-09 at 12:41 +, Duncan wrote:
> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted:
> 
> > On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
> >> Hi,
> >> 
> >> I am running write-intensive (well sort of, one write every 10s)
> >> workloads on cheap flash media which proved to be horribly unreliable.
> >> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen
> >> drive returns bogus data without any warning at all.
> >> 
> >> So I wonder, how would btrfs behave in raid1 on two such devices? Would
> >> it simply mark bad blocks as "bad" and continue to be operational, or
> >> will it bail out when some block can not be read/written anymore on one
> >> of the two devices?
> > 
> > If a block is read and fails its checksum, then the other copy (in
> > RAID-1) is checked and used if it's good. The bad copy is rewritten to
> > use the good data.
> 
> This is why I'm (semi-impatiently, but not being a coder, I have little 
> choice, and I do see advances happening) so looking forward to the 
> planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' 
> current 2-way-only mirroring.  Having checksumming is good, and a second 
> copy in case one fails the checksum is nice, but what if they BOTH do?
> I'd love to have the choice of (at least) three-way-mirroring, as for me 
> that seems the best practical hassle/cost vs. risk balance I could get, 
> but it's not yet possible. =:^(
> 
> For (at least) a year now, the roadmap has had N-way-mirroring on the list 
> for after raid5/6 as they want to build on its features, but (like much 
> of the btrfs work) raid5/6 took about three kernels longer to introduce 
> than originally thought, and even when introduced, the raid5/6 feature 
> lacked some critical parts (like scrub) and wasn't considered real-world 
> usable, as integrity over a crash and/or device failure, the primary 
> feature of raid5/6, couldn't be assured.  

I'm frustrated too that I haven't pushed this out yet.  I've been trying
different methods to keep the performance up, and in the end tried to
pile on too many other features in the patches.  So, I'm breaking it up
a bit and reworking things for faster release.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels

2014-01-09 Thread Chris Mason
On Thu, 2014-01-09 at 12:16 +, Hugo Mills wrote:
> On Thu, Jan 09, 2014 at 12:49:48PM +0100, Stefan Behrens wrote:
> > On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:
> > > Some users complained that with the latest btrfs-progs, they
> > > fail to use send/receive. The problem is that the new tool will try
> > > to use the uuid tree, which doesn't work on older kernels.
> > > 
> > > Now we first check if we support the uuid tree; if not, we fall back
> > > to the normal search as before. I copied most of the code from
> > > Alexander Block's previous code and did some adjustments to make it work.
> > > 
> > > Signed-off-by: Alexander Block 
> > > Signed-off-by: Wang Shilong 
> > > ---
> > >  send-utils.c | 352 
> > > ++-
> > >  send-utils.h |  11 ++
> > >  2 files changed, 359 insertions(+), 4 deletions(-)
> > 
> > I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is
> > found. The code that you add will never be tested by anyone and will
> > become broken sooner or later.
> > 
> > The new kernel is compatible with old progs and with new progs. But new
> > progs require a new kernel and IMO this is normal.
> 
>    No. Really, no. I think I would be extremely upset to upgrade, say,
> util-linux, only to discover that I needed a new kernel for cp to
> continue to work. I hope you would be, too.
> 
>    You may need to upgrade the kernel to get new features offered by a
> new userspace, but I think we should absolutely not be changing
> userspace in a way that makes it incompatible with older kernels.

I'd really prefer that we maintain compatibility with the older kernels.
Heavy btrfs usage is going to want a newer kernel anyway, but this is an
important policy to keep in place for the future.

Especially since Wang went to the trouble of making the patch, I'd
rather take it.

-chris

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Austin S Hemmelgarn
On 2014-01-09 07:41, Duncan wrote:
> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted:
> 
>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I am running write-intensive (well sort of, one write every 10s)
>>> workloads on cheap flash media which proved to be horribly unreliable.
>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen
>>> drive returns bogus data without any warning at all.
>>>
>>> So I wonder, how would btrfs behave in raid1 on two such devices? Would
>>> it simply mark bad blocks as "bad" and continue to be operational, or
>>> will it bail out when some block can not be read/written anymore on one
>>> of the two devices?
>>
>> If a block is read and fails its checksum, then the other copy (in
>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>> use the good data.
> 
> This is why I'm (semi-impatiently, but not being a coder, I have little 
> choice, and I do see advances happening) so looking forward to the 
> planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' 
> current 2-way-only mirroring.  Having checksumming is good, and a second 
> copy in case one fails the checksum is nice, but what if they BOTH do?
> I'd love to have the choice of (at least) three-way-mirroring, as for me 
> that seems the best practical hassle/cost vs. risk balance I could get, 
> but it's not yet possible. =:^(
> 
> For (at least) a year now, the roadmap has had N-way-mirroring on the list 
> for after raid5/6 as they want to build on its features, but (like much 
> of the btrfs work) raid5/6 took about three kernels longer to introduce 
> than originally thought, and even when introduced, the raid5/6 feature 
> lacked some critical parts (like scrub) and wasn't considered real-world 
> usable, as integrity over a crash and/or device failure, the primary 
> feature of raid5/6, couldn't be assured.  That itself was about three 
> kernels ago now, and the raid5/6 functionality remains partial -- it 
> writes the data and parities as it should, but scrub and recovery remain 
> only partially coded, so it looks like that'll /still/ be a few more 
> kernels before that's fully implemented and most bugs worked out, with 
> very likely a similar story to play out for N-way-mirroring after that, 
> thus placing it late this year for introduction and early next for 
> actually usable stability.
> 
> But it remains on the roadmap and btrfs should have it... eventually.  
> Meanwhile, I keep telling myself that this is filesystem code which a LOT 
> of folks including me stake the survival of their data on, and I along 
> with all the others definitely prefer it done CORRECTLY, even if it takes 
> TEN years longer than intended, than have it sloppily and unreliably 
> implemented sooner.
> 
> But it's still hard to wait, when sometimes I begin to think of it like 
> that carrot suspended in front of the donkey, never to actually be 
> reached.  Except... I *DO* see changes, and after originally taking off 
> for a few months after my original btrfs investigation, finding it 
> unusable in its then-current state, upon coming back about 5 months 
> later, actual usability and stability on current features had improved to 
> the point that I'm actually using it now, so there's certainly progress 
> being made, and the fact that I'm actually using it now attests to that 
> progress *NOT* being a simple illusion.  So it'll come, even if it /does/ 
> sometimes seem it's Duke-Nukem-Forever.
> 
Just a thought: you might consider running btrfs on top of LVM in the
interim.  It isn't quite as efficient as btrfs by itself, but it does
allow N-way mirroring (and the efficiency is much better now that they
have switched to RAID1 as the default mirroring backend).
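
A minimal sketch of that setup, assuming three hypothetical PVs 
/dev/sd[a-c] and illustrative names and sizes (lvcreate's raid1 
segment type takes -m as the number of additional mirrors, so -m 2 
means three copies):

# pvcreate /dev/sda /dev/sdb /dev/sdc
# vgcreate vg0 /dev/sda /dev/sdb /dev/sdc
# lvcreate --type raid1 -m 2 -L 100G -n data vg0
# mkfs.btrfs /dev/vg0/data
# mount /dev/vg0/data /mnt/data

btrfs checksumming still detects corruption on such a stack; the 
tradeoff, as noted earlier in the thread, is that the LVM layer picks 
which mirror to read without consulting those checksums.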
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs-progs: fix to make list specified directory's subvolumes work

2014-01-09 Thread Wang Shilong
Steps to reproduce:
 # mkfs.btrfs -f /dev/sda8
 # mount /dev/sda8 /mnt
 # mkdir /mnt/subvolumes
 # btrfs sub create /mnt/subvolumes/subv1
 # btrfs sub create /mnt/subvolumes/subv1/subv1.1
 # btrfs sub list -o /mnt/subvolumes/subv1
path);
len = add_len;
}
+   if (!ri->top_id)
+   ri->top_id = found->ref_tree;
 
next = found->ref_tree;
-
-   if (next == top_id) {
-   ri->top_id = top_id;
+   if (next == top_id)
break;
-   }
-
/*
* if the ref_tree = BTRFS_FS_TREE_OBJECTID,
* we are at the top
*/
-   if (next == BTRFS_FS_TREE_OBJECTID) {
-   ri->top_id = next;
+   if (next == BTRFS_FS_TREE_OBJECTID)
break;
-   }
-
/*
* if the ref_tree wasn't in our tree of roots, the
* subvolume was deleted.
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Duncan
Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted:

> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>> Hi,
>> 
>> I am running write-intensive (well sort of, one write every 10s)
>> workloads on cheap flash media which proved to be horribly unreliable.
>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen
>> drive returns bogus data without any warning at all.
>> 
>> So I wonder, how would btrfs behave in raid1 on two such devices? Would
>> it simply mark bad blocks as "bad" and continue to be operational, or
>> will it bail out when some block can not be read/written anymore on one
>> of the two devices?
> 
> If a block is read and fails its checksum, then the other copy (in
> RAID-1) is checked and used if it's good. The bad copy is rewritten to
> use the good data.

This is why I'm (semi-impatiently, but not being a coder, I have little 
choice, and I do see advances happening) so looking forward to the 
planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' 
current 2-way-only mirroring.  Having checksumming is good, and a second 
copy in case one fails the checksum is nice, but what if they BOTH do?
I'd love to have the choice of (at least) three-way-mirroring, as for me 
that seems the best practical hassle/cost vs. risk balance I could get, 
but it's not yet possible. =:^(

For (at least) a year now, the roadmap has had N-way-mirroring on the list 
for after raid5/6 as they want to build on its features, but (like much 
of the btrfs work) raid5/6 took about three kernels longer to introduce 
than originally thought, and even when introduced, the raid5/6 feature 
lacked some critical parts (like scrub) and wasn't considered real-world 
usable, as integrity over a crash and/or device failure, the primary 
feature of raid5/6, couldn't be assured.  That itself was about three 
kernels ago now, and the raid5/6 functionality remains partial -- it 
writes the data and parities as it should, but scrub and recovery remain 
only partially coded, so it looks like that'll /still/ be a few more 
kernels before that's fully implemented and most bugs worked out, with 
very likely a similar story to play out for N-way-mirroring after that, 
thus placing it late this year for introduction and early next for 
actually usable stability.

But it remains on the roadmap and btrfs should have it... eventually.  
Meanwhile, I keep telling myself that this is filesystem code which a LOT 
of folks including me stake the survival of their data on, and I along 
with all the others definitely prefer it done CORRECTLY, even if it takes 
TEN years longer than intended, than have it sloppily and unreliably 
implemented sooner.

But it's still hard to wait, when sometimes I begin to think of it like 
that carrot suspended in front of the donkey, never to actually be 
reached.  Except... I *DO* see changes, and after originally taking off 
for a few months after my original btrfs investigation, finding it 
unusable in its then-current state, upon coming back about 5 months 
later, actual usability and stability on current features had improved to 
the point that I'm actually using it now, so there's certainly progress 
being made, and the fact that I'm actually using it now attests to that 
progress *NOT* being a simple illusion.  So it'll come, even if it /does/ 
sometimes seem it's Duke-Nukem-Forever.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: REQ: btrfs list option

2014-01-09 Thread Wang Shilong

On 01/09/2014 12:06 PM, Alex wrote:
> Chris Murphy  colorremedies.com> writes:
>> Specify the mount point for the Btrfs file system and it will list all
>> subvols on that file system.
>>
>> Chris Murphy
>
> Thank you Chris.
>
> When I do that on my version of the 3.12 userland:
> # btrfs  sub list / -o

There is a bug in 'btrfs sub list -o path'; I will send a patch for this.

Thanks,
Wang

> returns nothing (with no error), which I wasn't quite expecting because
> there *are* other snapshots and subvols below '/'
>
> AND
>
> # btrfs  sub list / -s
> correctly lists the snapshots only.
>
> I don't understand what, or if, I'm doing something wrong.
> Thank you in advance.
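
(As a quick illustration of what -o is supposed to do -- scratch 
filesystem mounted at /mnt, paths hypothetical:

# btrfs sub create /mnt/subv1
# btrfs sub create /mnt/subv1/subv1.1
# btrfs sub list -o /mnt/subv1

The last command should list just subv1.1, the subvolume directly 
below the given path; with the bug it prints nothing.)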



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels

2014-01-09 Thread Hugo Mills
On Thu, Jan 09, 2014 at 12:49:48PM +0100, Stefan Behrens wrote:
> On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:
> > Some users complained that with the latest btrfs-progs, they
> > fail to use send/receive. The problem is that the new tool will try
> > to use the uuid tree, which doesn't work on older kernels.
> > 
> > Now we first check if we support the uuid tree; if not, we fall back
> > to the normal search as before. I copied most of the code from
> > Alexander Block's previous code and did some adjustments to make it work.
> > 
> > Signed-off-by: Alexander Block 
> > Signed-off-by: Wang Shilong 
> > ---
> >  send-utils.c | 352 
> > ++-
> >  send-utils.h |  11 ++
> >  2 files changed, 359 insertions(+), 4 deletions(-)
> 
> I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is
> found. The code that you add will never be tested by anyone and will
> become broken sooner or later.
> 
> The new kernel is compatible with old progs and with new progs. But new
> progs require a new kernel and IMO this is normal.

   No. Really, no. I think I would be extremely upset to upgrade, say,
util-linux, only to discover that I needed a new kernel for cp to
continue to work. I hope you would be, too.

   You may need to upgrade the kernel to get new features offered by a
new userspace, but I think we should absolutely not be changing
userspace in a way that makes it incompatible with older kernels. If
that involves lots of fallback code that checks "is this ioctl
available? if not, use this method instead", then so be it. At this
point, a rewrite of code in the userspace tools should _not_ logically
remove the old code it is replacing, but should keep the old behaviour
to use when the new kernel interfaces it relies on are not present.

> A printf is friendly
> enough in this case.
> 
> IMHO maintaining compatibility in progs with old kernels should be limited
> to code that is small enough to not cost effort and problems in the future.

   The fallback code will remain stable and should require minimal
maintenance, because it will only get run on older kernels -- which,
by definition, won't be changing much. As long as there is no great
refactoring of the old code (maybe a comment to mark it as legacy
support, and that it shouldn't be reworked heavily?), I don't see a
major problem here, even for quite large chunks of code.
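
   For concreteness, the probe itself can be tiny. Here's a minimal
sketch of the "is this available?" check -- assuming the btrfs-progs
headers; the function name and the exact probe are illustrative, not
the code from Wang's patch:

#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>
#include <btrfs/ioctl.h>
#include <btrfs/ctree.h>

/*
 * Ask the kernel to search the UUID tree. A kernel that predates the
 * UUID tree (pre-3.12) has no tree with that objectid, so the search
 * fails with ENOENT and the caller can fall back to the old search.
 */
static int uuid_tree_supported(int mnt_fd)
{
	struct btrfs_ioctl_search_args args;

	memset(&args, 0, sizeof(args));
	args.key.tree_id = BTRFS_UUID_TREE_OBJECTID;
	args.key.max_objectid = (__u64)-1;
	args.key.max_type = (__u32)-1;
	args.key.max_offset = (__u64)-1;
	args.key.max_transid = (__u64)-1;
	args.key.nr_items = 1;

	if (ioctl(mnt_fd, BTRFS_IOC_TREE_SEARCH, &args) < 0)
		return errno == ENOENT ? 0 : -errno;
	return 1;
}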

   Hugo.

> > diff --git a/send-utils.c b/send-utils.c
> > index 874f8a5..1772d2c 100644
> > --- a/send-utils.c
> > +++ b/send-utils.c
> > @@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 
> > root_id,
> > return 0;
> >  }
> >  
> > +static struct rb_node *tree_insert(struct rb_root *root,
> > +  struct subvol_info *si,
> > +  enum subvol_search_type type)
> > +{
> > +   struct rb_node **p = &root->rb_node;
> > +   struct rb_node *parent = NULL;
> > +   struct subvol_info *entry;
> > +   __s64 comp;
> > +
> > +   while (*p) {
> > +   parent = *p;
> > +   if (type == subvol_search_by_received_uuid) {
> > +   entry = rb_entry(parent, struct subvol_info,
> > +   rb_received_node);
> > +
> > +   comp = memcmp(entry->received_uuid, si->received_uuid,
> > +   BTRFS_UUID_SIZE);
> > +   if (!comp) {
> > +   if (entry->stransid < si->stransid)
> > +   comp = -1;
> > +   else if (entry->stransid > si->stransid)
> > +   comp = 1;
> > +   else
> > +   comp = 0;
> > +   }
> > +   } else if (type == subvol_search_by_uuid) {
> > +   entry = rb_entry(parent, struct subvol_info,
> > +   rb_local_node);
> > +   comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE);
> > +   } else if (type == subvol_search_by_root_id) {
> > +   entry = rb_entry(parent, struct subvol_info,
> > +   rb_root_id_node);
> > +   comp = entry->root_id - si->root_id;
> > +   } else if (type == subvol_search_by_path) {
> > +   entry = rb_entry(parent, struct subvol_info,
> > +   rb_path_node);
> > +   comp = strcmp(entry->path, si->path);
> > +   } else {
> > +   BUG();
> > +   }
> > +
> > +   if (comp < 0)
> > +   p = &(*p)->rb_left;
> > +   else if (comp > 0)
> > +   p = &(*p)->rb_right;
> > +   else
> > +   return parent;
> > +   }
> > +
> > +   if (type == subvol_search_by_received_uuid) {
> > +   rb_link_node(&si->rb_received_node, parent, p);
> > +   rb_insert_color(&s

Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels

2014-01-09 Thread Wang Shilong

Hi Stefan,

On 01/09/2014 07:49 PM, Stefan Behrens wrote:

On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:

Some users complained that with the latest btrfs-progs, they
fail to use send/receive. The problem is that the new tool will try
to use the uuid tree, which doesn't work on older kernels.

Now we first check if we support the uuid tree; if not, we fall back to
the normal search as before. I copied most of the code from Alexander
Block's previous code and did some adjustments to make it work.

Signed-off-by: Alexander Block 
Signed-off-by: Wang Shilong 
---
  send-utils.c | 352 ++-
  send-utils.h |  11 ++
  2 files changed, 359 insertions(+), 4 deletions(-)

I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is
found. The code that you add will never be tested by anyone and will
become broken sooner or later.

The new kernel is compatible with old progs and with new progs. But new
progs require a new kernel and IMO this is normal. A printf is friendly
enough in this case.

IMHO maintaining compatibility in progs with old kernels should be limited
to code that is small enough to not cost effort and problems in the future.


First, I'd say sorry that I forgot to CC you.

Yeah, both ways are OK for me. Let's wait and see what other people
think about this. ^_^

Thanks,
Wang




diff --git a/send-utils.c b/send-utils.c
index 874f8a5..1772d2c 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 root_id,
return 0;
  }
  
+static struct rb_node *tree_insert(struct rb_root *root,
+  struct subvol_info *si,
+  enum subvol_search_type type)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct subvol_info *entry;
+   __s64 comp;
+
+   while (*p) {
+   parent = *p;
+   if (type == subvol_search_by_received_uuid) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_received_node);
+
+   comp = memcmp(entry->received_uuid, si->received_uuid,
+   BTRFS_UUID_SIZE);
+   if (!comp) {
+   if (entry->stransid < si->stransid)
+   comp = -1;
+   else if (entry->stransid > si->stransid)
+   comp = 1;
+   else
+   comp = 0;
+   }
+   } else if (type == subvol_search_by_uuid) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_local_node);
+   comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE);
+   } else if (type == subvol_search_by_root_id) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_root_id_node);
+   comp = entry->root_id - si->root_id;
+   } else if (type == subvol_search_by_path) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_path_node);
+   comp = strcmp(entry->path, si->path);
+   } else {
+   BUG();
+   }
+
+   if (comp < 0)
+   p = &(*p)->rb_left;
+   else if (comp > 0)
+   p = &(*p)->rb_right;
+   else
+   return parent;
+   }
+
+   if (type == subvol_search_by_received_uuid) {
+   rb_link_node(&si->rb_received_node, parent, p);
+   rb_insert_color(&si->rb_received_node, root);
+   } else if (type == subvol_search_by_uuid) {
+   rb_link_node(&si->rb_local_node, parent, p);
+   rb_insert_color(&si->rb_local_node, root);
+   } else if (type == subvol_search_by_root_id) {
+   rb_link_node(&si->rb_root_id_node, parent, p);
+   rb_insert_color(&si->rb_root_id_node, root);
+   } else if (type == subvol_search_by_path) {
+   rb_link_node(&si->rb_path_node, parent, p);
+   rb_insert_color(&si->rb_path_node, root);
+   }
+   return NULL;
+}
+
  int btrfs_subvolid_resolve(int fd, char *path, size_t path_len, u64 subvol_id)
  {
if (path_len < 1)
@@ -255,13 +320,101 @@ static int btrfs_subvolid_resolve_sub(int fd, char 
*path, size_t *path_len,
return 0;
  }
  
+static int count_bytes(void *buf, int len, char b)
+{
+   int cnt = 0;
+   int i;
+
+   for (i = 0; i < len; i++) {
+   if (((char *)buf)[i] == b)
+   cnt++;
+   }
+   return cnt;
+}
+
  void subvol_uuid_search_add(struct subvol_uuid_

Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels

2014-01-09 Thread Stefan Behrens
On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:
> Some users complained that with the latest btrfs-progs, they
> fail to use send/receive. The problem is that the new tool will try
> to use the uuid tree, which doesn't work on older kernels.
> 
> Now we first check if we support the uuid tree; if not, we fall back
> to the normal search as before. I copied most of the code from
> Alexander Block's previous code and did some adjustments to make it work.
> 
> Signed-off-by: Alexander Block 
> Signed-off-by: Wang Shilong 
> ---
>  send-utils.c | 352 
> ++-
>  send-utils.h |  11 ++
>  2 files changed, 359 insertions(+), 4 deletions(-)

I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is
found. The code that you add will never be tested by anyone and will
become broken sooner or later.

The new kernel is compatible with old progs and with new progs. But new
progs require a new kernel and IMO this is normal. A printf is friendly
enough in this case.

IMHO maintaining compatibility in progs with old kernels should be limited
to code that is small enough to not cost effort and problems in the future.


> 
> diff --git a/send-utils.c b/send-utils.c
> index 874f8a5..1772d2c 100644
> --- a/send-utils.c
> +++ b/send-utils.c
> @@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 root_id,
>   return 0;
>  }
>  
> +static struct rb_node *tree_insert(struct rb_root *root,
> +struct subvol_info *si,
> +enum subvol_search_type type)
> +{
> + struct rb_node **p = &root->rb_node;
> + struct rb_node *parent = NULL;
> + struct subvol_info *entry;
> + __s64 comp;
> +
> + while (*p) {
> + parent = *p;
> + if (type == subvol_search_by_received_uuid) {
> + entry = rb_entry(parent, struct subvol_info,
> + rb_received_node);
> +
> + comp = memcmp(entry->received_uuid, si->received_uuid,
> + BTRFS_UUID_SIZE);
> + if (!comp) {
> + if (entry->stransid < si->stransid)
> + comp = -1;
> + else if (entry->stransid > si->stransid)
> + comp = 1;
> + else
> + comp = 0;
> + }
> + } else if (type == subvol_search_by_uuid) {
> + entry = rb_entry(parent, struct subvol_info,
> + rb_local_node);
> + comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE);
> + } else if (type == subvol_search_by_root_id) {
> + entry = rb_entry(parent, struct subvol_info,
> + rb_root_id_node);
> + comp = entry->root_id - si->root_id;
> + } else if (type == subvol_search_by_path) {
> + entry = rb_entry(parent, struct subvol_info,
> + rb_path_node);
> + comp = strcmp(entry->path, si->path);
> + } else {
> + BUG();
> + }
> +
> + if (comp < 0)
> + p = &(*p)->rb_left;
> + else if (comp > 0)
> + p = &(*p)->rb_right;
> + else
> + return parent;
> + }
> +
> + if (type == subvol_search_by_received_uuid) {
> + rb_link_node(&si->rb_received_node, parent, p);
> + rb_insert_color(&si->rb_received_node, root);
> + } else if (type == subvol_search_by_uuid) {
> + rb_link_node(&si->rb_local_node, parent, p);
> + rb_insert_color(&si->rb_local_node, root);
> + } else if (type == subvol_search_by_root_id) {
> + rb_link_node(&si->rb_root_id_node, parent, p);
> + rb_insert_color(&si->rb_root_id_node, root);
> + } else if (type == subvol_search_by_path) {
> + rb_link_node(&si->rb_path_node, parent, p);
> + rb_insert_color(&si->rb_path_node, root);
> + }
> + return NULL;
> +}
> +
>  int btrfs_subvolid_resolve(int fd, char *path, size_t path_len, u64 
> subvol_id)
>  {
>   if (path_len < 1)
> @@ -255,13 +320,101 @@ static int btrfs_subvolid_resolve_sub(int fd, char 
> *path, size_t *path_len,
>   return 0;
>  }
>  
> +static int count_bytes(void *buf, int len, char b)
> +{
> + int cnt = 0;
> + int i;
> +
> + for (i = 0; i < len; i++) {
> + if (((char *)buf)[i] == b)
> + cnt++;
> + }
> + return cnt;
> +}
> +
>  void subvol_uuid_search_add(struct subvol_uuid_search *s,
>   struct subvol_info *si)
>  {
> - if (si) {
> - free(si->path);
> - free(si);
> + 

Re: Problems with incremental send/receive

2014-01-09 Thread Felix Blanke
Hi Wang,

thank you for your answer.

I am using the latest btrfs-progs with the 3.12 kernel. I don't have
access to the machine right now (it looks like it crashed :/) but I
can send the exact versions when I'm home.

Regards,
Felix

On Thu, Jan 9, 2014 at 3:10 AM, Wang Shilong  wrote:
> Hi Felix,
>
> It seems some reported this problem before. The problem for your below case
> is because you use latest btrfs-progs(v3.12?), which will need kernel
> update,
> kernel 3.12 is ok.
>
> However, i think btrfs-progs should keep compatibility, i will send a patch
> to
> make things more friendly.
>
> Thanks,
> Wang
>
> On 01/09/2014 06:04 AM, Felix Blanke wrote:
>>
>> Hi List,
>>
>> My backup stopped working and I can't figure out why. I'm using
>> send/receive with the "-p" switch for incremental backups using the
>> last snapshot as a parent snapshot for sending only the changed data.
>>
>> The problem occurs using my own backup script. After I discovered the
>> problem I did a quick test using the exact commands from the wiki with
>> the same result: It doesn't work. Here is the output:
>>
>>
>> server ~ # ./test_snapshot.sh
>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/
>> /mnt/root1/snapshots/test
>> Create a readonly snapshot of '/mnt/root1/@root_home/' in
>> '/mnt/root1/snapshots/test'
>> ++ sync
>> ++ btrfs send /mnt/root1/snapshots/test
>> ++ btrfs receive /mnt/backup1/
>> At subvol /mnt/root1/snapshots/test
>> At subvol test
>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/
>> /mnt/root1/snapshots/test_new
>> Create a readonly snapshot of '/mnt/root1/@root_home/' in
>> '/mnt/root1/snapshots/test_new'
>> ++ sync
>> ++ btrfs send -p /mnt/root1/snapshots/test /mnt/root1/snapshots/test_new
>> ++ btrfs receive /mnt/backup1/
>> At subvol /mnt/root1/snapshots/test_new
>> At snapshot test_new
>> ERROR: open @/test failed. No such file or directory
>>
>> I don't get where the "@/" in front of the snapshot name comes from.
>> It could be that I had a subvolume named @, but this doesn't exists
>> anymore and I don't understand why this would be important for the
>> send/receive.
>>
>> Some more details about the fs:
>>
>> server ~ # btrfs subvol list /mnt/root1/
>> ID 259 gen 568053 top level 5 path @root
>> ID 261 gen 568053 top level 5 path @var
>> ID 263 gen 568049 top level 5 path @home
>> ID 302 gen 568053 top level 5 path @owncloud_chroot
>> ID 421 gen 568038 top level 5 path @root_home
>> ID 30560 gen 563661 top level 5 path snapshots/home_2014-01-06-19:33_d
>> ID 30561 gen 563665 top level 5 path
>> snapshots/owncloud_chroot_2014-01-06-19:34_d
>> ID 30562 gen 563674 top level 5 path
>> snapshots/root_home_2014-01-06-19:38_d
>> ID 30563 gen 563675 top level 5 path snapshots/var_2014-01-06-19:39_d
>> ID 30564 gen 563697 top level 5 path snapshots/root_2014-01-06-19:50_d
>>
>> server ~ # btrfs subvol get-default /mnt/root1/
>> ID 5 (FS_TREE)
>>
>> server ~ # ls -l /mnt/root1/
>> total 0
>> drwxr-xr-x. 1 root root  30 May 10  2013 @home
>> drwxr-xr-x. 1 root root 134 Jan  5 19:27 @owncloud_chroot
>> drwxr-xr-x. 1 root root 204 Nov 24 18:16 @root
>> drwx--. 1 root root 468 Jan  8 22:47 @root_home
>> drwxr-xr-x. 1 root root 114 Oct  7 17:39 @var
>> drwx--. 1 root root 420 Jan  8 22:50 snapshots
>>
>>
>> Any ideas? Thanks in advance.
>>
>>
>> Regards,
>> Felix
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs-progs: make send/receive compatible with older kernels

2014-01-09 Thread Wang Shilong
Some users complained that with the latest btrfs-progs, they
fail to use send/receive. The problem is that the new tool will try
to use the uuid tree, which doesn't work on older kernels.

Now we first check if we support the uuid tree; if not, we fall back to
the normal search as before. I copied most of the code from Alexander
Block's previous code and did some adjustments to make it work.

Signed-off-by: Alexander Block 
Signed-off-by: Wang Shilong 
---
 send-utils.c | 352 ++-
 send-utils.h |  11 ++
 2 files changed, 359 insertions(+), 4 deletions(-)

diff --git a/send-utils.c b/send-utils.c
index 874f8a5..1772d2c 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 root_id,
return 0;
 }
 
+static struct rb_node *tree_insert(struct rb_root *root,
+  struct subvol_info *si,
+  enum subvol_search_type type)
+{
+   struct rb_node **p = &root->rb_node;
+   struct rb_node *parent = NULL;
+   struct subvol_info *entry;
+   __s64 comp;
+
+   while (*p) {
+   parent = *p;
+   if (type == subvol_search_by_received_uuid) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_received_node);
+
+   comp = memcmp(entry->received_uuid, si->received_uuid,
+   BTRFS_UUID_SIZE);
+   if (!comp) {
+   if (entry->stransid < si->stransid)
+   comp = -1;
+   else if (entry->stransid > si->stransid)
+   comp = 1;
+   else
+   comp = 0;
+   }
+   } else if (type == subvol_search_by_uuid) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_local_node);
+   comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE);
+   } else if (type == subvol_search_by_root_id) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_root_id_node);
+   comp = entry->root_id - si->root_id;
+   } else if (type == subvol_search_by_path) {
+   entry = rb_entry(parent, struct subvol_info,
+   rb_path_node);
+   comp = strcmp(entry->path, si->path);
+   } else {
+   BUG();
+   }
+
+   if (comp < 0)
+   p = &(*p)->rb_left;
+   else if (comp > 0)
+   p = &(*p)->rb_right;
+   else
+   return parent;
+   }
+
+   if (type == subvol_search_by_received_uuid) {
+   rb_link_node(&si->rb_received_node, parent, p);
+   rb_insert_color(&si->rb_received_node, root);
+   } else if (type == subvol_search_by_uuid) {
+   rb_link_node(&si->rb_local_node, parent, p);
+   rb_insert_color(&si->rb_local_node, root);
+   } else if (type == subvol_search_by_root_id) {
+   rb_link_node(&si->rb_root_id_node, parent, p);
+   rb_insert_color(&si->rb_root_id_node, root);
+   } else if (type == subvol_search_by_path) {
+   rb_link_node(&si->rb_path_node, parent, p);
+   rb_insert_color(&si->rb_path_node, root);
+   }
+   return NULL;
+}
+
 int btrfs_subvolid_resolve(int fd, char *path, size_t path_len, u64 subvol_id)
 {
if (path_len < 1)
@@ -255,13 +320,101 @@ static int btrfs_subvolid_resolve_sub(int fd, char 
*path, size_t *path_len,
return 0;
 }
 
+static int count_bytes(void *buf, int len, char b)
+{
+   int cnt = 0;
+   int i;
+
+   for (i = 0; i < len; i++) {
+   if (((char *)buf)[i] == b)
+   cnt++;
+   }
+   return cnt;
+}
+
 void subvol_uuid_search_add(struct subvol_uuid_search *s,
struct subvol_info *si)
 {
-   if (si) {
-   free(si->path);
-   free(si);
+   int cnt;
+
+   tree_insert(&s->root_id_subvols, si, subvol_search_by_root_id);
+   tree_insert(&s->path_subvols, si, subvol_search_by_path);
+
+   cnt = count_bytes(si->uuid, BTRFS_UUID_SIZE, 0);
+   if (cnt != BTRFS_UUID_SIZE)
+   tree_insert(&s->local_subvols, si, subvol_search_by_uuid);
+   cnt = count_bytes(si->received_uuid, BTRFS_UUID_SIZE, 0);
+   if (cnt != BTRFS_UUID_SIZE)
+   tree_insert(&s->received_subvols, si,
+   subvol_search_by_received_uuid);
+}
+
+static struct subvol_info *tree_search(struct rb_root *root,
+  

Re: How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Hugo Mills
On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
> Hi,
> 
> I am running write-intensive (well sort of, one write every 10s)
> workloads on cheap flash media which proved to be horribly unreliable.
> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
> pen drive returns bogus data without any warning at all.
> 
> So I wonder, how would btrfs behave in raid1 on two such devices?
> Would it simply mark bad blocks as "bad" and continue to be
> operational, or will it bail out when some block can not be
> read/written anymore on one of the two devices?

   If a block is read and fails its checksum, then the other copy (in
RAID-1) is checked and used if it's good. The bad copy is rewritten to
use the good data.

   If the block is bad such that writing to it won't fix it, then
there are probably two cases: the device returns an IO error, in which
case I suspect (but can't be sure) that the FS will go read-only. Or
the device silently fails the write and claims success, in which case
you're back to the situation above of the block failing its checksum.

   There's no marking of bad blocks right now, and I don't know of
anyone working on the feature, so the FS will probably keep going back
to the bad blocks as it makes CoW copies for modification.
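
   In practice -- mount point /mnt hypothetical -- you can trigger
that read-verify-rewrite pass over the whole filesystem, and then
inspect the per-device error counters, with something like:

# btrfs scrub start -B /mnt
# btrfs device stats /mnt

   Scrub reads every copy, verifies checksums, and rewrites bad
copies from the good mirror -- the same repair path as an ordinary
read that fails its checksum.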

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- Trouble rather the tiger in his lair than the sage amongst ---
his books for to you kingdoms and their armies are mighty
and enduring,  but to him they are but toys of the moment
  to be overturned by the flicking of a finger.  




How does btrfs handle bad blocks in raid1?

2014-01-09 Thread Clemens Eisserer
Hi,

I am running write-intensive (well sort of, one write every 10s)
workloads on cheap flash media which proved to be horribly unreliable.
A 32GB microSDHC card reported bad blocks after 4 days, while a usb
pen drive returns bogus data without any warning at all.

So I wonder, how would btrfs behave in raid1 on two such devices?
Would it simply mark bad blocks as "bad" and continue to be
operational, or will it bail out when some block can not be
read/written anymore on one of the two devices?

Thank you in advance, Clemens
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: REQ: btrfs list option

2014-01-09 Thread Alex
Chris Murphy  colorremedies.com> writes:

> > Hmm, actually you might have found a bug.
> > 
> > Small typo while we're at it, below should have one l.
> 
> kernel-3.13.0-0.rc6.git0.1.fc21.x86_64
> btrfs-progs-3.12-1.fc20.x86_64
> 
> Chris Murphy
> 

Thank you muchly!
I'm kinda glad because I didn't really understand your second response in
context. ;-) I would have been stuck had you not updated it!




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html