Re: Problems with incremental send/receive
Hi Wang, here is the version information:

server ~ # btrfs version
Btrfs v3.12-dirty
server ~ # uname -a
Linux server.home 3.12.6-hardened-r3 #1 SMP Thu Jan 2 13:16:48 CET 2014 x86_64 Intel(R) Celeron(R) CPU G1610 @ 2.60GHz GenuineIntel GNU/Linux

This should work, if I understood you correctly?

Regards,
Felix

On Thu, Jan 9, 2014 at 12:36 PM, Felix Blanke wrote:
> Hi Wang,
>
> thank you for your answer.
>
> I am using the latest btrfs-progs with the 3.12 kernel. I don't have
> access to the machine right now (it looks like it crashed :/) but I
> can send the exact versions when I'm home.
>
> Regards,
> Felix
>
> On Thu, Jan 9, 2014 at 3:10 AM, Wang Shilong wrote:
>> Hi Felix,
>>
>> It seems someone reported this problem before. The problem in your case
>> below is that you are using the latest btrfs-progs (v3.12?), which needs
>> a kernel update; kernel 3.12 is OK.
>>
>> However, I think btrfs-progs should keep compatibility, so I will send a
>> patch to make things more friendly.
>>
>> Thanks,
>> Wang
>>
>> On 01/09/2014 06:04 AM, Felix Blanke wrote:
>>>
>>> Hi List,
>>>
>>> My backup stopped working and I can't figure out why. I'm using
>>> send/receive with the "-p" switch for incremental backups, using the
>>> last snapshot as a parent snapshot so that only the changed data is
>>> sent.
>>>
>>> The problem occurs with my own backup script. After I discovered the
>>> problem I did a quick test using the exact commands from the wiki,
>>> with the same result: it doesn't work. Here is the output:
>>>
>>> server ~ # ./test_snapshot.sh
>>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/ /mnt/root1/snapshots/test
>>> Create a readonly snapshot of '/mnt/root1/@root_home/' in '/mnt/root1/snapshots/test'
>>> ++ sync
>>> ++ btrfs send /mnt/root1/snapshots/test
>>> ++ btrfs receive /mnt/backup1/
>>> At subvol /mnt/root1/snapshots/test
>>> At subvol test
>>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/ /mnt/root1/snapshots/test_new
>>> Create a readonly snapshot of '/mnt/root1/@root_home/' in '/mnt/root1/snapshots/test_new'
>>> ++ sync
>>> ++ btrfs send -p /mnt/root1/snapshots/test /mnt/root1/snapshots/test_new
>>> ++ btrfs receive /mnt/backup1/
>>> At subvol /mnt/root1/snapshots/test_new
>>> At snapshot test_new
>>> ERROR: open @/test failed. No such file or directory
>>>
>>> I don't get where the "@/" in front of the snapshot name comes from.
>>> It could be that I once had a subvolume named @, but it doesn't exist
>>> anymore and I don't understand why it would matter for send/receive.
>>>
>>> Some more details about the fs:
>>>
>>> server ~ # btrfs subvol list /mnt/root1/
>>> ID 259 gen 568053 top level 5 path @root
>>> ID 261 gen 568053 top level 5 path @var
>>> ID 263 gen 568049 top level 5 path @home
>>> ID 302 gen 568053 top level 5 path @owncloud_chroot
>>> ID 421 gen 568038 top level 5 path @root_home
>>> ID 30560 gen 563661 top level 5 path snapshots/home_2014-01-06-19:33_d
>>> ID 30561 gen 563665 top level 5 path snapshots/owncloud_chroot_2014-01-06-19:34_d
>>> ID 30562 gen 563674 top level 5 path snapshots/root_home_2014-01-06-19:38_d
>>> ID 30563 gen 563675 top level 5 path snapshots/var_2014-01-06-19:39_d
>>> ID 30564 gen 563697 top level 5 path snapshots/root_2014-01-06-19:50_d
>>>
>>> server ~ # btrfs subvol get-default /mnt/root1/
>>> ID 5 (FS_TREE)
>>>
>>> server ~ # ls -l /mnt/root1/
>>> total 0
>>> drwxr-xr-x. 1 root root  30 May 10  2013 @home
>>> drwxr-xr-x. 1 root root 134 Jan  5 19:27 @owncloud_chroot
>>> drwxr-xr-x. 1 root root 204 Nov 24 18:16 @root
>>> drwx------. 1 root root 468 Jan  8 22:47 @root_home
>>> drwxr-xr-x. 1 root root 114 Oct  7 17:39 @var
>>> drwx------. 1 root root 420 Jan  8 22:50 snapshots
>>>
>>> Any ideas? Thanks in advance.
>>>
>>> Regards,
>>> Felix
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
backpointer mismatch
Hi,

I am using btrfs for my backup RAID. This had been running well for about
a year. Recently I decided to upgrade the backup server to openSUSE 13.1.
I checked all filesystems before the upgrade and everything was clean. I
made several attempts at upgrading the system, but all failed (the
installation of some rpm would hang indefinitely). So I aborted the
installation and reverted the system back to openSUSE 12.3 (with a
custom-installed 3.9.7 kernel). Unfortunately, after this the backup RAID
reported lots of errors. When I run btrfsck on the filesystem, I get
around 1.3M of these messages:

Extent back ref already exists for 1116254208 parent 11145490432 root 0

and around 1.2M of these:

ref mismatch on [90670907392 4096] extent item 11, found 12
Incorrect global backref count on 90670907392 found 11 wanted 12
backpointer mismatch on [90670907392 4096]

Filtering these out, this is the remaining output:

checking extents
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
checking csums
checking root refs
Checking filesystem on /dev/md2
UUID: 0b6a9d0d-e501-4a23-9d09-259b1f5b5652
found 2213988384746 bytes used err is 0
total csum bytes: 3185850148
total tree bytes: 42770862080
total fs tree bytes: 36787625984
total extent tree bytes: 1643925504
btree space waste bytes: 12475940633
file data blocks allocated: 5269432860672
 referenced 5254870626304
Btrfs v3.12+20131125

(This version of btrfsck comes from openSUSE Factory.) I also ran btrfs
scrub on the filesystem. This uncovered 4 checksum errors which I could
repair manually. I do not know if that is related to the problem above;
at least it didn't solve it...

The btrfs filesystem is installed on top of an mdadm RAID5. How worried
should I be about the reported errors? What confuses me is that in the
end btrfsck reports an error count of 0. Should I try to repair this? I
have had bad experiences in the past with "btrfsck --repair", but that
was with a much older version... I can of course recreate the backups,
but this would take a long time and I would lose my entire snapshot
history, which I would rather avoid...

Cheers,

Peter.

--
Peter van Hoof
Royal Observatory of Belgium
Ringlaan 3
1180 Brussel
Belgium
http://homepage.oma.be/pvh
Re: How does btrfs handle bad blocks in raid1?
On 01/09/2014 05:06 PM, Jim Salter wrote:
> On Jan 9, 2014 7:46 PM, George Mitchell wrote:
>> I would prefer that the drive, even the flash media type, would catch
>> and resolve write failures. If it doesn't happen at the hardware layer
>> then, according to how I understand Hugo's answer, btrfs, at least for
>> now, is not capable of it.
>
> Not sure what you mean by this. If a bit flips on a btrfs-raid1 block,
> btrfs will detect it. Then it checks the mirror's copy of that block.
> It returns the good copy, then immediately writes the good copy over
> the bad copy.
>
> I know this because I tested it directly just last week, by flipping a
> bit in an offline btrfs filesystem manually. When I brought the volume
> back online and read the file containing the bit I flipped, it behaved
> exactly as described, and logged its actions in kern.log. :-)

Jim, my point was that IF the drive does not successfully resolve the bad
block issue, and btrfs takes a write failure every time it attempts to
overwrite the bad data, it is not going to remap that data; rather, it is
going to fail the drive. In other words, if the drive has a bad sector
which it has not done anything about at the drive level, btrfs will not
remap the sector. It will, rather, fail the drive. Is that not correct?
Re: How does btrfs handle bad blocks in raid1?
Hello Clemens,

On 01/09/2014 04:08 PM, Clemens Eisserer wrote:
> Hi George,
>
>> I really suspect a lot of bad block issues can be avoided by monitoring
>> SMART data. SMART is working very well for me with btrfs formatted
>> drives. SMART will detect when sectors silently fail and, as those
>> failures accumulate, SMART will warn in an obvious way that the drive
>> in question is at end of life. So I think the whole bad block issue
>> should ideally be handled at a lower level than the filesystem with
>> modern hard drives.
>
> At least my original request was about cheap flash media, where you
> don't have the luxury of being able to "trust" the hardware to behave
> properly. In fact, it might be beneficial for an SD card not to report
> ECC errors - most likely the user won't notice a small glitch when
> playing back music, but he definitely will notice when the smartphone
> reports read errors and stops playback, which would get that card RMAed.
>
> Also, wouldn't your argument also be valid for checksums - why checksum
> in software, when in theory the drive + controllers should do it
> anyway ;)
>
> Regards, Clemens

It would certainly be a vast improvement if flash media had some of the
sanity-checking capability that conventional media has, but to say that
these sorts of problems with flash media are legendary would almost be an
understatement. As for checksums, I view them more as a tool to detect
data decay as opposed to checking for failed writes. Of course, that data
decay might well result in failed writes when btrfs scrub tries to
correct it. At that point I would prefer that the drive, even the flash
media type, catch and resolve write failures. If it doesn't happen at the
hardware layer then, according to how I understand Hugo's answer, btrfs,
at least for now, is not capable of it. I believe it is true that
filesystems have historically done bad-block handling, but I do think it
is moving now to the hardware layer, which is probably the best place for
it to be, and the flash drive industry needs to solve this problem at the
hardware/firmware level. That is my opinion anyway.
Re: How does btrfs handle bad blocks in raid1?
Hi George,

> I really suspect a lot of bad block issues can be avoided by monitoring
> SMART data. SMART is working very well for me with btrfs formatted
> drives. SMART will detect when sectors silently fail and, as those
> failures accumulate, SMART will warn in an obvious way that the drive
> in question is at end of life. So I think the whole bad block issue
> should ideally be handled at a lower level than the filesystem with
> modern hard drives.

At least my original request was about cheap flash media, where you don't
have the luxury of being able to "trust" the hardware to behave properly.
In fact, it might be beneficial for an SD card not to report ECC errors -
most likely the user won't notice a small glitch when playing back music,
but he definitely will notice when the smartphone reports read errors and
stops playback, which would get that card RMAed.

Also, wouldn't your argument also be valid for checksums - why checksum
in software, when in theory the drive + controllers should do it anyway ;)

Regards, Clemens
BTRFS_SEARCH_ARGS_BUFSIZE too small
Hello,

I'm playing around with BTRFS_IOC_TREE_SEARCH to extract the csums of the
physical blocks. During the tests some item headers had len = 0, which
indicates the buffer was too small to hold the item. I added a printk into
the kernel to get the original size of the item, and it was around 6600
bytes.

Is there another way to get the item? Otherwise I would suggest creating
an ioctl which is a little bit more flexible, something like:

struct btrfs_ioctl_search_args2 {
	struct btrfs_ioctl_search_key key;
	__u64 buf_len;
	char buf[0];
};

Gerhard
Re: FILE_EXTENT_SAME changes mtime and ctime
2014/1/6 David Sterba :
> On Mon, Jan 06, 2014 at 12:02:51AM +0100, Gerhard Heift wrote:
>> I am currently playing with snapshots and manual deduplication of
>> files. During these tests I noticed the change of ctime and mtime in
>> the snapshot after the deduplication with FILE_EXTENT_SAME. Does this
>> happen on purpose? Otherwise I would like to have ctime and mtime
>> left unmodified, because on a read-only snapshot I cannot change them
>> back after the ioctl call.
>
> I'm not sure what's the correct behaviour wrt timestamps and extent
> cloning. The inode metadata are modified in some way, but the stat data
> and actual contents are left unchanged, so the timestamps do not reflect
> that something changed according to their definition (stat(2)).
>
> On the other hand, the differences can be seen in the extent listing:
> the physical offset of the blocks will change. I'm not aware of any
> tools that would become broken by breaking this assumption. Also, the
> (partial) cloning functionality is not implemented anywhere, so we could
> have a look and try to stay consistent with that.
>
> My opinion is to drop the mtime/iversion updates completely.

In my opinion, we should never update them if we dedup the content of
files with FILE_EXTENT_SAME. If we clone with CLONE(_RANGE), the mtime
should be updated, because it's like a write operation.

The semantics of ctime are not completely clear to me. It should change
if the "visible" metadata of a file changes. In these cases, should it be
updated if we write to the file, because mtime changes, or only if the
size of the file changes?

> david

Gerhard
Re: How does btrfs handle bad blocks in raid1?
I really suspect a lot of bad block issues can be avoided by monitoring
SMART data. SMART is working very well for me with btrfs formatted
drives. SMART will detect when sectors silently fail and, as those
failures accumulate, SMART will warn in an obvious way that the drive in
question is at end of life. So I think the whole bad block issue should
ideally be handled at a lower level than the filesystem with modern hard
drives.
Help repairing corrupt btrfs -- btrfsck --repair doesn't change anything
I have a btrfs partition with what *sounds* like minor damage; btrfsck
--repair prints:

| enabling repair mode
| Checking filesystem on /dev/sda
| UUID: ec93d2c2-7937-40f8-aaa6-c20c9775d93a
| checking extents
| checking free space cache
| cache and super generation don't match, space cache will be invalidated
| checking fs roots
| root 258 inode 4493802 errors 400, nbytes wrong
| root 258 inode 4509858 errors 400, nbytes wrong
| root 258 inode 4510014 errors 400, nbytes wrong
| root 258 inode 4838894 errors 400, nbytes wrong
| root 258 inode 4838895 errors 400, nbytes wrong
| found 41852229430 bytes used err is 1
| total csum bytes: 619630328
| total tree bytes: 3216027648
| total fs tree bytes: 2342981632
| total extent tree bytes: 135536640
| btree space waste bytes: 767795634
| file data blocks allocated: 1744289230848
|  referenced 631766474752
| Btrfs v3.12

The trouble is, these errors do not go away if I run btrfsck --repair a
second time, which implies that the problems have not actually been
corrected. As the corruption has already caused one kernel oops (see
https://bugzilla.kernel.org/show_bug.cgi?id=68411) I am reluctant to
remount the file system until I am sure no corruption remains. I did try
mounting it (and immediately unmounting it again) and that did seem to do
some sort of additional check ("checking UUID tree"), but a subsequent
btrfsck run still prints the same errors.

Any advice on how to fix this so it stays fixed would be appreciated.

zw
[PATCH] btrfs-progs: skip non-regular files while defragmenting
Skip non-regular files to avoid ioctl errors while defragmenting. They
are silently ignored in recursive mode but reported as errors when used
as command-line arguments.

Signed-off-by: Pascal VITOUX
---
 cmds-filesystem.c | 26 ++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 1c1926b..54fba10 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -646,7 +646,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
 	int e = 0;
 	int fd = 0;
 
-	if (typeflag == FTW_F) {
+	if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
 		if (defrag_global_verbose)
 			printf("%s\n", fpath);
 		fd = open(fpath, O_RDWR);
@@ -748,6 +748,7 @@ static int cmd_defrag(int argc, char **argv)
 	defrag_global_range.flags |= BTRFS_DEFRAG_RANGE_START_IO;
 
 	for (i = optind; i < argc; i++) {
+		struct stat st;
 		dirstream = NULL;
 		fd = open_file_or_dir(argv[i], &dirstream);
 		if (fd < 0) {
@@ -757,16 +758,21 @@ static int cmd_defrag(int argc, char **argv)
 			close_file_or_dir(fd, dirstream);
 			continue;
 		}
+		if (fstat(fd, &st)) {
+			fprintf(stderr, "ERROR: failed to stat %s - %s\n",
+					argv[i], strerror(errno));
+			defrag_global_errors++;
+			close_file_or_dir(fd, dirstream);
+			continue;
+		}
+		if (!(S_ISDIR(st.st_mode) || S_ISREG(st.st_mode))) {
+			fprintf(stderr, "ERROR: %s is not a directory or a regular "
+					"file.\n", argv[i]);
+			defrag_global_errors++;
+			close_file_or_dir(fd, dirstream);
+			continue;
+		}
 		if (recursive) {
-			struct stat st;
-
-			if (fstat(fd, &st)) {
-				fprintf(stderr, "ERROR: failed to stat %s - %s\n",
-						argv[i], strerror(errno));
-				defrag_global_errors++;
-				close_file_or_dir(fd, dirstream);
-				continue;
-			}
 			if (S_ISDIR(st.st_mode)) {
 				ret = nftw(argv[i], defrag_callback, 10,
 						FTW_MOUNT | FTW_PHYS);
-- 
1.8.5.2
[PATCH] Btrfs: setup inode location during btrfs_init_inode_locked
We have a race during inode init because the BTRFS_I(inode)->location is
set up after the inode hash table lock is dropped. btrfs_find_actor uses
the location field, so our search might not find an existing inode in the
hash table if we race with the inode init code.

This commit changes things to set up the location field sooner. Also the
find actor now uses only the location objectid to match inodes. For inode
hashing, we just need a unique and stable test, it doesn't have to
reflect the inode numbers we show to userland.

Signed-off-by: Chris Mason

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9eaa1c8..8010b49 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,7 +60,7 @@
 #include "hash.h"
 
 struct btrfs_iget_args {
-	u64 ino;
+	struct btrfs_key *location;
 	struct btrfs_root *root;
 };
 
@@ -4932,7 +4932,9 @@ again:
 static int btrfs_init_locked_inode(struct inode *inode, void *p)
 {
 	struct btrfs_iget_args *args = p;
-	inode->i_ino = args->ino;
+	inode->i_ino = args->location->objectid;
+	memcpy(&BTRFS_I(inode)->location, args->location,
+	       sizeof(*args->location));
 	BTRFS_I(inode)->root = args->root;
 	return 0;
 }
@@ -4940,19 +4942,19 @@ static int btrfs_init_locked_inode(struct inode *inode, void *p)
 static int btrfs_find_actor(struct inode *inode, void *opaque)
 {
 	struct btrfs_iget_args *args = opaque;
-	return args->ino == btrfs_ino(inode) &&
+	return args->location->objectid == BTRFS_I(inode)->location.objectid &&
 		args->root == BTRFS_I(inode)->root;
 }
 
 static struct inode *btrfs_iget_locked(struct super_block *s,
-				       u64 objectid,
+				       struct btrfs_key *location,
 				       struct btrfs_root *root)
 {
 	struct inode *inode;
 	struct btrfs_iget_args args;
-	unsigned long hashval = btrfs_inode_hash(objectid, root);
+	unsigned long hashval = btrfs_inode_hash(location->objectid, root);
 
-	args.ino = objectid;
+	args.location = location;
 	args.root = root;
 
 	inode = iget5_locked(s, hashval, btrfs_find_actor,
@@ -4969,13 +4971,11 @@ struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 {
 	struct inode *inode;
 
-	inode = btrfs_iget_locked(s, location->objectid, root);
+	inode = btrfs_iget_locked(s, location, root);
 	if (!inode)
 		return ERR_PTR(-ENOMEM);
 
 	if (inode->i_state & I_NEW) {
-		BTRFS_I(inode)->root = root;
-		memcpy(&BTRFS_I(inode)->location, location, sizeof(*location));
 		btrfs_read_locked_inode(inode);
 		if (!is_bad_inode(inode)) {
 			inode_tree_add(inode);
Re: How does btrfs handle bad blocks in raid1?
On Jan 9, 2014, at 12:13 PM, Kyle Gates wrote:

> On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>>
>> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>>
>>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>>> Hi,
>>>>
>>>> I am running write-intensive (well sort of, one write every 10s)
>>>> workloads on cheap flash media which proved to be horribly unreliable.
>>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>>>> pen drive returns bogus data without any warning at all.
>>>>
>>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>>> Would it simply mark bad blocks as "bad" and continue to be
>>>> operational, or will it bail out when some block can not be
>>>> read/written anymore on one of the two devices?
>>>
>>> If a block is read and fails its checksum, then the other copy (in
>>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>>> use the good data.
>>>
>>> If the block is bad such that writing to it won't fix it, then
>>> there's probably two cases: the device returns an IO error, in which
>>> case I suspect (but can't be sure) that the FS will go read-only. Or
>>> the device silently fails the write and claims success, in which case
>>> you're back to the situation above of the block failing its checksum.
>>
>> In a normally operating drive, when the drive firmware locates a
>> physical sector with persistent write failures, it's dereferenced. So
>> the LBA points to a reserve physical sector, and the original can't be
>> accessed by LBA. If all of the reserve sectors get used up, the next
>> persistent write failure will result in a write error reported to
>> libata; this will appear in dmesg and should be treated as the drive
>> being no longer in normal operation. It's a drive useful for storage
>> developers, but not for production usage.
>>
>>> There's no marking of bad blocks right now, and I don't know of
>>> anyone working on the feature, so the FS will probably keep going back
>>> to the bad blocks as it makes CoW copies for modification.
>>
>> This is maybe relevant:
>> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>>
>> "READ and WRITE commands report CHS or LBA of the first failed sector
>> but ATA/ATAPI standard specifies that the amount of transferred data on
>> error completion is indeterminate, so we cannot assume that sectors
>> preceding the failed sector have been transferred and thus cannot
>> complete those sectors successfully as SCSI does."
>>
>> If I understand that correctly, Btrfs really ought to either punt the
>> device, or make the whole volume read-only. For production use, going
>> read-only very well could mean data loss, even while preserving the
>> state of the file system. Eventually I'd rather see the offending
>> device ejected from the volume, and for the volume to remain
>> rw,degraded.
>
> I would like to see btrfs hold onto the device in a read-only state like
> is done during a device replace operation. New writes would maintain the
> raid level but go out to the remaining devices, and only go full
> filesystem read-only if the minimum number of writable devices is not
> met. Once a new device is added in, the replace operation could commence
> and drop the bad device when complete.

Sure, that's a fine optimization, for a bad device to be read-only while
the volume is still rw, if that's possible.

Chris Murphy
RE: How does btrfs handle bad blocks in raid1?
On Thu, 9 Jan 2014 11:40:20 -0700 Chris Murphy wrote:
>
> On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote:
>
>> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
>>> Hi,
>>>
>>> I am running write-intensive (well sort of, one write every 10s)
>>> workloads on cheap flash media which proved to be horribly unreliable.
>>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
>>> pen drive returns bogus data without any warning at all.
>>>
>>> So I wonder, how would btrfs behave in raid1 on two such devices?
>>> Would it simply mark bad blocks as "bad" and continue to be
>>> operational, or will it bail out when some block can not be
>>> read/written anymore on one of the two devices?
>>
>> If a block is read and fails its checksum, then the other copy (in
>> RAID-1) is checked and used if it's good. The bad copy is rewritten to
>> use the good data.
>>
>> If the block is bad such that writing to it won't fix it, then
>> there's probably two cases: the device returns an IO error, in which
>> case I suspect (but can't be sure) that the FS will go read-only. Or
>> the device silently fails the write and claims success, in which case
>> you're back to the situation above of the block failing its checksum.
>
> In a normally operating drive, when the drive firmware locates a
> physical sector with persistent write failures, it's dereferenced. So
> the LBA points to a reserve physical sector, and the original can't be
> accessed by LBA. If all of the reserve sectors get used up, the next
> persistent write failure will result in a write error reported to
> libata; this will appear in dmesg and should be treated as the drive
> being no longer in normal operation. It's a drive useful for storage
> developers, but not for production usage.
>
>> There's no marking of bad blocks right now, and I don't know of
>> anyone working on the feature, so the FS will probably keep going back
>> to the bad blocks as it makes CoW copies for modification.
>
> This is maybe relevant:
> https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html
>
> "READ and WRITE commands report CHS or LBA of the first failed sector
> but ATA/ATAPI standard specifies that the amount of transferred data on
> error completion is indeterminate, so we cannot assume that sectors
> preceding the failed sector have been transferred and thus cannot
> complete those sectors successfully as SCSI does."
>
> If I understand that correctly, Btrfs really ought to either punt the
> device, or make the whole volume read-only. For production use, going
> read-only very well could mean data loss, even while preserving the
> state of the file system. Eventually I'd rather see the offending device
> ejected from the volume, and for the volume to remain rw,degraded.

I would like to see btrfs hold onto the device in a read-only state like
is done during a device replace operation. New writes would maintain the
raid level but go out to the remaining devices, and only go full
filesystem read-only if the minimum number of writable devices is not
met. Once a new device is added in, the replace operation could commence
and drop the bad device when complete.
Re: How does btrfs handle bad blocks in raid1?
On Jan 9, 2014, at 11:22 AM, Austin S Hemmelgarn wrote:

> On 2014-01-09 13:08, Chris Murphy wrote:
>>
>> On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Having checksumming is good, and a second copy in case one fails the
>>> checksum is nice, but what if they BOTH do? I'd love to have the
>>> choice of (at least) three-way mirroring, as for me that seems the
>>> best practical hassle/cost vs. risk balance I could get, but it's not
>>> yet possible. =:^(
>>
>> I'm on the fence on n-way.
>>
>> HDDs get bigger at a faster rate than their performance improves, so
>> rebuild times keep getting higher. For cases where the data is really
>> important, backup-restore doesn't provide the necessary uptime, and
>> minimum single-drive performance is needed, it can make sense to want
>> three copies.
>>
>> But what's the probability of both drives in a mirrored raid set dying,
>> compared to something else in the storage stack dying? I think at 3
>> copies you've got other risks that the 3rd copy doesn't manage, like a
>> power supply, controller card, or logic board dying.
>
> The risk isn't as much both drives dying at the same time as one dying
> during a rebuild of the array, which is more and more likely as drives
> get bigger and bigger.

Understood. I'm considering a 2nd drive dying during rebuild (from a 1st
drive dying) as essentially simultaneous failures. And in the case of
raid10, the likelihood of the 2nd drive failure being the lonesome drive
in a mirrored set is statistically very low. The next drive to fail is
going to be some other drive in the array, which still has a mirror.

I'm not saying there's no value in n-way. I'm just saying that adding
more redundancy only addresses one particular vector for failure, which
is still probably less likely than losing a power supply or a controller,
or even user-induced data loss that ends up affecting all three copies
anyway. And yes, it's easier to just add drives and make 3 copies than it
is to set up a cluster. But that's the trade-off when using such
high-density drives that the rebuild times prompt consideration of adding
even more high-density drives to solve the problem.

Chris Murphy
Re: How does btrfs handle bad blocks in raid1?
Thanks Hugo,

Since:
-- I keep daily backups
-- all 4 devices are of the same size
I think I can test it (as soon as I have some time to spend on the
transition to BTRFS) and verify your assumptions (...and get my wish).

> If you have an even number of devices and all the devices are the
> same size, then:
>
> * the block group allocator will use all the devices each time
> * the amount of space on each device will always be the same
>
> If the sort in the allocator is stable and resolves ties in free space
> by using the device ID number, the above properties should guarantee
> that the allocation is stable, so each new block group will have the
> same functional chunk on the same device, and you get your wish.
>
> It's been a few months since I looked at that code, but I don't
> recall seeing anything directly contradictory to the above
> assumptions.
>
> Of course, if you have an odd number of devices, the allocator will
> omit a different device on each block group, and you lose the ability
> to survive (some) two-device failures. I suspect that the odds of
> surviving a two-device failure are still non-zero, but less than if
> you had an even number of devices. I'm not about to attempt an
> ab-initio computation of the probabilities, but it shouldn't be too
> hard to do either a monte-carlo simulation or a simple brute-force
> enumeration of the possibilities for a given configuration.
>
> Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
> PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
> --- My code is never released, it escapes from the ---
> git repo and kills a few beta testers on the way out
Re: How does btrfs handle bad blocks in raid1?
On Jan 9, 2014, at 3:42 AM, Hugo Mills wrote: > On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote: >> Hi, >> >> I am running write-intensive (well sort of, one write every 10s) >> workloads on cheap flash media which proved to be horribly unreliable. >> A 32GB microSDHC card reported bad blocks after 4 days, while a usb >> pen drive returns bogus data without any warning at all. >> >> So I wonder, how would btrfs behave in raid1 on two such devices? >> Would it simply mark bad blocks as "bad" and continue to be >> operational, or will it bail out when some block can not be >> read/written anymore on one of the two devices? > > If a block is read and fails its checksum, then the other copy (in > RAID-1) is checked and used if it's good. The bad copy is rewritten to > use the good data. > > If the block is bad such that writing to it won't fix it, then > there's probably two cases: the device returns an IO error, in which > case I suspect (but can't be sure) that the FS will go read-only. Or > the device silently fails the write and claims success, in which case > you're back to the situation above of the block failing its checksum. In a normally operating drive, when the drive firmware locates a physical sector with persistent write failures, that sector is dereferenced: its LBA is remapped to a reserve physical sector, and the original sector can no longer be reached by LBA. If all of the reserve sectors get used up, the next persistent write failure will result in a write error reported to libata; this will appear in dmesg, and should be treated as the drive no longer operating normally. At that point the drive is useful to storage developers, but not for production use. > There's no marking of bad blocks right now, and I don't know of > anyone working on the feature, so the FS will probably keep going back > to the bad blocks as it makes CoW copies for modification. 
This is maybe relevant: https://www.kernel.org/doc/htmldocs/libata/ataExceptions.html "READ and WRITE commands report CHS or LBA of the first failed sector but ATA/ATAPI standard specifies that the amount of transferred data on error completion is indeterminate, so we cannot assume that sectors preceding the failed sector have been transferred and thus cannot complete those sectors successfully as SCSI does." If I understand that correctly, Btrfs really ought to either punt the device, or make the whole volume read-only. For production use, going read-only could very well mean data loss, even while preserving the state of the file system. Eventually I'd rather see the offending device ejected from the volume, and for the volume to remain rw,degraded. Chris Murphy
Re: How does btrfs handle bad blocks in raid1?
On 2014-01-09 13:08, Chris Murphy wrote: > > On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.dun...@cox.net> wrote: >> Having checksumming is good, and a second >> copy in case one fails the checksum is nice, but what if they BOTH do? >> I'd love to have the choice of (at least) three-way-mirroring, as for me >> that seems the best practical hassle/cost vs. risk balance I could get, >> but it's not yet possible. =:^( > > I'm on the fence on n-way. > > HDDs get bigger at a faster rate than their performance improves, so rebuild > times keep getting higher. For cases where the data is really important, > backup-restore doesn't provide the necessary uptime, and minimum single drive > performance is needed, it can make sense to want three copies. > > But what's the probability of both drives in a mirrored raid set dying, > compared to something else in the storage stack dying? I think at 3 copies, > you've got other risks that the 3rd copy doesn't manage, like a power supply, > controller card, or logic board dying. > The risk isn't as much both drives dying at the same time as one dying during a rebuild of the array, which is more and more likely as drives get bigger and bigger.
Re: How does btrfs handle bad blocks in raid1?
On 2014-01-09 12:31, Chris Murphy wrote: > > On Jan 9, 2014, at 5:52 AM, Austin S Hemmelgarn > wrote: >> Just a thought, you might consider running btrfs on top of LVM in >> the interim, it isn't quite as efficient as btrfs by itself, but >> it does allow N-way mirroring (and the efficiency is much better >> now that they have switched to RAID1 as the default mirroring >> backend) > > The problem is that in case of mismatches, it's ambiguous which are > correct. > At the moment that is correct. I've been planning for some time now to write a patch so that the RAID1 implementation on more than 2 devices checks what the majority of the other devices say about the block, and then updates all of them with the majority. Barring a manufacturing defect or firmware bug, any group of three or more disks is statistically very unlikely to have a read error at the same place on each disk until they have accumulated enough bad sectors that they are totally unusable, so this would allow recovery in a non-degraded RAID1 array in most cases.
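The majority-vote repair Austin describes can be sketched as below. This is purely illustrative, not btrfs code: `majority_repair`, `NCOPIES`, and the toy block size are all invented for the example; real repair would operate on checksummed extents, not raw buffers.

```c
#include <string.h>

#define NCOPIES 3   /* three-way mirror: the minimum for a useful vote */
#define BLKSZ   4   /* toy block size for the example */

/*
 * Pick the copy that the majority of mirrors agree with and rewrite
 * the dissenting copies from it. Returns the index of the winning
 * copy, or -1 when no strict majority exists -- the ambiguous
 * two-copy case Chris points out.
 */
static int majority_repair(unsigned char copies[NCOPIES][BLKSZ])
{
    int best = -1, best_votes = 0;

    for (int i = 0; i < NCOPIES; i++) {
        int votes = 0;
        for (int j = 0; j < NCOPIES; j++)
            if (memcmp(copies[i], copies[j], BLKSZ) == 0)
                votes++;
        if (votes > best_votes) {
            best_votes = votes;
            best = i;
        }
    }

    if (2 * best_votes <= NCOPIES)
        return -1;                  /* no strict majority: cannot decide */

    for (int i = 0; i < NCOPIES; i++)
        if (i != best)
            memcpy(copies[i], copies[best], BLKSZ);  /* rewrite dissenters */
    return best;
}
```

With copies {AAAA, AAAA, ABAA} the third copy is rewritten to match the first two; with only two copies and a mismatch the vote is a tie and the function refuses to decide, which is exactly why two-way mirrors need checksums to break the ambiguity.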
Re: How does btrfs handle bad blocks in raid1?
On Jan 9, 2014, at 5:41 AM, Duncan <1i5t5.dun...@cox.net> wrote: > Having checksumming is good, and a second > copy in case one fails the checksum is nice, but what if they BOTH do? > I'd love to have the choice of (at least) three-way-mirroring, as for me > that seems the best practical hassle/cost vs. risk balance I could get, > but it's not yet possible. =:^( I'm on the fence on n-way. HDDs get bigger at a faster rate than their performance improves, so rebuild times keep getting higher. For cases where the data is really important, backup-restore doesn't provide the necessary uptime, and minimum single drive performance is needed, it can make sense to want three copies. But what's the probability of both drives in a mirrored raid set dying, compared to something else in the storage stack dying? I think at 3 copies, you've got other risks that the 3rd copy doesn't manage, like a power supply, controller card, or logic board dying. Chris Murphy
Re: How does btrfs handle bad blocks in raid1?
> How is a resilient 2 disk failure possible with four disk raid10?

         ______ RAID0 ______
        |                   |
      __|__ RAID1         __|__ RAID1
     |     |             |     |
     A     B             C     D

Losing A+C / A+D / B+C / B+D is survivable. Losing A+B or C+D is catastrophic. Sorry, it's my fault. In my urge to praise Duncan's promotion of n-way mirroring I created a misunderstanding.
re: Btrfs: convert printk to btrfs_ and fix BTRFS prefix
Hello Frank Holton, This is a semi-automatic email about new static checker warnings. The patch f2ee0bf65a1c: "Btrfs: convert printk to btrfs_ and fix BTRFS prefix" from Dec 20, 2013, leads to the following Smatch complaint: fs/btrfs/super.c:298 __btrfs_panic() error: we previously assumed 'fs_info' could be null (see line 294) fs/btrfs/super.c 293 errstr = btrfs_decode_error(errno); 294 if (fs_info && (fs_info->mount_opt & BTRFS_MOUNT_PANIC_ON_FATAL_ERROR)) ^^^ Existing check. 295 panic(KERN_CRIT "BTRFS panic (device %s) in %s:%d: %pV (errno=%d %s)\n", 296 s_id, function, line, &vaf, errno, errstr); 297 298 btrfs_crit(fs_info, "panic in %s:%d: %pV (errno=%d %s)", ^^^ Patch introduces new unchecked dereference inside btrfs_crit(). 299 function, line, &vaf, errno, errstr); 300 va_end(args); regards, dan carpenter
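The shape of the fix Smatch is asking for can be sketched in standalone code. This is not the actual btrfs patch; `fs_info_stub` and `format_crit` are hypothetical stand-ins for the kernel types, and the point is simply to keep honouring the NULL check that line 294 already performs, falling back to a message without the device name.

```c
#include <stdio.h>

/* Hypothetical stand-in for struct btrfs_fs_info; only the device
 * id string matters for this example. */
struct fs_info_stub {
    const char *s_id;
};

/*
 * Format a critical message, tolerating fs_info == NULL just as the
 * existing check on line 294 does for the panic path. Returns the
 * number of characters that snprintf would have written.
 */
static int format_crit(const struct fs_info_stub *fs_info,
                       char *buf, size_t len, const char *msg)
{
    if (fs_info)
        return snprintf(buf, len, "BTRFS critical (device %s): %s",
                        fs_info->s_id, msg);
    return snprintf(buf, len, "BTRFS critical: %s", msg);
}
```

The guarded branch means a caller in an early-mount error path, where no fs_info exists yet, still gets a log line instead of a NULL dereference.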
Re: How does btrfs handle bad blocks in raid1?
On Thu, Jan 09, 2014 at 06:34:23PM +0100, George Eleftheriou wrote: > > claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive > > an arbitrary 2-device failure is incorrect. > > Yes, you are right. I didn't mean "any 2 devices". I should have > added "from different mirrors" :) If you have an even number of devices and all the devices are the same size, then: * the block group allocator will use all the devices each time * the amount of space on each device will always be the same If the sort in the allocator is stable and resolves ties in free space by using the device ID number, the above properties should guarantee that the allocation is stable, so each new block group will have the same functional chunk on the same device, and you get your wish. It's been a few months since I looked at that code, but I don't recall seeing anything directly contradictory to the above assumptions. Of course, if you have an odd number of devices, the allocator will omit a different device on each block group, and you lose the ability to survive (some) two-device failures. I suspect that the odds of surviving a two-device failure are still non-zero, but less than if you had an even number of devices. I'm not about to attempt an ab-initio computation of the probabilities, but it shouldn't be too hard to do either a monte-carlo simulation or a simple brute-force enumeration of the possibilities for a given configuration. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- My code is never released, it escapes from the --- git repo and kills a few beta testers on the way out
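The brute-force enumeration Hugo mentions is tiny for the fixed-pairing four-disk case. A sketch (the helper names are invented for the example, and the fixed pairing is an assumption, since btrfs's allocator does not guarantee it):

```c
/*
 * Enumerate all two-device failures of a 4-disk RAID-10 with fixed
 * mirror pairs (A,B) and (C,D). A failure pair is fatal only when
 * both failed devices belong to the same mirror pair.
 */
#define NDEV 4

/* A,B -> pair 0; C,D -> pair 1 */
static const int mirror_of[NDEV] = { 0, 0, 1, 1 };

static int survivable_two_failures(void)
{
    int survived = 0;

    for (int i = 0; i < NDEV; i++)
        for (int j = i + 1; j < NDEV; j++)
            if (mirror_of[i] != mirror_of[j])
                survived++;     /* different pairs: each pair still has a copy */
    return survived;
}
```

Of the C(4,2) = 6 possible failure pairs, 4 are survivable and 2 (A+B, C+D) are fatal, so a fixed-pairing RAID-10 survives a random two-device failure with probability 2/3; extending `mirror_of` lets you enumerate larger or odd-device layouts the same way.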
Re: How does btrfs handle bad blocks in raid1?
> claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive > an arbitrary 2-device failure is incorrect. Yes, you are right. I didn't mean "any 2 devices". I should have added "from different mirrors" :)
Re: How does btrfs handle bad blocks in raid1?
On Jan 9, 2014, at 5:52 AM, Austin S Hemmelgarn wrote: > Just a thought, you might consider running btrfs on top of LVM in the > interim, it isn't quite as efficient as btrfs by itself, but it does > allow N-way mirroring (and the efficiency is much better now that they > have switched to RAID1 as the default mirroring backend) The problem is that in case of mismatches, it's ambiguous which copies are correct. Chris Murphy
Re: How does btrfs handle bad blocks in raid1?
On Jan 9, 2014, at 9:49 AM, George Eleftheriou wrote: > > I'm really looking forward to the day that typing: > > mkfs.btrfs -d raid10 -m raid10 /dev/sd[abcd] > > will do exactly what is expected to do. A true RAID10 resilient in 2 > disks' failure. Simple and beautiful. How is a resilient 2 disk failure possible with four disk raid10? Chris Murphy
Re: How does btrfs handle bad blocks in raid1?
On Thu, Jan 09, 2014 at 05:49:48PM +0100, George Eleftheriou wrote: > Duncan, > > As a silent reader of this list (for almost a year)... > As an anonymous supporter of the BAARF (Battle Against Any RAID > Four/Five/Six/ Z etc...) initiative... > > I can only break my silence and applaud your frequent interventions > referring to N-Way mirroring (searching the list for the string > "n-way" brings up almost exclusively your posts, at least in recent > times). > > Because that's what I' m also eager to see implemented in BTRFS and > somehow felt disappointed that it wasn't given priority over the > parity solutions... > > I currently use ZFS on Linux in a 4-disk RAID10 (performing pretty > good by the way) being stuck with the 3.11 kernel because of DKMS > issues and not being able to share by SMB or NFS because of some bugs. > > I'm really looking forward to the day that typing: > > mkfs.btrfs -d raid10 -m raid10 /dev/sd[abcd] > > will do exactly what is expected to do. A true RAID10 resilient in 2 > disks' failure. Simple and beautiful. RAID-10 isn't guaranteed to be robust against two devices failing. Not just the btrfs implementation -- any RAID-10 will die if the wrong two devices fail. In the simplest case: A } B } Mirrored} } C } } D } Mirrored} Striped } E } } F } Mirrored} If A and B both die, then you're stuffed. (For the four-disk case, just remove E and F from the diagram). If you want to talk odds, then that's OK, I'll admit that btrfs doesn't necessarily do as well there(*) as the scheme above. But claiming that RAID-10 (with 2-way mirroring) is guaranteed to survive an arbitrary 2-device failure is incorrect. Hugo. (*) Actually, I suspect that with even numbers of equal-sized disks, it'll do just as well, but I'm not willing to guarantee that behaviour without hacking up the allocator a bit to add the capability. > We're almost there... > > Best regards to all BTRFS developers/contributors -- === Hugo Mills: hugo@... 
carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- But people have always eaten people, / what else is there to --- eat? / If the Juju had meant us not to eat people / he wouldn't have made us of meat.
Re: How does btrfs handle bad blocks in raid1?
Duncan, As a silent reader of this list (for almost a year)... As an anonymous supporter of the BAARF (Battle Against Any RAID Four/Five/Six/ Z etc...) initiative... I can only break my silence and applaud your frequent interventions referring to N-Way mirroring (searching the list for the string "n-way" brings up almost exclusively your posts, at least in recent times). Because that's what I'm also eager to see implemented in BTRFS and somehow felt disappointed that it wasn't given priority over the parity solutions... I currently use ZFS on Linux in a 4-disk RAID10 (performing pretty well, by the way), being stuck with the 3.11 kernel because of DKMS issues and not being able to share by SMB or NFS because of some bugs. I'm really looking forward to the day that typing: mkfs.btrfs -d raid10 -m raid10 /dev/sd[abcd] will do exactly what is expected to do. A true RAID10 resilient in 2 disks' failure. Simple and beautiful. We're almost there... Best regards to all BTRFS developers/contributors
Re: How does btrfs handle bad blocks in raid1?
Austin S Hemmelgarn posted on Thu, 09 Jan 2014 07:52:44 -0500 as excerpted: > On 2014-01-09 07:41, Duncan wrote: >> Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted: >> >>> If a [btrfs ]block is read and fails its checksum, then the other >>> copy (in RAID-1) is checked and used if it's good. The bad copy is >>> rewritten to use the good data. >> >> This is why I'm so looking forward to the planned N-way-mirroring, >> aka true-raid-1, feature, as opposed to btrfs' current 2-way-only >> mirroring. Having checksumming is good, and a second copy in case >> one fails the checksum is nice, but what if they BOTH do? I'd love >> to have the choice of (at least) three-way-mirroring, as for me >> that seems the best practical hassle/cost vs. risk balance I >> could get, but it's not yet possible. =:^( >> > Just a thought, you might consider running btrfs on top of LVM in the > interim, it isn't quite as efficient as btrfs by itself, but it does > allow N-way mirroring (and the efficiency is much better now that they > have switched to RAID1 as the default mirroring backend) Except... AFAIK LVM is like mdraid in that regard -- no checksums, leaving the software entirely at the mercy of the hardware's ability to detect and properly report failure. In fact, it's exactly as bad as that, since while both lvm and mdraid offer N-way-mirroring, they generally only fetch a single unchecksummed copy from whatever mirror they happen to choose to request it from, and use whatever they get without even a comparison against the other copies to see if they match or majority vote on which is the valid copy if something doesn't match. The ONLY way they know there's an error (unless the hardware reports one) at all is if a deliberate scrub is done. And the raid5/6 parity-checking isn't any better, as while those parities are written, they're never checked or otherwise actually used except in recovery. 
Normal read operation is just like raid0; only the device(s) containing the data itself is(are) read, no parity/checksum checking at all, even tho the trouble was taken to calculate and write it out. When I had mdraid6 deployed and realized that, I switched back to raid1 (which would have been raid10 on a larger system), because while I considered the raid6 performance costs worth it for parity checking, they most definitely weren't once I realized all those calculates and writes were for nothing unless an actual device died, and raid1 gave me THAT level of protection at far better performance. Which means neither lvm nor mdraid solve the problem at all. Even btrfs on top of them won't solve the problem, while adding all sorts of complexity, because btrfs still has only the two-way check, and if one device gets corrupted in the underlying mirrors but another actually returns the data, btrfs will be entirely oblivious. What one /could/ in theory do at the moment, altho it's hardly worth it due to the complexity[1] and the fact that btrfs itself is still a relatively immature filesystem under heavy development, and thus not suited to being part of such extreme solutions yet, is layered raid1 btrfs on loopback over raid1 btrfs, say four devices, separate on-the-hardware-device raid1 btrfs on two pairs, with a single huge loopback-file on each lower-level btrfs, with raid1 btrfs layered on top of the loopback devices, too, manually creating an effective 4-device btrfs raid11. Or use btrfs raid10 at one or the other level and make it an 8-device btrfs raid101 or raid110. Tho as I said btrfs maturity level in general is a mismatch for such extreme measures, at present. But in theory... 
Zfs is arguably a more practically viable solution as it's mature and ready for deployment today, tho there's legal/license issues with the Linux kernel module and the usual userspace performance issues (tho the btrfs-on-loopback-on-btrfs solution above wouldn't be performance issue free either) with the fuse alternative. I'm sure that's why a lot of folks needing multi-mirror checksum-verified reliability remain on Solaris/OpenIndiana/ZFS-on-BSD, as Linux simply doesn't /have/ a solution for that yet. Btrfs /will/ have it, but as I explained, it's taking awhile. --- [1] Complexity: Complexity can be the PRIMARY failure factor when an admin must understand enough about the layout to reliably manage recovery when they're already under the extreme pressure of a disaster recovery situation. If complexity in even an otherwise 100% reliable solution is high enough that an admin isn't confident of his ability to manage it, then the admin themself becomes the weak link in the reliability chain!! That's the reason I tried and ultimately dropped lvm over mdraid here, since I couldn't be confident in my ability to understand both well enough to recover from disaster without admin error. Thus, higher complexity really *IS* a SERIOUS negative in this sort of discussion, since it can be *T
Re: How does btrfs handle bad blocks in raid1?
On Thu, 2014-01-09 at 12:41 +, Duncan wrote: > Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted: > > > On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote: > >> Hi, > >> > >> I am running write-intensive (well sort of, one write every 10s) > >> workloads on cheap flash media which proved to be horribly unreliable. > >> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen > >> drive returns bogus data without any warning at all. > >> > >> So I wonder, how would btrfs behave in raid1 on two such devices? Would > >> it simply mark bad blocks as "bad" and continue to be operational, or > >> will it bail out when some block can not be read/written anymore on one > >> of the two devices? > > > > If a block is read and fails its checksum, then the other copy (in > > RAID-1) is checked and used if it's good. The bad copy is rewritten to > > use the good data. > > This is why I'm (semi-impatiently, but not being a coder, I have little > choice, and I do see advances happening) so looking forward to the > planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' > current 2-way-only mirroring. Having checksumming is good, and a second > copy in case one fails the checksum is nice, but what if they BOTH do? > I'd love to have the choice of (at least) three-way-mirroring, as for me > that seems the best practical hassle/cost vs. risk balance I could get, > but it's not yet possible. =:^( > > For (at least) year now, the roadmap has had N-way-mirroring on the list > for after raid5/6 as they want to build on its features, but (like much > of the btrfs work) raid5/6 took about three kernels longer to introduce > than originally thought, and even when introduced, the raid5/6 feature > lacked some critical parts (like scrub) and wasn't considered real-world > usable as integrity over a crash and/or device failure, the primary > feature of raid5/6, couldn't be assured. I'm frustrated too that I haven't pushed this out yet. 
I've been trying different methods to keep the performance up and in the end tried to pile on too many other features in the patches. So, I'm breaking it up a bit and reworking things for faster release. -chris
Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels
On Thu, 2014-01-09 at 12:16 +, Hugo Mills wrote: > On Thu, Jan 09, 2014 at 12:49:48PM +0100, Stefan Behrens wrote: > > On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote: > > > Some users complaint that with latest btrfs-progs, they will > > > fail to use send/receive. The problem is new tool will try > > > to use uuid tree while it dosen't work on older kernel. > > > > > > Now we first check if we support uuid tree, if not we fall into > > > normal search as previous way.i copy most of codes from Alexander > > > Block's previous codes and did some adjustments to make it work. > > > > > > Signed-off-by: Alexander Block > > > Signed-off-by: Wang Shilong > > > --- > > > send-utils.c | 352 > > > ++- > > > send-utils.h | 11 ++ > > > 2 files changed, 359 insertions(+), 4 deletions(-) > > > > I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is > > found. The code that you add will never be tested by anyone and will > > become broken sooner or later. > > > > The new kernel is compatible to old progs and to new progs. But new > > progs require a new kernel and IMO this is normal. > > No. Really, no. I think I would be extremely upset to upgrade, say, > util-linux, only to discover that I needed a new kernel for cp to > continue to work. I hope you would be, too. > > You may need to upgrade the kernel to get new features offered by a > new userspace, but I think we should absolutely not be changing > userspace in a way that makes it incompatible with older kernels. I'd really prefer that we maintain compatibility with the older kernels. Heavy btrfs usage is going to want a newer kernel anyway, but this is an important policy to keep in place for the future. Especially since Wang went to the trouble of making the patch, I'd rather take it. -chris
Re: How does btrfs handle bad blocks in raid1?
On 2014-01-09 07:41, Duncan wrote: > Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted: > >> On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote: >>> Hi, >>> >>> I am running write-intensive (well sort of, one write every 10s) >>> workloads on cheap flash media which proved to be horribly unreliable. >>> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen >>> drive returns bogus data without any warning at all. >>> >>> So I wonder, how would btrfs behave in raid1 on two such devices? Would >>> it simply mark bad blocks as "bad" and continue to be operational, or >>> will it bail out when some block can not be read/written anymore on one >>> of the two devices? >> >> If a block is read and fails its checksum, then the other copy (in >> RAID-1) is checked and used if it's good. The bad copy is rewritten to >> use the good data. > > This is why I'm (semi-impatiently, but not being a coder, I have little > choice, and I do see advances happening) so looking forward to the > planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' > current 2-way-only mirroring. Having checksumming is good, and a second > copy in case one fails the checksum is nice, but what if they BOTH do? > I'd love to have the choice of (at least) three-way-mirroring, as for me > that seems the best practical hassle/cost vs. risk balance I could get, > but it's not yet possible. =:^( > > For (at least) year now, the roadmap has had N-way-mirroring on the list > for after raid5/6 as they want to build on its features, but (like much > of the btrfs work) raid5/6 took about three kernels longer to introduce > than originally thought, and even when introduced, the raid5/6 feature > lacked some critical parts (like scrub) and wasn't considered real-world > usable as integrity over a crash and/or device failure, the primary > feature of raid5/6, couldn't be assured. 
That itself was about three > kernels ago now, and the raid5/6 functionality remains partial -- it > writes the data and parities as it should, but scrub and recovery remain > only partially coded, so it looks like that'll /still/ be a few more > kernels before that's fully implemented and most bugs worked out, with > very likely a similar story to play out for N-way-mirroring after that, > thus placing it late this year for introduction and early next for > actually usable stability. > > But it remains on the roadmap and btrfs should have it... eventually. > Meanwhile, I keep telling myself that this is filesystem code which a LOT > of folks including me stake the survival of their data on, and I along > with all the others definitely prefer it done CORRECTLY, even if it takes > TEN years longer than intended, than have it sloppily and unreliably > implemented sooner. > > But it's still hard to wait, when sometimes I begin to think of it like > that carrot suspended in front of the donkey, never to actually be > reached. Except... I *DO* see changes, and after originally taking off > for a few months after my original btrfs investigation, finding it > unusable in its then-current state, upon coming back about 5 months > later, actual usability and stability on current features had improved to > the point that I'm actually using it now, so there's certainly progress > being made, and the fact that I'm actually using it now attests to that > progress *NOT* being a simple illusion. So it'll come, even if it /does/ > sometimes seem it's Duke-Nukem-Forever. 
Just a thought, you might consider running btrfs on top of LVM in the interim, it isn't quite as efficient as btrfs by itself, but it does allow N-way mirroring (and the efficiency is much better now that they have switched to RAID1 as the default mirroring backend)
[PATCH] Btrfs-progs: fix to make list specified directory's subvolumes work
Steps to reproduce: # mkfs.btrfs -f /dev/sda8 # mount /dev/sda8 /mnt # mkdir /mnt/subvolumes # btrfs sub create /mnt/subvolumes/subv1 # btrfs sub create /mnt/subvolumes/subv1/subv1.1 # btrfs sub list -o /mnt/subvolumes/subv1path); len = add_len; } + if (!ri->top_id) + ri->top_id = found->ref_tree; next = found->ref_tree; - - if (next == top_id) { - ri->top_id = top_id; + if (next == top_id) break; - } - /* * if the ref_tree = BTRFS_FS_TREE_OBJECTID, * we are at the top */ - if (next == BTRFS_FS_TREE_OBJECTID) { - ri->top_id = next; + if (next == BTRFS_FS_TREE_OBJECTID) break; - } - /* * if the ref_tree wasn't in our tree of roots, the * subvolume was deleted. -- 1.8.3.1
Re: How does btrfs handle bad blocks in raid1?
Hugo Mills posted on Thu, 09 Jan 2014 10:42:47 + as excerpted: > On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote: >> Hi, >> >> I am running write-intensive (well sort of, one write every 10s) >> workloads on cheap flash media which proved to be horribly unreliable. >> A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen >> drive returns bogus data without any warning at all. >> >> So I wonder, how would btrfs behave in raid1 on two such devices? Would >> it simply mark bad blocks as "bad" and continue to be operational, or >> will it bail out when some block can not be read/written anymore on one >> of the two devices? > > If a block is read and fails its checksum, then the other copy (in > RAID-1) is checked and used if it's good. The bad copy is rewritten to > use the good data. This is why I'm (semi-impatiently, but not being a coder, I have little choice, and I do see advances happening) so looking forward to the planned N-way-mirroring, aka true-raid-1, feature, as opposed to btrfs' current 2-way-only mirroring. Having checksumming is good, and a second copy in case one fails the checksum is nice, but what if they BOTH do? I'd love to have the choice of (at least) three-way-mirroring, as for me that seems the best practical hassle/cost vs. risk balance I could get, but it's not yet possible. =:^( For (at least) a year now, the roadmap has had N-way-mirroring on the list for after raid5/6 as they want to build on its features, but (like much of the btrfs work) raid5/6 took about three kernels longer to introduce than originally thought, and even when introduced, the raid5/6 feature lacked some critical parts (like scrub) and wasn't considered real-world usable as integrity over a crash and/or device failure, the primary feature of raid5/6, couldn't be assured. 
That itself was about three kernels ago now, and the raid5/6 functionality remains partial -- it writes the data and parities as it should, but scrub and recovery remain only partially coded, so it looks like that'll /still/ be a few more kernels before that's fully implemented and most bugs worked out, with very likely a similar story to play out for N-way-mirroring after that, thus placing it late this year for introduction and early next for actually usable stability. But it remains on the roadmap and btrfs should have it... eventually. Meanwhile, I keep telling myself that this is filesystem code which a LOT of folks including me stake the survival of their data on, and I along with all the others definitely prefer it done CORRECTLY, even if it takes TEN years longer than intended, than have it sloppily and unreliably implemented sooner. But it's still hard to wait, when sometimes I begin to think of it like that carrot suspended in front of the donkey, never to actually be reached. Except... I *DO* see changes, and after originally taking off for a few months after my original btrfs investigation, finding it unusable in its then-current state, upon coming back about 5 months later, actual usability and stability on current features had improved to the point that I'm actually using it now, so there's certainly progress being made, and the fact that I'm actually using it now attests to that progress *NOT* being a simple illusion. So it'll come, even if it /does/ sometimes seem it's Duke-Nukem-Forever. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: REQ: btrfs list option
On 01/09/2014 12:06 PM, Alex wrote:
> Chris Murphy colorremedies.com> writes:
>> Specify the mount point for the Btrfs file system and it will list all
>> subvols on that file system.
>
> Thank you Chris. When I do that on my version of the 3.12 userland:
>
> # btrfs sub list / -o
>
> returns nothing (with no error), which I wasn't quite expecting, because
> there *are* other snapshots and subvols below '/'. AND:
>
> # btrfs sub list / -s
>
> correctly lists the snapshots only. I don't understand what, or if, I'm
> doing something wrong. Thank you in advance.

There is a bug in 'btrfs sub list -o path'; I will send a patch for this.

Thanks,
Wang
Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels
On Thu, Jan 09, 2014 at 12:49:48PM +0100, Stefan Behrens wrote:
> On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:
> > Some users complained that with the latest btrfs-progs, they
> > fail to use send/receive. The problem is that the new tool tries
> > to use the uuid tree, which doesn't work on older kernels.
> >
> > Now we first check whether the uuid tree is supported; if not, we fall
> > back to the normal search used previously. I copied most of this code
> > from Alexander Block's earlier code and made some adjustments to make
> > it work.
> >
> > Signed-off-by: Alexander Block
> > Signed-off-by: Wang Shilong
> > ---
> >  send-utils.c | 352 ++-
> >  send-utils.h | 11 ++
> >  2 files changed, 359 insertions(+), 4 deletions(-)
>
> I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is
> found. The code that you add will never be tested by anyone and will
> become broken sooner or later.
>
> The new kernel is compatible with old progs and with new progs. But new
> progs require a new kernel and IMO this is normal.

No. Really, no.

I think I would be extremely upset to upgrade, say, util-linux, only to discover that I needed a new kernel for cp to continue to work. I hope you would be, too. You may need to upgrade the kernel to get new features offered by a new userspace, but I think we should absolutely not be changing userspace in a way that makes it incompatible with older kernels.

If that involves lots of fallback code that checks "is this ioctl available? if not, use this method instead", then so be it. At this point, a rewrite of code in the userspace tools should _not_ logically remove the old code it is replacing, but should keep the old behaviour to use when the new kernel interfaces it relies on are not present.

> A printf is friendly enough in this case.
>
> IMHO maintaining compatibility in progs to old kernels should be limited
> to code that is small enough to not cost effort and problems in the
> future.
The fallback code will remain stable and should require minimal maintenance, because it will only get run on older kernels -- which, by definition, won't be changing much. As long as there is no great refactoring of the old code (maybe a comment to mark it as legacy support, and that it shouldn't be reworked heavily?), I don't see a major problem here, even for quite large chunks of code. Hugo. > > diff --git a/send-utils.c b/send-utils.c > > index 874f8a5..1772d2c 100644 > > --- a/send-utils.c > > +++ b/send-utils.c > > @@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 > > root_id, > > return 0; > > } > > > > +static struct rb_node *tree_insert(struct rb_root *root, > > + struct subvol_info *si, > > + enum subvol_search_type type) > > +{ > > + struct rb_node **p = &root->rb_node; > > + struct rb_node *parent = NULL; > > + struct subvol_info *entry; > > + __s64 comp; > > + > > + while (*p) { > > + parent = *p; > > + if (type == subvol_search_by_received_uuid) { > > + entry = rb_entry(parent, struct subvol_info, > > + rb_received_node); > > + > > + comp = memcmp(entry->received_uuid, si->received_uuid, > > + BTRFS_UUID_SIZE); > > + if (!comp) { > > + if (entry->stransid < si->stransid) > > + comp = -1; > > + else if (entry->stransid > si->stransid) > > + comp = 1; > > + else > > + comp = 0; > > + } > > + } else if (type == subvol_search_by_uuid) { > > + entry = rb_entry(parent, struct subvol_info, > > + rb_local_node); > > + comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE); > > + } else if (type == subvol_search_by_root_id) { > > + entry = rb_entry(parent, struct subvol_info, > > + rb_root_id_node); > > + comp = entry->root_id - si->root_id; > > + } else if (type == subvol_search_by_path) { > > + entry = rb_entry(parent, struct subvol_info, > > + rb_path_node); > > + comp = strcmp(entry->path, si->path); > > + } else { > > + BUG(); > > + } > > + > > + if (comp < 0) > > + p = &(*p)->rb_left; > > + else if (comp > 0) > > + p = &(*p)->rb_right; 
> > + else > > + return parent; > > + } > > + > > + if (type == subvol_search_by_received_uuid) { > > + rb_link_node(&si->rb_received_node, parent, p); > > + rb_insert_color(&s
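Hugo's "is this ioctl available? if not, use this method instead" pattern can be sketched generically. The following toy shell script is only an illustration of the pattern, not btrfs-progs code: `new_method`, `old_method`, and `resolve` are hypothetical stand-ins for the uuid-tree lookup and the older search path.

```shell
#!/bin/sh
# Toy model of "probe the new kernel interface, fall back to the old
# code path when it is missing". All functions are hypothetical.

new_method() {
    # Stand-in for a lookup that needs the uuid tree (kernel >= 3.12).
    # Here we simulate running on an older kernel: the probe fails.
    return 1
}

old_method() {
    # Stand-in for the slower pre-3.12 search that works everywhere.
    echo "resolved via fallback"
}

resolve() {
    # Try the new interface first; on failure, silently fall back.
    new_method 2>/dev/null || old_method
}

resolve    # prints "resolved via fallback" on the simulated old kernel
```

The point of Hugo's argument is that `resolve` keeps working on every kernel; the printf("Needs kernel 3.12 or better\n") alternative would make it fail outright on anything older.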
Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels
Hi Stefan,

On 01/09/2014 07:49 PM, Stefan Behrens wrote:
> On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:
>> Some users complained that with the latest btrfs-progs, they
>> fail to use send/receive. The problem is that the new tool tries
>> to use the uuid tree, which doesn't work on older kernels.
>>
>> Now we first check whether the uuid tree is supported; if not, we fall
>> back to the normal search used previously. I copied most of this code
>> from Alexander Block's earlier code and made some adjustments to make
>> it work.
>>
>> Signed-off-by: Alexander Block
>> Signed-off-by: Wang Shilong
>> ---
>>  send-utils.c | 352 ++-
>>  send-utils.h | 11 ++
>>  2 files changed, 359 insertions(+), 4 deletions(-)
>
> I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is
> found. The code that you add will never be tested by anyone and will
> become broken sooner or later.
>
> The new kernel is compatible with old progs and with new progs. But new
> progs require a new kernel and IMO this is normal. A printf is friendly
> enough in this case.
>
> IMHO maintaining compatibility in progs to old kernels should be limited
> to code that is small enough to not cost effort and problems in the
> future.

Firstly, I'd like to say sorry that I forgot to CC you. Both ways are OK for me; let's wait to see what other people's opinions are about this.
^_^ Thanks, Wang diff --git a/send-utils.c b/send-utils.c index 874f8a5..1772d2c 100644 --- a/send-utils.c +++ b/send-utils.c @@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 root_id, return 0; } +static struct rb_node *tree_insert(struct rb_root *root, + struct subvol_info *si, + enum subvol_search_type type) +{ + struct rb_node **p = &root->rb_node; + struct rb_node *parent = NULL; + struct subvol_info *entry; + __s64 comp; + + while (*p) { + parent = *p; + if (type == subvol_search_by_received_uuid) { + entry = rb_entry(parent, struct subvol_info, + rb_received_node); + + comp = memcmp(entry->received_uuid, si->received_uuid, + BTRFS_UUID_SIZE); + if (!comp) { + if (entry->stransid < si->stransid) + comp = -1; + else if (entry->stransid > si->stransid) + comp = 1; + else + comp = 0; + } + } else if (type == subvol_search_by_uuid) { + entry = rb_entry(parent, struct subvol_info, + rb_local_node); + comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE); + } else if (type == subvol_search_by_root_id) { + entry = rb_entry(parent, struct subvol_info, + rb_root_id_node); + comp = entry->root_id - si->root_id; + } else if (type == subvol_search_by_path) { + entry = rb_entry(parent, struct subvol_info, + rb_path_node); + comp = strcmp(entry->path, si->path); + } else { + BUG(); + } + + if (comp < 0) + p = &(*p)->rb_left; + else if (comp > 0) + p = &(*p)->rb_right; + else + return parent; + } + + if (type == subvol_search_by_received_uuid) { + rb_link_node(&si->rb_received_node, parent, p); + rb_insert_color(&si->rb_received_node, root); + } else if (type == subvol_search_by_uuid) { + rb_link_node(&si->rb_local_node, parent, p); + rb_insert_color(&si->rb_local_node, root); + } else if (type == subvol_search_by_root_id) { + rb_link_node(&si->rb_root_id_node, parent, p); + rb_insert_color(&si->rb_root_id_node, root); + } else if (type == subvol_search_by_path) { + rb_link_node(&si->rb_path_node, parent, p); + rb_insert_color(&si->rb_path_node, root); + 
} + return NULL; +} + int btrfs_subvolid_resolve(int fd, char *path, size_t path_len, u64 subvol_id) { if (path_len < 1) @@ -255,13 +320,101 @@ static int btrfs_subvolid_resolve_sub(int fd, char *path, size_t *path_len, return 0; } +static int count_bytes(void *buf, int len, char b) +{ + int cnt = 0; + int i; + + for (i = 0; i < len; i++) { + if (((char *)buf)[i] == b) + cnt++; + } + return cnt; +} + void subvol_uuid_search_add(struct subvol_uuid_
Re: [PATCH] Btrfs-progs: make send/receive compatible with older kernels
On Thu, 9 Jan 2014 18:52:38 +0800, Wang Shilong wrote:
> Some users complained that with the latest btrfs-progs, they
> fail to use send/receive. The problem is that the new tool tries
> to use the uuid tree, which doesn't work on older kernels.
>
> Now we first check whether the uuid tree is supported; if not, we fall
> back to the normal search used previously. I copied most of this code
> from Alexander Block's earlier code and made some adjustments to make
> it work.
>
> Signed-off-by: Alexander Block
> Signed-off-by: Wang Shilong
> ---
>  send-utils.c | 352 ++-
>  send-utils.h | 11 ++
>  2 files changed, 359 insertions(+), 4 deletions(-)

I'd prefer a printf("Needs kernel 3.12 or better\n") if no UUID tree is found. The code that you add will never be tested by anyone and will become broken sooner or later.

The new kernel is compatible with old progs and with new progs. But new progs require a new kernel and IMO this is normal. A printf is friendly enough in this case.

IMHO maintaining compatibility in progs to old kernels should be limited to code that is small enough to not cost effort and problems in the future.
> > diff --git a/send-utils.c b/send-utils.c > index 874f8a5..1772d2c 100644 > --- a/send-utils.c > +++ b/send-utils.c > @@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 root_id, > return 0; > } > > +static struct rb_node *tree_insert(struct rb_root *root, > +struct subvol_info *si, > +enum subvol_search_type type) > +{ > + struct rb_node **p = &root->rb_node; > + struct rb_node *parent = NULL; > + struct subvol_info *entry; > + __s64 comp; > + > + while (*p) { > + parent = *p; > + if (type == subvol_search_by_received_uuid) { > + entry = rb_entry(parent, struct subvol_info, > + rb_received_node); > + > + comp = memcmp(entry->received_uuid, si->received_uuid, > + BTRFS_UUID_SIZE); > + if (!comp) { > + if (entry->stransid < si->stransid) > + comp = -1; > + else if (entry->stransid > si->stransid) > + comp = 1; > + else > + comp = 0; > + } > + } else if (type == subvol_search_by_uuid) { > + entry = rb_entry(parent, struct subvol_info, > + rb_local_node); > + comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE); > + } else if (type == subvol_search_by_root_id) { > + entry = rb_entry(parent, struct subvol_info, > + rb_root_id_node); > + comp = entry->root_id - si->root_id; > + } else if (type == subvol_search_by_path) { > + entry = rb_entry(parent, struct subvol_info, > + rb_path_node); > + comp = strcmp(entry->path, si->path); > + } else { > + BUG(); > + } > + > + if (comp < 0) > + p = &(*p)->rb_left; > + else if (comp > 0) > + p = &(*p)->rb_right; > + else > + return parent; > + } > + > + if (type == subvol_search_by_received_uuid) { > + rb_link_node(&si->rb_received_node, parent, p); > + rb_insert_color(&si->rb_received_node, root); > + } else if (type == subvol_search_by_uuid) { > + rb_link_node(&si->rb_local_node, parent, p); > + rb_insert_color(&si->rb_local_node, root); > + } else if (type == subvol_search_by_root_id) { > + rb_link_node(&si->rb_root_id_node, parent, p); > + rb_insert_color(&si->rb_root_id_node, root); > + } else if (type == 
subvol_search_by_path) { > + rb_link_node(&si->rb_path_node, parent, p); > + rb_insert_color(&si->rb_path_node, root); > + } > + return NULL; > +} > + > int btrfs_subvolid_resolve(int fd, char *path, size_t path_len, u64 > subvol_id) > { > if (path_len < 1) > @@ -255,13 +320,101 @@ static int btrfs_subvolid_resolve_sub(int fd, char > *path, size_t *path_len, > return 0; > } > > +static int count_bytes(void *buf, int len, char b) > +{ > + int cnt = 0; > + int i; > + > + for (i = 0; i < len; i++) { > + if (((char *)buf)[i] == b) > + cnt++; > + } > + return cnt; > +} > + > void subvol_uuid_search_add(struct subvol_uuid_search *s, > struct subvol_info *si) > { > - if (si) { > - free(si->path); > - free(si); > +
Re: Problems with incremental send/receive
Hi Wang,

thank you for your answer.

I am using the latest btrfs-progs with the 3.12 kernel. I don't have access to the machine right now (it looks like it crashed :/) but I can send the exact versions when I'm home.

Regards,
Felix

On Thu, Jan 9, 2014 at 3:10 AM, Wang Shilong wrote:
> Hi Felix,
>
> It seems someone reported this problem before. The problem in your case
> below is that you are using the latest btrfs-progs (v3.12?), which needs
> a kernel update; kernel 3.12 is ok.
>
> However, I think btrfs-progs should keep compatibility, so I will send a
> patch to make things more friendly.
>
> Thanks,
> Wang
>
> On 01/09/2014 06:04 AM, Felix Blanke wrote:
>>
>> Hi List,
>>
>> My backup stopped working and I can't figure out why. I'm using
>> send/receive with the "-p" switch for incremental backups, using the
>> last snapshot as a parent snapshot so that only the changed data is
>> sent.
>>
>> The problem occurs using my own backup script. After I discovered the
>> problem I did a quick test using the exact commands from the wiki, with
>> the same result: it doesn't work. Here is the output:
>>
>>
>> server ~ # ./test_snapshot.sh
>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/
>> /mnt/root1/snapshots/test
>> Create a readonly snapshot of '/mnt/root1/@root_home/' in
>> '/mnt/root1/snapshots/test'
>> ++ sync
>> ++ btrfs send /mnt/root1/snapshots/test
>> ++ btrfs receive /mnt/backup1/
>> At subvol /mnt/root1/snapshots/test
>> At subvol test
>> ++ btrfs subvolume snapshot -r /mnt/root1/@root_home/
>> /mnt/root1/snapshots/test_new
>> Create a readonly snapshot of '/mnt/root1/@root_home/' in
>> '/mnt/root1/snapshots/test_new'
>> ++ sync
>> ++ btrfs send -p /mnt/root1/snapshots/test /mnt/root1/snapshots/test_new
>> ++ btrfs receive /mnt/backup1/
>> At subvol /mnt/root1/snapshots/test_new
>> At snapshot test_new
>> ERROR: open @/test failed. No such file or directory
>>
>> I don't get where the "@/" in front of the snapshot name comes from.
>> It could be that I had a subvolume named @, but it doesn't exist
>> anymore and I don't understand why it would matter for send/receive.
>>
>> Some more details about the fs:
>>
>> server ~ # btrfs subvol list /mnt/root1/
>> ID 259 gen 568053 top level 5 path @root
>> ID 261 gen 568053 top level 5 path @var
>> ID 263 gen 568049 top level 5 path @home
>> ID 302 gen 568053 top level 5 path @owncloud_chroot
>> ID 421 gen 568038 top level 5 path @root_home
>> ID 30560 gen 563661 top level 5 path snapshots/home_2014-01-06-19:33_d
>> ID 30561 gen 563665 top level 5 path
>> snapshots/owncloud_chroot_2014-01-06-19:34_d
>> ID 30562 gen 563674 top level 5 path
>> snapshots/root_home_2014-01-06-19:38_d
>> ID 30563 gen 563675 top level 5 path snapshots/var_2014-01-06-19:39_d
>> ID 30564 gen 563697 top level 5 path snapshots/root_2014-01-06-19:50_d
>>
>> server ~ # btrfs subvol get-default /mnt/root1/
>> ID 5 (FS_TREE)
>>
>> server ~ # ls -l /mnt/root1/
>> total 0
>> drwxr-xr-x. 1 root root 30 May 10 2013 @home
>> drwxr-xr-x. 1 root root 134 Jan 5 19:27 @owncloud_chroot
>> drwxr-xr-x. 1 root root 204 Nov 24 18:16 @root
>> drwx------. 1 root root 468 Jan 8 22:47 @root_home
>> drwxr-xr-x. 1 root root 114 Oct 7 17:39 @var
>> drwx------. 1 root root 420 Jan 8 22:50 snapshots
>>
>>
>> Any ideas? Thanks in advance.
>>
>>
>> Regards,
>> Felix
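The constraint Felix is hitting is central to incremental send: the parent named with `-p` must already exist on the receiving side, or receive bails out (his "ERROR: open @/test failed" case). The following toy shell model uses plain directories in place of real snapshots and streams; `receive_incremental` is a made-up helper, not a btrfs command.

```shell
#!/bin/sh
# Toy model of incremental send/receive: "snapshots" are directories, a
# full stream is a copy, an incremental stream is a delta against a
# parent that must already exist at the destination.

src=$(mktemp -d); dst=$(mktemp -d)

mkdir "$src/test"                  # first read-only snapshot
echo v1 > "$src/test/file"

cp -r "$src/test" "$dst/test"      # full send | receive

mkdir "$src/test_new"              # second snapshot, one file changed
echo v2 > "$src/test_new/file"

# Incremental receive: refuse to apply the delta when the parent
# snapshot is missing on the receiving side.
receive_incremental() {
    parent="$dst/$1"; new="$dst/$2"
    if [ ! -d "$parent" ]; then
        echo "ERROR: open $1 failed. No such file or directory" >&2
        return 1
    fi
    cp -r "$parent" "$new"         # start from the shared parent...
    cp "$src/$2/file" "$new/file"  # ...then apply only the changes
}

receive_incremental test test_new && cat "$dst/test_new/file"  # prints "v2"
```

In real btrfs the parent is matched by UUID rather than by path, which is why a stray "@/" prefix in the resolved parent path makes the lookup fail even though the snapshot data is present.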
[PATCH] Btrfs-progs: make send/receive compatible with older kernels
Some users complained that with the latest btrfs-progs, they
fail to use send/receive. The problem is that the new tool tries
to use the uuid tree, which doesn't work on older kernels.

Now we first check whether the uuid tree is supported; if not, we fall
back to the normal search used previously. I copied most of this code
from Alexander Block's earlier code and made some adjustments to make
it work.

Signed-off-by: Alexander Block
Signed-off-by: Wang Shilong
---
 send-utils.c | 352 ++-
 send-utils.h | 11 ++
 2 files changed, 359 insertions(+), 4 deletions(-)

diff --git a/send-utils.c b/send-utils.c
index 874f8a5..1772d2c 100644
--- a/send-utils.c
+++ b/send-utils.c
@@ -159,6 +159,71 @@ static int btrfs_read_root_item(int mnt_fd, u64 root_id,
 	return 0;
 }
 
+static struct rb_node *tree_insert(struct rb_root *root,
+				   struct subvol_info *si,
+				   enum subvol_search_type type)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct subvol_info *entry;
+	__s64 comp;
+
+	while (*p) {
+		parent = *p;
+		if (type == subvol_search_by_received_uuid) {
+			entry = rb_entry(parent, struct subvol_info,
+					 rb_received_node);
+
+			comp = memcmp(entry->received_uuid, si->received_uuid,
+				      BTRFS_UUID_SIZE);
+			if (!comp) {
+				if (entry->stransid < si->stransid)
+					comp = -1;
+				else if (entry->stransid > si->stransid)
+					comp = 1;
+				else
+					comp = 0;
+			}
+		} else if (type == subvol_search_by_uuid) {
+			entry = rb_entry(parent, struct subvol_info,
+					 rb_local_node);
+			comp = memcmp(entry->uuid, si->uuid, BTRFS_UUID_SIZE);
+		} else if (type == subvol_search_by_root_id) {
+			entry = rb_entry(parent, struct subvol_info,
+					 rb_root_id_node);
+			comp = entry->root_id - si->root_id;
+		} else if (type == subvol_search_by_path) {
+			entry = rb_entry(parent, struct subvol_info,
+					 rb_path_node);
+			comp = strcmp(entry->path, si->path);
+		} else {
+			BUG();
+		}
+
+		if (comp < 0)
+			p = &(*p)->rb_left;
+		else if (comp > 0)
+			p = &(*p)->rb_right;
+		else
+			return parent;
+	}
+
+	if (type == subvol_search_by_received_uuid) {
+		rb_link_node(&si->rb_received_node, parent, p);
+		rb_insert_color(&si->rb_received_node, root);
+	} else if (type == subvol_search_by_uuid) {
+		rb_link_node(&si->rb_local_node, parent, p);
+		rb_insert_color(&si->rb_local_node, root);
+	} else if (type == subvol_search_by_root_id) {
+		rb_link_node(&si->rb_root_id_node, parent, p);
+		rb_insert_color(&si->rb_root_id_node, root);
+	} else if (type == subvol_search_by_path) {
+		rb_link_node(&si->rb_path_node, parent, p);
+		rb_insert_color(&si->rb_path_node, root);
+	}
+	return NULL;
+}
+
 int btrfs_subvolid_resolve(int fd, char *path, size_t path_len, u64 subvol_id)
 {
 	if (path_len < 1)
@@ -255,13 +320,101 @@ static int btrfs_subvolid_resolve_sub(int fd, char *path, size_t *path_len,
 	return 0;
 }
 
+static int count_bytes(void *buf, int len, char b)
+{
+	int cnt = 0;
+	int i;
+
+	for (i = 0; i < len; i++) {
+		if (((char *)buf)[i] == b)
+			cnt++;
+	}
+	return cnt;
+}
+
 void subvol_uuid_search_add(struct subvol_uuid_search *s,
 			    struct subvol_info *si)
 {
-	if (si) {
-		free(si->path);
-		free(si);
+	int cnt;
+
+	tree_insert(&s->root_id_subvols, si, subvol_search_by_root_id);
+	tree_insert(&s->path_subvols, si, subvol_search_by_path);
+
+	cnt = count_bytes(si->uuid, BTRFS_UUID_SIZE, 0);
+	if (cnt != BTRFS_UUID_SIZE)
+		tree_insert(&s->local_subvols, si, subvol_search_by_uuid);
+	cnt = count_bytes(si->received_uuid, BTRFS_UUID_SIZE, 0);
+	if (cnt != BTRFS_UUID_SIZE)
+		tree_insert(&s->received_subvols, si,
+			    subvol_search_by_received_uuid);
+}
+
+static struct subvol_info *tree_search(struct rb_root *root,
+
Re: How does btrfs handle bad blocks in raid1?
On Thu, Jan 09, 2014 at 11:26:26AM +0100, Clemens Eisserer wrote:
> Hi,
>
> I am running write-intensive (well, sort of: one write every 10s)
> workloads on cheap flash media which proved to be horribly unreliable.
> A 32GB microSDHC card reported bad blocks after 4 days, while a usb
> pen drive returns bogus data without any warning at all.
>
> So I wonder, how would btrfs behave in raid1 on two such devices?
> Would it simply mark bad blocks as "bad" and continue to be
> operational, or will it bail out when some block can not be
> read/written anymore on one of the two devices?

If a block is read and fails its checksum, then the other copy (in RAID-1) is checked and used if it's good. The bad copy is rewritten to use the good data.

If the block is bad such that writing to it won't fix it, then there are probably two cases: either the device returns an IO error, in which case I suspect (but can't be sure) that the FS will go read-only; or the device silently fails the write and claims success, in which case you're back to the situation above of the block failing its checksum.

There's no marking of bad blocks right now, and I don't know of anyone working on the feature, so the FS will probably keep going back to the bad blocks as it makes CoW copies for modification.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Trouble rather the tiger in his lair than the sage amongst his ---
books for to you kingdoms and their armies are mighty and enduring,
but to him they are but toys of the moment to be overturned by the
flicking of a finger.
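The read-repair logic Hugo describes can be modeled in a few lines of shell: keep two copies of a block plus a checksum recorded at write time; a read returns whichever copy verifies and rewrites the other mirror from it. This is only a toy with plain files and sha256sum; real btrfs stores per-block crc32c checksums in a dedicated checksum tree.

```shell
#!/bin/sh
# Toy model (plain files + sha256sum) of btrfs raid1 read-repair.
dir=$(mktemp -d)
printf 'hello world' > "$dir/copy_a"   # two mirrored copies of one block
printf 'hello world' > "$dir/copy_b"
csum=$(sha256sum < "$dir/copy_a" | cut -d' ' -f1)   # stored at write time

printf 'XXXXXXXXXXX' > "$dir/copy_a"   # silent corruption of one mirror

read_block() {
    for copy in "$dir/copy_a" "$dir/copy_b"; do
        if [ "$(sha256sum < "$copy" | cut -d' ' -f1)" = "$csum" ]; then
            # Good copy found: rewrite the sibling mirror with the good
            # data, then hand the data back to the caller.
            other="$dir/copy_a"
            [ "$copy" = "$dir/copy_a" ] && other="$dir/copy_b"
            cp "$copy" "$other"
            cat "$copy"
            return 0
        fi
    done
    return 1    # both mirrors failed their checksum: unrecoverable
}

read_block      # prints "hello world"; copy_a is silently repaired
```

With only two copies, the `return 1` branch is exactly the "what if they BOTH fail" case Duncan worries about in his reply; a third mirror would add one more iteration before giving up.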
How does btrfs handle bad blocks in raid1?
Hi,

I am running write-intensive (well, sort of: one write every 10s) workloads on cheap flash media which proved to be horribly unreliable. A 32GB microSDHC card reported bad blocks after 4 days, while a usb pen drive returns bogus data without any warning at all.

So I wonder, how would btrfs behave in raid1 on two such devices? Would it simply mark bad blocks as "bad" and continue to be operational, or will it bail out when some block can not be read/written anymore on one of the two devices?

Thank you in advance,
Clemens
Re: REQ: btrfs list option
Chris Murphy colorremedies.com> writes:
> Hmm, actually you might have found a bug.
>
> Small typo while we're at it, below should have one l.
>
> kernel-3.13.0-0.rc6.git0.1.fc21.x86_64
> btrfs-progs-3.12-1.fc20.x86_64
>
> Chris Murphy

Thank you muchly! I'm kinda glad, because I didn't really understand your second response in context. ;-) I would have been stuck had you not updated it!