Re: btrfs-progs: initial reference count of extent buffer is correct?
On Mon, Aug 25, 2014 at 02:26:49PM +0900, Naohiro Aota wrote:

Hi, list

I'm having trouble with my btrfs FS recently and ran btrfs check to try to fix the FS. Unfortunately, it aborted with:

btrfsck: root-tree.c:81: btrfs_update_root: Assertion `!(ret != 0)' failed.

This means the extent tree root was not found in the tree root tree! I added a btrfs_print_leaf() call there to see what was happening, and the keys listed were (... METADATA_ITEM 0) entries. So the tree root tree's root extent buffer had somehow been replaced by an extent buffer from the extent tree.

Reading the code, it seems that free_some_buffers() reclaims extent buffers allocated to root trees because they are never extent_buffer_get()ed (i.e. @refs == 1).

To reproduce this problem, try running the code below. The program first prints the root tree node's bytenr, then scans some trees. If your FS is large enough to trigger free_some_buffers(), the tree root node's bytenr after the scan will be different.

#include <stdio.h>
#include "ctree.h"
#include "disk-io.h"

void scan_tree(struct btrfs_root *root, struct extent_buffer *eb)
{
	u32 i;
	u32 nr;

	nr = btrfs_header_nritems(eb);
	if (btrfs_is_leaf(eb))
		return;

	u32 size = btrfs_level_size(root, btrfs_header_level(eb) - 1);
	for (i = 0; i < nr; i++) {
		u64 bytenr = btrfs_node_blockptr(eb, i);
		struct extent_buffer *next = read_tree_block(root, bytenr, size,
					btrfs_node_ptr_generation(eb, i));
		if (!next)
			continue;
		scan_tree(root, next);
	}
}

int main(int ac, char **av)
{
	struct btrfs_fs_info *info;
	struct btrfs_root *root;

	info = open_ctree_fs_info(av[1], 0, 0, OPEN_CTREE_PARTIAL);
	root = info->fs_root;

	printf("tree root %lld\n", info->tree_root->node->start);
	scan_tree(info->fs_root, info->extent_root->node);
	scan_tree(info->fs_root, info->csum_root->node);
	scan_tree(info->fs_root, info->fs_root->node);
	printf("tree root %lld\n", info->tree_root->node->start);
	return close_ctree(root);
}

On my environment, the above code prints the following result.
The tree root variable ends up pointing to a different extent!

$ ./btrfs-reproduce /dev/sda3
tree root 91393835008
tree root 49102848

I found that commit 53ee1bccf99cd5b474fe1aa857b7dd176e3a1407 changed the initial @refs to 1, stating that we don't call free_extent_buffer() enough times to reduce the eb's reference count to zero so that the eb can finally be freed, but I don't think this is correct. Even with an initial @refs == 2, a single free_extent_buffer() would drop @refs to 1 and thus let the eb be reclaimed by free_some_buffers(), so that does not seem to be the problem to me...

I think there is some conflict about how extent buffers are meant to be used: should __alloc_extent_buffer() set @refs = 2 for the caller, or should callers invoke extent_buffer_get() themselves everywhere an eb is allocated, before any other eb allocation, so that the first eb is not reclaimed?

How should this problem be fixed? Is reverting 53ee1bccf99cd5b474fe1aa857b7dd176e3a1407 the correct way, or should the missing extent_buffer_get() calls be added everywhere allocation is done?

You may want to think about it twice: commit 53ee1bccf99cd5b474fe1aa857b7dd176e3a1407 fixes a bug of assigning a free block to two different extent buffers, i.e. two different extent buffers sharing the same eb->start, so it's not just about bumping a reference count.

Right now we want to be consistent with the kernel side, where decreasing eb->refs to 0 means the eb will be freed, so dropping free_some_buffers() could be a good choice. And for caching extent buffers, we've increased eb->refs by 1 to keep them in the cache rbtree.

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: improve free space cache management and space allocation
While under random IO, a block group's free space cache eventually reaches a state where it has a mix of extent entries and bitmap entries representing free space regions. As free space regions are later returned to the cache, some of them are merged with existing extent entries when they are contiguous with them. Others are not merged, because, despite adjacent free space regions existing in the cache, the merging doesn't happen when the existing regions are represented in bitmap entries. And even when a new free space region is merged with an existing extent entry (enlarging the free space range it represents), we can end up with an enlarged region that is contiguous with some other region represented in a bitmap entry.

Both clustered and non-clustered space allocation work by iterating over our extent and bitmap entries and skipping any that represent a region smaller than the allocation request (giving preference to extent entries over bitmap entries). When a contiguous free space region is represented by 2 (or more) entries (a mix of extent and bitmap entries), we end up failing to satisfy an allocation request whose size is larger than the size of any single entry but no larger than the sum of their sizes. This makes the caller assume we're under an ENOSPC condition, or forces it to allocate multiple smaller space regions (as we do for file data writes), which adds extra overhead and increases the chance of fragmentation, since the smaller regions end up spread apart from each other (more likely under concurrency).
For example, if we have the following in the cache:

* an extent entry representing the free space range [128Mb - 256Kb, 128Mb[
* a bitmap entry covering the range [128Mb, 256Mb[, but with only the bits representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that space in this 128Mb area is marked as free

An allocation request for 1Mb, starting at an offset not greater than 128Mb - 256Kb, would fail before, despite the existence of such a contiguous free space area in the cache. The caller could only allocate up to 768Kb of space at once and later another 256Kb (or vice-versa). In between each smaller allocation request, another task working on a different file/inode might come in and take that space, preventing the former task from getting a contiguous 1Mb region of free space.

Therefore this change implements the ability to move free space from bitmap entries into existing and new free space regions represented with extent entries. This is done when a space region is added to the cache. A test that explains the issue in detail was added to the sanity tests too.

Some performance test results with compilebench on a 4-core machine with 32Gb of RAM, using an HDD, follow.
Test: compilebench -D /mnt -i 30 -r 1000 --makej

Before this change:

intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s)
compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s)
read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s)
delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s)

After this change:

intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s)
compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s)
read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s)
delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s)

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/free-space-cache.c       | 149 ++-
 fs/btrfs/tests/free-space-tests.c | 514 ++
 2 files changed, 662 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 2f0fe10..23632ba 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1951,6 +1951,137 @@ out:
 	return ret;
 }
 
+static void steal_from_bitmap_to_end(struct btrfs_free_space_ctl *ctl,
+				     struct btrfs_free_space *info,
+				     bool update_stat)
+{
+	struct btrfs_free_space *bitmap;
+	u64 bitmap_offset = info->offset;
+	unsigned long i;
+	unsigned long j;
+	const u64 end = info->offset + info->bytes;
+	u64 bytes;
+
+again:
+	bitmap = tree_search_offset(ctl, offset_to_bitmap(ctl, bitmap_offset),
+				    1, 0);
+	if (!bitmap)
+		goto out;
+
+	if (end < bitmap->offset || (bitmap->offset + bitmap->bytes < end))
+		return;
+
+	i = offset_to_bit(bitmap->offset, ctl->unit, end);
+	j = find_next_zero_bit(bitmap->bitmap, BITS_PER_BITMAP, i);
+	if (j == i)
+		return;
+	bytes = (j - i) * ctl->unit;
+	info->bytes += bytes;
+
+	if (update_stat)
+		bitmap_clear_bits(ctl, bitmap, end, bytes);
+	else
+		__bitmap_clear_bits(ctl, bitmap, end, bytes);
+
[PATCH] Btrfs: fix corruption after write/fsync failure + fsync + log recovery
While writing to a file, in inode.c:cow_file_range() (and the same applies to submit_compressed_extents()), after reserving an extent for the file data, we create a new extent map for the written range and insert it into the extent map cache. After that, we create an ordered operation, but if it fails (due to a transient/temporary ENOMEM), we return without dropping that extent map, which points to a reserved extent that is freed when we return. A subsequent incremental fsync (when the btrfs inode doesn't have the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and logs a file extent item based on it, which points to a disk extent that doesn't contain valid data - it was freed by us earlier, and at this point it might contain any random/garbage data.

Therefore, if we reach an error condition when cowing a file range after we added the new extent map to the cache, drop it from the cache before returning.

Some sequence of steps that lead to this:

$ mkfs.btrfs -f /dev/sdd
$ mount -o commit= /dev/sdd /mnt
$ cd /mnt

$ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
$ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
$ sync

$ od -t x1 foo
0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
*
0020000

$ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo

# Now this write + fsync fail with -ENOMEM, which was returned by
# btrfs_add_ordered_extent() in inode.c:cow_file_range().

$ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
$ xfs_io -c "fsync" foo
fsync: Cannot allocate memory

# Now do a new write + fsync, which will succeed. Our previous
# -ENOMEM was a transient/temporary error.
$ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
$ xfs_io -c "fsync" foo

# Our file content (in page cache) is now:

$ od -t x1 foo
0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
*
0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000

# Now reboot the machine, and mount the fs, so that fsync log replay
# takes place.

# The file content is now weird, in particular the first 8Kb, which
# do not match our data before nor after the sync command above.

$ od -t x1 foo
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000

# In fact these first 4Kb are a duplicate of the last 4kb block.
# The last write got an extent map/file extent item that points to
# the same disk extent that we got in the write+fsync that failed
# with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
# verify that:

$ btrfs-debug-tree /dev/sdd
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		extent data disk byte 12582912 nr 8192
		extent data offset 0 nr 8192 ram 8192
	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 8192 ram 8192
	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096

$ umount /dev/sdd
$ btrfsck /dev/sdd
Checking filesystem on /dev/sdd
UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
checking extents
extent item 12582912 has multiple extent items
ref mismatch on [12582912 4096] extent item 1, found 2
Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
backpointer mismatch on [12582912 4096]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
root 5 inode 257 errors 1000, some csum missing
found 131074 bytes used err is 1
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123404
file data blocks allocated: 274432
 referenced 274432
Btrfs v3.14.1-96-gcc7fd5a-dirty

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/inode.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c678dea..16e8146 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -792,8 +792,12 @@ retry:
 				ins.offset, BTRFS_ORDERED_COMPRESSED,
 				async_extent->compress_type);
-		if (ret)
+		if (ret) {
Re: btrfs restore memory corruption (bug: 82701)
On Mon, 2014-08-25 at 10:58 +0200, Marc Dietrich wrote:
Am Freitag 22 August 2014, 10:42:18 schrieb Marc Dietrich:
Am Freitag, 22. August 2014, 14:43:45 schrieb Gui Hecheng:
On Thu, 2014-08-21 at 16:19 +0200, Marc Dietrich wrote:
Am Donnerstag, 21. August 2014, 17:52:16 schrieb Gui Hecheng:
On Mon, 2014-08-18 at 11:25 +0200, Marc Dietrich wrote:

Hi, I did a checkout of the latest btrfs-progs to repair my damaged filesystem. Running btrfs restore gives me several "failed to inflate: -6" errors and crashes with some memory corruption. I ran it again with valgrind and got:

valgrind --log-file=x2 -v --leak-check=yes btrfs restore /dev/sda9 /mnt/backup
==8528== Memcheck, a memory error detector
==8528== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==8528== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==8528== Command: btrfs restore /dev/sda9 /mnt/backup
==8528== Parent PID: 8453
==8528==
==8528== Syscall param pwrite64(buf) points to uninitialised byte(s)
==8528==    at 0x59BE3C3: __pwrite_nocancel (in /lib64/libpthread-2.18.so)
==8528==    by 0x41F22F: search_dir (cmds-restore.c:392)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x4204B8: cmd_restore (cmds-restore.c:1284)
==8528==    by 0x4043FE: main (btrfs.c:286)
==8528== Address 0x66956a0 is 7,056 bytes inside a block of size 8,192 alloc'd
==8528==    at 0x4C277AB: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==8528==    by 0x41EEAD: search_dir (cmds-restore.c:316)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x41F8D0: search_dir (cmds-restore.c:895)
==8528==    by 0x4204B8: cmd_restore
(cmds-restore.c:1284)
==8528==    by 0x4043FE: main (btrfs.c:286)
---[snip]--- leaks ...
--

For the leak below... I've no idea why decompress_lzo() is not satisfied with @inbuf having the exact size of the disk bytes. Or maybe the compressed data had just suffered damage... BTW, when you wrote your data, did that kernel have the following commit for btrfs?

commit: 59516f6017c589e7316418fda6128ba8f829a77f

mmh, I used the master branch which is still on 3.14.2 (from k.org). Ah, there is a development branch on another repo (repo.or.cz). Why oh why?

Gui, sorry to quote an earlier mail, I forgot to add you as CC on your latest post and I'm not subscribed to the list.

There is a development branch for btrfs-progs from David: http://github.com/kdave/btrfs-progs.git if you would like to try.

ok, thanks, will try. But here, what I mean is your *kernel* version when you wrote your data.

I'm using btrfs since 3.14 or so (and maybe also some random distro kernel based on 3.11). The partition contained a lot of larger git trees and virtual machines - yes, not ideal for btrfs but a nice testcase ...

There is a change for btrfs-restore which depends on a kernel commit. If you wrote your data with an older kernel and use the 3.14.2 btrfs-progs to restore, then there may be mismatches.

wow. That should never happen, I think. Userspace should always be able to fix corruptions made by earlier kernels (except disk layout changes maybe).

Now, I am just suspecting such a scenario. Possible.

So how to proceed? If I checkout the latest btrfs-progs from the repo above and restore again, are you still interested in the results?

Ah, I think you could clone the progs from the repo and apply the two small pieces that I mentioned before. Yes, I am still trying to follow the issues with restore. It seems btrfs-restore needs more effort from btrfs developers, since it doesn't survive tough scenarios. It seems there are lots of people reporting corruptions on the list and also lots of fixes posted.
Maybe it's better to restart from new (format the partition) and report problems that happen after that. What do you think?

Oh, I think you've just found a really good test case for btrfs-restore. Maybe you could keep an image of it, just like Zooko did here:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36701.html

Thanks,
-Gui

Marc
Re: superblock checksum mismatch after crash, cannot mount
On 2014-08-24 15:48, Chris Murphy wrote:
On Aug 24, 2014, at 10:59 AM, Flash ROM flashromg...@yandex.com wrote:

While it sounds dumb, this strange thing is done to put the partition table in a separate erase block, so it is never read-modify-written when FAT entries are updated. Should something go wrong, the FAT can be recovered from its backup copy, but an erased partition table just sucks. Then, the FAT tables are aligned to fit well around erase block boundaries.

I think you seriously overestimate camera manufacturers' knowledge of the details of flash storage; and any ability to discover it; and any willingness on the part of the flash manufacturer to reveal such underlying details. The whole point of these cards is to completely abstract the reality of the underlying hardware from the application layer - in this case the camera or mobile device using it.

If you really know what you are doing, it is possible to determine the erase block size by looking at device performance timings, with surprisingly high accuracy (assuming you aren't trying to have software do it for you). I've actually done this before on several occasions, with nearly 100% success.

Also, with SDXC, exFAT is now specified. And it has only one FAT - there isn't a backup FAT. So it's even more difficult to recover data from should things go awry filesystem-wise. It's too bad that TFAT didn't catch on, as it would have been great for SD cards if it could be configured to put each FAT on a different erase block.

This said, you can *try* to reformat, BUT no standard OS or firmware formatter will help you with default settings. They can't know the geometry of the underlying NAND and the controller's properties. There is no standard, widely accepted way to get such information from the card, no matter if you use an OS formatter, a camera formatter or whatever. YOU WILL RUIN the factory format (which is crafted in the best possible way) and replace it with another, very likely suboptimal one.
It's recommended by the card manufacturers to reformat the card in each camera it's inserted into. It's the only recommended way to erase the SD card for re-use; they don't recommend selectively deleting images. And it's known that one camera's partition table and formatting can confuse another camera make/model if the card isn't reformatted by that camera.

It's not just cameras that have this issue; a lot of other hardware makes stupid assumptions about the format of media. The first firmware release for the Nintendo Wii, for example, choked if you tried to use an SD card with more than one partition on it, and old desktop versions of Windows won't ever show you anything other than the first partition on an SD card (or most USB storage devices, for that matter).
Most recent stable enough btrfs-tools?
Hello! I am a bit confused about btrfs-progs git repo URLs and branches. What is the latest stuff that is still supposed to work okay?

My /home BTRFS RAID 1 filesystem on two SSDs has an error with btrfs check that btrfs-tools 3.14.1 cannot repair. The repo I found on git.kernel.org

git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git

seems to be stuck at 3.14.2. But well, now I see there is an integration branch, but it's also just at:

commit 7b050795a01acb2bec0db84991b4bc9c8680e275
Author: Chris Mason c...@fb.com
Date: Wed May 28 17:01:39 2014 -0400

    scrub: fix uninit return variable in scrub_progress_cycle

    Signed-off-by: Chris Mason c...@fb.com

Before posting details on this I would like to make sure I'm trying with the most recent stuff. What version do you recommend to try?

Kernel-wise I am on 3.16.1 plus two BTRFS hang / corruption fix patches from this mailing list. But I intend to switch to 3.17-rc2 once it is out.

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: [PATCH v3] Btrfs: fix task hang under heavy compressed write
On 08/15/2014 11:36 AM, Liu Bo wrote: This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. [ great description ] I ran this through tests last week, and an overnight test over the weekend. It's in my for-linus branch now, along with everything else I plan on sending for rc3. Please double check my merge, I had to undo your rebase onto Miao's patches. -chris
Re: [PATCH v3] Btrfs: fix task hang under heavy compressed write
On Mon, Aug 25, 2014 at 10:58:13AM -0400, Chris Mason wrote: On 08/15/2014 11:36 AM, Liu Bo wrote: This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. [ great description ] I ran this through tests last week, and an overnight test over the weekend. It's in my for-linus branch now, along with everything else I plan on sending for rc3. Please double check my merge, I had to undo your rebase onto Miao's patches. Just checked, looks good ;) thanks, -liubo
Re: [PATCH] btrfs: Don't continue mounting when superblock csum mismatches even generation is less than 10.
On Wed, Aug 20, 2014 at 10:34:53AM +0800, Qu Wenruo wrote: Although as mentioned in the reply to David, the main problem is that I found two disk images with crazy values in superblock and wrong csum, but generation is still 4, and ignoring the csum error caused kernel BUG. Can you please share the dump of the broken superblock (btrfs-show-super)? Thanks.
Re: Most recent stable enough btrfs-tools?
On Aug 25, 2014, at 6:00 AM, Martin Steigerwald mar...@lichtvoll.de wrote:

What is the latest stuff that is still supposed to work okay?

I'm new to git, so take this with a grain of salt, but this returns no differences:

git diff mason/master sterba/v3.16.x

So I'd say we're about to see a btrfs-progs release.

mason=git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git
sterba=git://repo.or.cz/btrfs-progs-unstable/devel.git

Chris Murphy
typo in btrfs-progs master/v3.16.x
git diff mason/master sterba/integration-20140729

diff --git a/cmds-scrub.c b/cmds-scrub.c
index 731c5c9..0bf06ee 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1527,16 +1527,16 @@ out:
 static const char * const cmd_scrub_start_usage[] = {
 	"btrfs scrub start [-BdqrRf] [-c ioprio_class -n ioprio_classdata] <path>|<device>",
-	"Start a new scrub. If a scrub is already running, the new one fails.",
+	"Start a new scrub",
 	"",
 	"-B do not background",
 	"-d stats per device (-B only)",
 	"-q be quiet",
 	"-r read only mode",
-	"-R raw print mode, print full data instead of summary",
+	"-R raw print mode, print full data instead of summary

Looks like a missing ", at the end of this line. All other lines end in ",

Chris Murphy
Re: typo in btrfs-progs master/v3.16.x
On Aug 25, 2014, at 4:32 PM, David Sterba dste...@suse.cz wrote:
On Mon, Aug 25, 2014 at 04:09:16PM -0600, Chris Murphy wrote:

 static const char * const cmd_scrub_start_usage[] = {
 	"btrfs scrub start [-BdqrRf] [-c ioprio_class -n ioprio_classdata] <path>|<device>",
-	"Start a new scrub. If a scrub is already running, the new one fails.",
+	"Start a new scrub",
 	"",
 	"-B do not background",
 	"-d stats per device (-B only)",
 	"-q be quiet",
 	"-r read only mode",
-	"-R raw print mode, print full data instead of summary",
+	"-R raw print mode, print full data instead of summary

Looks like a missing ", at the end of this line. All other lines end in ",

Thanks for checking, the v3.16.x version is correct.

Right. I had the diffs reversed from what I thought they were, so I came to the wrong conclusion.

Chris Murphy
Re: [PATCH] Btrfs: fix corruption after write/fsync failure + fsync + log recovery
On Mon, Aug 25, 2014 at 10:43:00AM +0100, Filipe Manana wrote:

While writing to a file, in inode.c:cow_file_range() (and the same applies to submit_compressed_extents()), after reserving an extent for the file data, we create a new extent map for the written range and insert it into the extent map cache. After that, we create an ordered operation, but if it fails (due to a transient/temporary ENOMEM), we return without dropping that extent map, which points to a reserved extent that is freed when we return. A subsequent incremental fsync (when the btrfs inode doesn't have the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and logs a file extent item based on it, which points to a disk extent that doesn't contain valid data - it was freed by us earlier, and at this point it might contain any random/garbage data.

Therefore, if we reach an error condition when cowing a file range after we added the new extent map to the cache, drop it from the cache before returning.

Some sequence of steps that lead to this:

$ mkfs.btrfs -f /dev/sdd
$ mount -o commit= /dev/sdd /mnt
$ cd /mnt

$ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo
$ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096"
$ sync

$ od -t x1 foo
0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
*
0020000

$ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo

# Now this write + fsync fail with -ENOMEM, which was returned by
# btrfs_add_ordered_extent() in inode.c:cow_file_range().

$ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo
$ xfs_io -c "fsync" foo
fsync: Cannot allocate memory

# Now do a new write + fsync, which will succeed. Our previous
# -ENOMEM was a transient/temporary error.
$ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo
$ xfs_io -c "fsync" foo

# Our file content (in page cache) is now:

$ od -t x1 foo
0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
*
0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000

# Now reboot the machine, and mount the fs, so that fsync log replay
# takes place.

# The file content is now weird, in particular the first 8Kb, which
# do not match our data before nor after the sync command above.

$ od -t x1 foo
0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
*
0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee
*
0050000

# In fact these first 4Kb are a duplicate of the last 4kb block.
# The last write got an extent map/file extent item that points to
# the same disk extent that we got in the write+fsync that failed
# with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to
# verify that:

$ btrfs-debug-tree /dev/sdd
(...)
	item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53
		extent data disk byte 12582912 nr 8192
		extent data offset 0 nr 8192 ram 8192
	item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53
		extent data disk byte 0 nr 0
		extent data offset 0 nr 8192 ram 8192
	item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53
		extent data disk byte 12582912 nr 4096
		extent data offset 0 nr 4096 ram 4096

$ umount /dev/sdd
$ btrfsck /dev/sdd
Checking filesystem on /dev/sdd
UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f
checking extents
extent item 12582912 has multiple extent items
ref mismatch on [12582912 4096] extent item 1, found 2
Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192
backpointer mismatch on [12582912 4096]
Errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
root 5 inode 257 errors 1000, some csum missing
found 131074 bytes used err is 1
total csum bytes: 4
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 123404
file data blocks allocated: 274432
 referenced 274432
Btrfs v3.14.1-96-gcc7fd5a-dirty

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/inode.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index c678dea..16e8146 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -792,8 +792,12 @@ retry:
 				ins.offset,