Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem
On 04.08.2016 18:53, Lutz Vieweg wrote:
>
> I was today hit by what I think is probably the same bug:
> A btrfs on a close-to-4TB sized block device, only half filled
> to almost exactly 2 TB, suddenly says "no space left on device"
> upon any attempt to write to it. The filesystem was NOT automatically
> switched to read-only by the kernel, I should mention.
>
> Re-mounting (which is a pain as this filesystem is used for
> $HOMEs of a multitude of active users who I have to kick from
> the server for doing things like re-mounting) removed the symptom
> for now, but from what I can read in linux-btrfs mailing list
> archives, it is pretty likely the symptom will re-appear.
>
> Here are some more details:
>
> Software versions:
>> linux-4.6.1 (vanilla from kernel.org) ...
>
> dmesg output from the time the "no space left on device"-symptom
> appeared:
>
>> [5171203.601620] WARNING: CPU: 4 PID: 23208 at fs/btrfs/inode.c:9261
>> btrfs_destroy_inode+0x263/0x2a0 [btrfs]
> ...
>> [5171230.306037] WARNING: CPU: 18 PID: 12656 at fs/btrfs/extent-tree.c:4233
>> btrfs_free_reserved_data_space_noquota+0xf3/0x100 [btrfs]

Sounds like the bug I hit too.

To fix this you'll need:

crazy@zwerg:~/Work/linux-git$ git show 8b8b08cbf
commit 8b8b08cbfb9021af4b54b4175fc4c51d655aac8c
Author: Chris Mason
Date:   Tue Jul 19 05:52:36 2016 -0700

    Btrfs: fix delalloc accounting after copy_from_user faults

    Commit 56244ef151c3cd11 was almost but not quite enough to fix the
    reservation math after btrfs_copy_from_user returned partial copies.

    Some users are still seeing warnings in btrfs_destroy_inode, and with a
    long enough test run I'm able to trigger them as well.

    This patch fixes the accounting math again, bringing it much closer to
    the way it was before the sectorsize conversion Chandan did. The
    problem is accounting for the offset into the page/sector when we do a
    partial copy. This one just uses the dirty_sectors variable which
    should already be updated properly.

    Signed-off-by: Chris Mason
    cc: sta...@vger.kernel.org # v4.6+

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index f3f61d1..bcfb4a2 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1629,13 +1629,11 @@ again:
 		 * managed to copy.
 		 */
 		if (num_sectors > dirty_sectors) {
-			/*
-			 * we round down because we don't want to count
-			 * any partial blocks actually sent through the
-			 * IO machines
-			 */
-			release_bytes = round_down(release_bytes - copied,
-						   root->sectorsize);
+
+			/* release everything except the sectors we dirtied */
+			release_bytes -= dirty_sectors <<
+					 root->fs_info->sb->s_blocksize_bits;
+
 			if (copied > 0) {
 				spin_lock(&BTRFS_I(inode)->lock);
 				BTRFS_I(inode)->outstanding_extents++;
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
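The difference between the two formulas in the patch can be tried outside the kernel. What follows is a minimal user-space sketch, not kernel code: the helper names are made up for illustration, and the way the number of dirtied sectors is derived (bytes copied plus the offset into the first sector, rounded up) is an assumption based on the commit message's remark about "the offset into the page/sector".

```c
#include <stdint.h>

/* Round x down / up to a multiple of align (align must be non-zero). */
static uint64_t round_down_u64(uint64_t x, uint64_t align)
{
	return x - (x % align);
}

static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
	return round_down_u64(x + align - 1, align);
}

/*
 * Assumption for this sketch: a partial copy of `copied` bytes that starts
 * `sector_offset` bytes into a sector dirties this many whole sectors.
 */
uint64_t dirty_sectors_for(uint64_t copied, uint64_t sector_offset,
			   uint64_t sectorsize)
{
	return round_up_u64(copied + sector_offset, sectorsize) / sectorsize;
}

/* Formula removed by the patch: ignores the offset into the sector. */
uint64_t release_old(uint64_t release_bytes, uint64_t copied,
		     uint64_t sectorsize)
{
	return round_down_u64(release_bytes - copied, sectorsize);
}

/* Formula added by the patch: release everything except dirtied sectors. */
uint64_t release_new(uint64_t release_bytes, uint64_t dirty_sectors,
		     unsigned int blocksize_bits)
{
	return release_bytes - (dirty_sectors << blocksize_bits);
}
```

With 4096-byte sectors, a 12288-byte reservation, a write starting 3000 bytes into a sector and only 2000 bytes actually copied, `dirty_sectors_for(2000, 3000, 4096)` is 2. `release_old(12288, 2000, 4096)` gives back 8192 bytes and keeps only one sector reserved although two were dirtied, while `release_new(12288, 2, 12)` gives back 4096 bytes and keeps both.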
Re: [4.8] btrfs heats my room with lock contention
On 08/04/2016 11:01 PM, Dave Chinner wrote:
> On Thu, Aug 04, 2016 at 10:28:44AM -0400, Chris Mason wrote:
>> On 08/04/2016 02:41 AM, Dave Chinner wrote:
>>> Simple test. 8GB pmem device on a 16p machine:
>>>
>>> # mkfs.btrfs /dev/pmem1
>>> # mount /dev/pmem1 /mnt/scratch
>>> # dbench -t 60 -D /mnt/scratch 16
>>>
>>> And heat your room with the warm air rising from your CPUs. Top
>>> half of the btrfs profile looks like:
>>> .
>>> Performance vs CPU usage is:
>>>
>>> nprocs  throughput  cpu usage
>>> 1       440MB/s     50%
>>> 2       770MB/s     100%
>>> 4       880MB/s     250%
>>> 8       690MB/s     450%
>>> 16      280MB/s     950%
>>>
>>> In comparison, at 8-16 threads ext4 is running at ~2600MB/s and
>>> XFS is running at ~3800MB/s. Even if I throw 300-400 processes at
>>> ext4 and XFS, they only drop to ~1500-2000MB/s as they hit internal
>>> limits.
>>
>> Yes, with dbench btrfs does much much better if you make a subvol
>> per dbench dir. The difference is pretty dramatic. I'm working on
>> it this month, but focusing more on database workloads right now.
>
> You've been giving this answer to lock contention reports for the
> past 6-7 years, Chris. I really don't care about getting big
> benchmark numbers with contrived setups - the "use multiple
> subvolumes" solution is simply not practical for users or their
> workloads. The default config should behave sanely and not
> contribute to global warming like this.

The btree setup that makes lock contention here makes some other
benchmarks faster. Needing to create subvolumes in order to fix
performance problems on dbench is far from ideal, but in production here
the tradeoffs have been worth it.

Basically this one definitely comes up during dbench and fs_mark and
much less often elsewhere. For the workloads that hit this lock
contention, splitting things out into subvolumes hugely reduces metadata
fragmentation on reads. So it's not just CPU we're helping with
subvolumes but spindle time too.

It's true I haven't invested time into guessing when the admin wants to
split on a per-subvolume basis.

Still, I do love the polar bears, so I'll take another shot at the btree
lock.

-chris
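One way to read the dbench table quoted above is throughput per busy CPU. The sketch below only redoes that arithmetic on the quoted figures; the struct and function names are made up for illustration, and treating "100%" as one fully busy CPU is an assumption about how the cpu usage column was measured.

```c
#include <stddef.h>

/* One row of the dbench table from the quoted report. */
struct dbench_row {
	int nprocs;
	double throughput_mb_s;
	double cpu_percent;	/* assumed: 100 == one fully busy CPU */
};

const struct dbench_row dbench_rows[5] = {
	{  1, 440.0,  50.0 },
	{  2, 770.0, 100.0 },
	{  4, 880.0, 250.0 },
	{  8, 690.0, 450.0 },
	{ 16, 280.0, 950.0 },
};

/* MB/s delivered per fully busy CPU for one row. */
double mb_per_cpu(const struct dbench_row *row)
{
	return row->throughput_mb_s / (row->cpu_percent / 100.0);
}

/* Index of the row with the highest absolute throughput. */
size_t peak_row(const struct dbench_row *rows, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (rows[i].throughput_mb_s > rows[best].throughput_mb_s)
			best = i;
	return best;
}
```

On these numbers the single-process run delivers 880 MB/s per busy CPU and the 16-process run about 29 MB/s per busy CPU, and absolute throughput peaks at the 4-process row.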
Re: Crash in btrfs_uuid_tree_iterate during mount
On Fri, Aug 5, 2016 at 6:12 PM, Chris Mason wrote:
>
> On 08/05/2016 07:08 AM, Nikolay Borisov wrote:
>> Hello,
>>
>> Any ideas how come btrfs_path can be all zero, the one in
>> the first slot comes from the increment in btrfs_next_old_item.
>
> Thanks for all the extra details. It really must be this:
>
>	if (ret > 0) {
>		btrfs_release_path(path);
>		ret = btrfs_uuid_iter_rem(root, uuid, key.type,
>					  subid_cpu);
>		if (ret == 0) {
>			/*
>			 * this might look inefficient, but the
>			 * justification is that it is an
>			 * exception that check_func returns 1,
>			 * and that in the regular case only one
>			 * entry per UUID exists.
>			 */
>			goto again_search_slot;
>		}
>		if (ret < 0 && ret != -ENOENT)
>			goto out;
>	}
>	item_size -= sizeof(subid_le);
>	offset += sizeof(subid_le);
>
> We've released the path, which would explain why it's full of NULL. ret
> was ENOENT, so it kept on going, and we fell through to
> btrfs_next_item()
>
> Once the path is released, we should either be searching again or
> exiting. A goto again_search_slot would probably fix it, but I'd want
> to also bump the key so we don't just process the same item over and
> over again.
>
> Can you reproduce this reliably? I'd hate to patch it now and make more
> problems later just because we didn't fully understand the items we were
> tripping over.

Well there are 2 things I can do:

a) Dig more in the crash dump to see whether ret has been saved to the
stack and extract the return value. If your theory is correct I should
see the value of ENOENT.

b) Patch the code to print a warning when btrfs_uuid_iter_rem returns
ENOENT, that way at least we will know that this is happening.

In either case this would take me until at least next week, at which
time I should be able to give more information.

>
> -chris
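The idea discussed above (once the path is released, search again with a bumped key instead of stepping to the next item) can be modelled in user space. This is a toy sketch only, not the actual btrfs code or a proposed patch: `toy_path`, `toy_search_slot`, `toy_iterate` and the two callbacks are hypothetical names, the "tree" is a plain sorted array, and unlike the kernel loop the toy always bumps the key because its array is never modified by a removal.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a search path: which node we are in and which slot. */
struct toy_path {
	const uint64_t *node;
	size_t nritems;
	size_t slot;
};

/* After this the path must not be used until the next search. */
static void toy_release_path(struct toy_path *p)
{
	p->node = NULL;
	p->nritems = 0;
	p->slot = 0;
}

/* Position the path at the first key >= wanted; 0 if found, 1 if none. */
static int toy_search_slot(const uint64_t *keys, size_t n, uint64_t wanted,
			   struct toy_path *p)
{
	size_t i = 0;

	while (i < n && keys[i] < wanted)
		i++;
	p->node = keys;
	p->nritems = n;
	p->slot = i;
	return i < n ? 0 : 1;
}

/* Example callbacks: odd keys count as stale, removal always fails softly. */
int toy_is_odd(uint64_t key) { return (int)(key & 1); }
int toy_remove_enoent(uint64_t key) { (void)key; return -ENOENT; }

/*
 * Visit every key. When an entry is stale, the path is released and the
 * removal is attempted; afterwards the loop never touches the released
 * path but bumps the key and searches again. Returns the number of keys
 * visited, or a negative errno for a hard removal failure.
 */
int toy_iterate(const uint64_t *keys, size_t n,
		int (*is_stale)(uint64_t), int (*remove_entry)(uint64_t))
{
	struct toy_path path = { NULL, 0, 0 };
	uint64_t key = 0;
	int visited = 0;

again_search_slot:
	if (toy_search_slot(keys, n, key, &path) != 0)
		return visited;		/* nothing at or after key: done */

	for (;;) {
		uint64_t cur = path.node[path.slot];

		visited++;
		if (is_stale(cur)) {
			int ret;

			toy_release_path(&path);
			ret = remove_entry(cur);
			if (ret < 0 && ret != -ENOENT)
				return ret;
			if (cur == UINT64_MAX)
				return visited;	/* cannot bump further */
			key = cur + 1;		/* skip the item just handled */
			goto again_search_slot;
		}
		if (++path.slot >= path.nritems)
			return visited;
	}
}
```

With the keys 1 to 5, `toy_is_odd` and `toy_remove_enoent`, each key is visited exactly once and `toy_iterate` returns 5. Without the key bump the search would land on the same stale key again on every pass.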
Re: Input/output error, nothing appended in dmesg
On Saturday, 6 August 2016 0:45:13 (CEST) Tomasz Chmielewski wrote:
> And, miracle cure O_o
>
> # file ./2016-08-02/serverX/syslog.log
> ERROR: cannot read `./2016-08-02/serverX/syslog.log' (Input/output
> error)
>
> # echo 3 > /proc/sys/vm/drop_caches
>
> # file 2016-08-02/serverX/syslog.log
> 2016-08-02/serverX/syslog.log: ASCII text, with very long lines

FWIW, bugs similar to this one were reported in the past:

http://www.spinics.net/lists/linux-btrfs/msg54962.html
http://www.spinics.net/lists/linux-btrfs/msg52371.html
Re: Input/output error, nothing appended in dmesg
On 2016-08-06 00:45, Tomasz Chmielewski wrote:
> And, miracle cure O_o
>
> # file ./2016-08-02/serverX/syslog.log
> ERROR: cannot read `./2016-08-02/serverX/syslog.log' (Input/output error)
>
> # echo 3 > /proc/sys/vm/drop_caches
>
> # file 2016-08-02/serverX/syslog.log
> 2016-08-02/serverX/syslog.log: ASCII text, with very long lines
>
> # cat 2016-08-02/serverX/syslog.log
> (...)

A few mins after the previous "echo 3 > /proc/sys/vm/drop_caches" (this
file is around 1.5 MB and wasn't touched since 2016-06-21):

# file ./2016-06-21/serverY/nginx-dashboard-error.log
./2016-06-21/serverY/nginx-dashboard-error.log: ERROR: cannot read `./2016-06-21/serverY/nginx-dashboard-error.log' (Input/output error)

# echo 3 > /proc/sys/vm/drop_caches

# file ./2016-06-21/serverY/nginx-dashboard-error.log
./2016-06-21/serverY/nginx-dashboard-error.log: ASCII text, with very long lines

# cat ./2016-06-21/serverY/nginx-dashboard-error.log
(...works OK, no corruption...)

Tomasz Chmielewski
https://lxadm.com
Re: Input/output error, nothing appended in dmesg
On 08/05/2016 11:45 AM, Tomasz Chmielewski wrote:
> On 2016-08-06 00:38, Tomasz Chmielewski wrote:
>
>>> Too big for the known problem though. Still, can you btrfs-debug-tree
>>> and just make sure it doesn't have inline items?
>>
>> Hmmm
>>
>> # btrfs-debug-tree /dev/xvdb > /root/debug.tree
>> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
>> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
>> Ignoring transid failure
>> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
>> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
>> Ignoring transid failure
>> print-tree.c:1105: btrfs_print_tree: Assertion failed.
>> btrfs-debug-tree[0x418d99]
>> btrfs-debug-tree(btrfs_print_tree+0x26a)[0x41acf6]
>> btrfs-debug-tree(main+0x9a5)[0x432589]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f2369de0f45]
>> btrfs-debug-tree[0x4070e9]
>
> And, miracle cure O_o
>
> # file ./2016-08-02/serverX/syslog.log
> ERROR: cannot read `./2016-08-02/serverX/syslog.log' (Input/output error)
>
> # echo 3 > /proc/sys/vm/drop_caches
>
> # file 2016-08-02/serverX/syslog.log
> 2016-08-02/serverX/syslog.log: ASCII text, with very long lines
>
> # cat 2016-08-02/serverX/syslog.log
> (...)
>

If you don't already have this commit, please give it a try. Should fix
things up.

commit 8dff9c85341032767d7b519217a79ea04cd676b0
Author: Chris Mason
Date:   Sat Sep 19 11:28:25 2015 -0700

    Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
Re: Input/output error, nothing appended in dmesg
On 2016-08-06 00:40, Chris Mason wrote:
>>> Too big for the known problem though. Still, can you btrfs-debug-tree
>>> and just make sure it doesn't have inline items?
>>
>> Hmmm
>>
>> # btrfs-debug-tree /dev/xvdb > /root/debug.tree
>> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
>> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
>> Ignoring transid failure
>> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
>> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
>> Ignoring transid failure
>> print-tree.c:1105: btrfs_print_tree: Assertion failed.
>> btrfs-debug-tree[0x418d99]
>> btrfs-debug-tree(btrfs_print_tree+0x26a)[0x41acf6]
>> btrfs-debug-tree(main+0x9a5)[0x432589]
>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f2369de0f45]
>> btrfs-debug-tree[0x4070e9]
>
> Looks like the FS is mounted?

It is mounted, yes. Does btrfs-debug-tree need an unmounted FS?

I'm not able to unmount it unfortunately (in the sense that the system
has to keep working).

Tomasz Chmielewski
https://lxadm.com
Re: Input/output error, nothing appended in dmesg
On 2016-08-06 00:38, Tomasz Chmielewski wrote:
>> Too big for the known problem though. Still, can you btrfs-debug-tree
>> and just make sure it doesn't have inline items?
>
> Hmmm
>
> # btrfs-debug-tree /dev/xvdb > /root/debug.tree
> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
> Ignoring transid failure
> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
> Ignoring transid failure
> print-tree.c:1105: btrfs_print_tree: Assertion failed.
> btrfs-debug-tree[0x418d99]
> btrfs-debug-tree(btrfs_print_tree+0x26a)[0x41acf6]
> btrfs-debug-tree(main+0x9a5)[0x432589]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f2369de0f45]
> btrfs-debug-tree[0x4070e9]

And, miracle cure O_o

# file ./2016-08-02/serverX/syslog.log
ERROR: cannot read `./2016-08-02/serverX/syslog.log' (Input/output error)

# echo 3 > /proc/sys/vm/drop_caches

# file 2016-08-02/serverX/syslog.log
2016-08-02/serverX/syslog.log: ASCII text, with very long lines

# cat 2016-08-02/serverX/syslog.log
(...)

Tomasz Chmielewski
https://lxadm.com
Re: Input/output error, nothing appended in dmesg
On 08/05/2016 11:38 AM, Tomasz Chmielewski wrote:
> On 2016-08-06 00:15, Chris Mason wrote:
>
>>>>> # cat 2016-08-02/serverX/syslog.log
>>>>> cat: 2016-08-02/serverX/syslog.log: Input/output error
>>>>
>>>> How big is the file? We had one bug with inline files that might
>>>> have caused this.
>>>
>>> This one's tiny, 158137 bytes.
>>
>> Too big for the known problem though. Still, can you btrfs-debug-tree
>> and just make sure it doesn't have inline items?
>
> Hmmm
>
> # btrfs-debug-tree /dev/xvdb > /root/debug.tree
> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
> parent transid verify failed on 355229302784 wanted 49943295 found 49943301
> Ignoring transid failure
> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
> parent transid verify failed on 355233251328 wanted 49943299 found 49943303
> Ignoring transid failure
> print-tree.c:1105: btrfs_print_tree: Assertion failed.
> btrfs-debug-tree[0x418d99]
> btrfs-debug-tree(btrfs_print_tree+0x26a)[0x41acf6]
> btrfs-debug-tree(main+0x9a5)[0x432589]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f2369de0f45]
> btrfs-debug-tree[0x4070e9]

Looks like the FS is mounted?

-chris
Re: Input/output error, nothing appended in dmesg
On 2016-08-06 00:15, Chris Mason wrote:
>>>> # cat 2016-08-02/serverX/syslog.log
>>>> cat: 2016-08-02/serverX/syslog.log: Input/output error
>>>
>>> How big is the file? We had one bug with inline files that might
>>> have caused this.
>>
>> This one's tiny, 158137 bytes.
>
> Too big for the known problem though. Still, can you btrfs-debug-tree
> and just make sure it doesn't have inline items?

Hmmm

# btrfs-debug-tree /dev/xvdb > /root/debug.tree
parent transid verify failed on 355229302784 wanted 49943295 found 49943301
parent transid verify failed on 355229302784 wanted 49943295 found 49943301
Ignoring transid failure
parent transid verify failed on 355233251328 wanted 49943299 found 49943303
parent transid verify failed on 355233251328 wanted 49943299 found 49943303
Ignoring transid failure
print-tree.c:1105: btrfs_print_tree: Assertion failed.
btrfs-debug-tree[0x418d99]
btrfs-debug-tree(btrfs_print_tree+0x26a)[0x41acf6]
btrfs-debug-tree(main+0x9a5)[0x432589]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f2369de0f45]
btrfs-debug-tree[0x4070e9]

Tomasz Chmielewski
https://lxadm.com
Re: Input/output error, nothing appended in dmesg
On 08/05/2016 10:44 AM, Tomasz Chmielewski wrote:
> On 2016-08-05 23:26, Chris Mason wrote:
>> On 08/05/2016 07:42 AM, Tomasz Chmielewski wrote:
>>> I'm getting occasional (every few weeks) input/output errors on a
>>> btrfs filesystem with compress-force=zlib, running on Amazon EC2,
>>> with 4.5.2 kernel:
>>>
>>> # cat 2016-08-02/serverX/syslog.log
>>> cat: 2016-08-02/serverX/syslog.log: Input/output error
>>
>> How big is the file? We had one bug with inline files that might
>> have caused this.
>
> This one's tiny, 158137 bytes.

Too big for the known problem though. Still, can you btrfs-debug-tree
and just make sure it doesn't have inline items?

-chris
Re: Crash in btrfs_uuid_tree_iterate during mount
On 08/05/2016 07:08 AM, Nikolay Borisov wrote:
> Hello,
>
> Recently I started getting the following crashes on some servers,
> running btrfs:
>
> [340435.480338] BTRFS info (device loop7): disk space caching is enabled
> [340435.480509] BTRFS: has skinny extents
> [340441.716174] BTRFS: checking UUID tree
> [340441.912070] BUG: unable to handle kernel NULL pointer dereference at 0098
> [340441.912463] IP: [] btrfs_uuid_tree_iterate+0xf4/0x2d0 [btrfs]
> [340441.912823] PGD 0
> [340441.913035] Oops: [#1] SMP
> [340441.913302] Modules linked in:
> [340441.916996] CPU: 10 PID: 24990 Comm: btrfs-uuid Tainted: PW O 4.4.14-clouder1 #55
> [340441.917287] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.2 01/16/2015
> [340441.917573] task: 8801b95c1b80 ti: 88034e504000 task.ti: 88034e504000
> [340441.917859] RIP: 0010:[] [] btrfs_uuid_tree_iterate+0xf4/0x2d0 [btrfs]
> [340441.918212] RSP: 0018:88034e507e20 EFLAGS: 00010246
> [340441.918382] RAX: RBX: 1600 RCX: 8800
> [340441.918665] RDX: 0001 RSI: 8801e3abd140 RDI: 88046f027f00
> [340441.918952] RBP: 88034e507ea8 R08: 60fb80001760 R09: a07ac1de
> [340441.919236] R10: e8d41760 R11: ea00078eaf40 R12: 8801b98ab750
> [340441.919521] R13: fffe R14: 8801e3abd140 R15: 880049586000
> [340441.919810] FS: () GS:88047fd4() knlGS:
> [340441.920097] CS: 0010 DS: ES: CR0: 80050033
> [340441.920267] CR2: 0098 CR3: 01c0a000 CR4: 000406e0
> [340441.920554] Stack:
> [340441.920717] 880049586000 8801b98ab750 3f7b00014fc0 8803711dec08
> [340441.921186] a07d0c40 880332342000 0114 1b7088046d7612f8
> [340441.921655] 8cfb42689378e508 70157e0ade97f5d6 8c42689378e5081b 15157e0ade97f5d6
> [340441.922126] Call Trace:
> [340441.922315] [] ? find_live_mirror.isra.18+0xc0/0xc0 [btrfs]
> [340441.922614] [] ? btrfs_uuid_scan_kthread+0x3c0/0x3c0 [btrfs]
> [340441.922917] [] btrfs_uuid_rescan_kthread+0x1b/0x60 [btrfs]
> [340441.923197] [] kthread+0xef/0x110
> [340441.923363] [] ? kthread_park+0x60/0x60
> [340441.923531] [] ret_from_fork+0x3f/0x70
> [340441.923697] [] ? kthread_park+0x60/0x60
> [340441.923863] Code: 0f 86 a0 00 00 00 48 bb 00 00 00 00 00 16 00 00 41 8b 44 24 40 48 b9 00 00 00 00 00 88 ff ff 8d 50 01 49 8b 04 24 41 89 54 24 40 <48> 03 98 98 00 00 00 48 89 d8 48 c1 f8 06 48 c1 e0 0c 3b 54 08
> [340441.927296] RIP [] btrfs_uuid_tree_iterate+0xf4/0x2d0 [btrfs]
> [340441.927641] RSP
> [340441.927806] CR2: 0098
>
>
> a081f774 is in the heavily inlined btrfs_next_item. Here
> are the decoded instructions, right before the crash with annotations:
>
>    0: 0f 86 a0 00 00 00     jbe    0xa6
>    6: 48 bb 00 00 00 00 00  mov    $0x1600,%rbx
>    d: 16 00 00
>   10: 41 8b 44 24 40        mov    0x40(%r12),%eax ; r12 is btrfs_path, eax points to first slot
>   15: 48 b9 00 00 00 00 00  mov    $0x8800,%rcx
>   1c: 88 ff ff
>   1f: 8d 50 01              lea    0x1(%rax),%edx ; incr slot
>   22: 49 8b 04 24           mov    (%r12),%rax ; load first extent_buffer in rax
>   26: 41 89 54 24 40        mov    %edx,0x40(%r12) ; save incremented slot
>   2b:*48 03 98 98 00 00 00  add    0x98(%rax),%rbx <-- trapping instruction ; load the first page from the extent_buffer
>   32: 48 89 d8              mov    %rbx,%rax
>   35: 48 c1 f8 06           sar    $0x6,%rax
>   39: 48 c1 e0 0c           shl    $0xc,%rax
>   3d: 3b                    .byte 0x3b
>   3e: 54                    push   %rsp
>   3f: 08                    .byte 0x8
>
> So as can be seen rax is zero and naturally dereferencing it is
> also zero. What's interesting is the content of the btrfs_path:
>
> struct btrfs_path {
>   nodes = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>   slots = {1, 0, 0, 0, 0, 0, 0, 0},
>   locks = {0, 0, 0, 0, 0, 0, 0, 0},
>   reada = 0,
>   lowest_level = 0,
>   search_for_split = 0,
>   keep_locks = 0,
>   skip_locking = 0,
>   leave_spinning = 0,
>   search_commit_root = 0,
>   need_commit_sem = 0,
>   skip_release_on_error = 0
> }
>
> Any ideas how come btrfs_path can be all zero, the one in
> the first slot comes from the increment in btrfs_next_old_item.

Thanks for all the extra details. It really must be this:

	if (ret > 0) {
		btrfs_release_path(path);
		ret = btrfs_uuid_iter_rem(root, uuid, key.type,
					  subid_cpu);
		if (ret == 0) {
			/*
			 * this might look inefficient, but the
Re: Input/output error, nothing appended in dmesg
On 2016-08-05 23:26, Chris Mason wrote:
> On 08/05/2016 07:42 AM, Tomasz Chmielewski wrote:
>> I'm getting occasional (every few weeks) input/output errors on a
>> btrfs filesystem with compress-force=zlib, running on Amazon EC2,
>> with 4.5.2 kernel:
>>
>> # cat 2016-08-02/serverX/syslog.log
>> cat: 2016-08-02/serverX/syslog.log: Input/output error
>
> How big is the file? We had one bug with inline files that might
> have caused this.

This one's tiny, 158137 bytes.

Tomasz Chmielewski
https://lxadm.com
Re: Input/output error, nothing appended in dmesg
On 08/05/2016 07:42 AM, Tomasz Chmielewski wrote:
> I'm getting occasional (every few weeks) input/output errors on a
> btrfs filesystem with compress-force=zlib, running on Amazon EC2,
> with 4.5.2 kernel:
>
> # cat 2016-08-02/serverX/syslog.log
> cat: 2016-08-02/serverX/syslog.log: Input/output error

How big is the file? We had one bug with inline files that might
have caused this.

-chris
Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem
On 08/05/2016 02:12 PM, Austin S. Hemmelgarn wrote:
> If you stick to single disk

We do, all our btrfs filesystems reside on one single block device,
redundancy is provided by a DRBD layer below.

> don't use quota groups

We don't use any quotas.

> stick to reasonably sized filesystems (not more than a few TB)

We do, currently 4 TB max, because that's the only way to utilize
different physical storage devices for different filesystem instances
such that we can back them up in parallel within reasonable time.

> and avoid a couple of specific unconventional storage configurations below it

Configurations like what?

> The whole issue with
> databases is often a non-issue for desktop users in my experience

Well, try a "cat" on a sqlite file that has been used by some ordinary
desktop software (like a browser) for a year - and you'll experience
horrible performance, due to the extreme amount of fragments.
(Having to manually "de-fragment" a filesystem periodically is something
that I had considered a thing of the past when I started using BSD's hfs
instead of the Amiga FFS in the late 1980s... ;-)

> and if you think VM image
> performance is bad, you should really be looking at using real block storage instead of a file
> (seriously, this will usually get you a bigger performance boost than using ext4 or XFS
> over BTRFS as an underlying filesystem will).

Sure, assigning block devices to each VM would be even better, but also
much less convenient for operations. It's a feature here that any user
can start a new VM instance (without root privileges) at any time, and
that the images used by those VMs are part of the incremental backup
that stores only differences, not "whole files that have been changed".

>> We sure do - actually, the possibility to "run daily backups from a
>> snapshot while write performance remains acceptable" is the one and
>> only reason for me to use btrfs rather than xfs for those $HOME dirs.
>> In every other aspect (stability, performance, suitability for
>> storing VM-images or database-files) xfs wins for me.
>> And the btrfs advantage "file system based snapshot being more
>> performant than block device based snapshot" may fade away
>> with the replacement of magnetic disks with SSDs in the long run.

> I'm going to respond to the two parts of this separately:
> 1. As far as snapshot performance, you'd be surprised. I've got pretty good consumer grade SSD's
> that can do a sustained 250MB/s write speed, which means that to be as fast as a snapshot,
> the data set would have to be less than 25MB

No, I'm talking about LVM snapshots, which utilize Copy-On-Write
on the block device level. Creating such an LVM snapshot is as quick
as creating a btrfs snapshot, regardless of the size.

The only significant drawback of the LVM snapshot is that whenever
data is written to the filesystem, that causes copy operations from
one part of the (currently magnetic) storage to another part, and
that seriously hurts the write performance.

(Of course, it would not be a reasonable option to take a block device
snapshot by first copying all the data on it.)

> 2. As far as snapshots being the only advantage of BTRFS, that's just bogus.
> XFS does have metadata checksumming now, but that provides no protection for
> data, just metadata.

We check for bit-rot on the block device level, DRBD verifies the
integrity of the data by reading from both redundant storage devices
and comparing the checksums, periodically every week.

So far, we never encountered a single bit-rot error, even though
the underlying physical storage devices are "cheap SATA disks".

> XFS also doesn't have transparent compression support

I have no use for that. Disk space is relatively cheap, cheap
enough that we don't bother with RAID-5 or such, but use the
"full redundancy" provided by a shared-nothing DRBD setup.

> filesystems can't be shrunk

I enlarged XFS filesystems multiple times while in use, which
worked well. I never had to shrink a filesystem, and I cannot
imagine how such a use case could occur to me.

> and it stores no backups of any metadata except super-blocks.

Which is fine with me, as redundancy is provided on the block device
level by DRBD.

> While the compression and filesystem shrinking may not be needed in
> your use case, the data integrity features are almost certainly an advantage.

Btrfs sure has some nifty features, and I understand that for some,
features like "subvolumes" or "deduplication" are important.
But a hundred great features cannot make up for a lack of stability,
therefore I would love to see those ENOSPC-related issues resolved
rather than more fancy features being built :-)

Regards,

Lutz Vieweg
Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem
On 2016-08-05 06:56, Lutz Vieweg wrote:
> On 08/04/2016 10:30 PM, Chris Murphy wrote:
>> Keep in mind the list is rather self-selecting for problems. People
>> who aren't having problems are unlikely to post their non-problems to
>> the list.
>
> True, but the number of people inclined to post a bug report to
> the list is also a lot smaller than the number of people who
> experienced problems.
>
> Personally, I know at least 2 Linux users who happened to get a
> btrfs filesystem as part of upgrading to a newer Suse distribution
> on their PC, and both of them experienced trouble with their
> filesystems that caused them to re-install without using btrfs.
> They weren't interested in what filesystem they use enough to
> bother investigating what happened in detail or to issue bug-reports.
>
> I'm afraid that btrfs' reputation has already taken damage from
> the combination of "early deployment as a root filesystem to
> unsuspecting users" and "being at a development stage where
> users are likely to experience trouble at some time".

FWIW, the 'early deployment' thing is an issue of the distributions
themselves, and most people who have come to me personally complaining
about BTRFS have understood this after I explained it to them.

As far as the rest, it's hit or miss whether you have issues. I've been
using BTRFS on all my personal systems since about 3.14, and have had
zero issues with data loss or filesystem corruption (or horrible
performance) since about 3.18 that were actually BTRFS issues (it's
helped me ID a lot of marginal hardware though), and in fact, I had more
issues trying to use ZFS for a year than I've had in the now multiple
years of using BTRFS, and in the case of BTRFS, I was actually able to
fix things. I know quite a few people (and a number of big companies
for that matter) who have been running BTRFS for longer and had fewer
issues too.

The biggest issue is that the risks involved aren't well characterized,
although most filesystems have that same issue.

If you stick to single disk or raid1 mode, don't use quota groups (which
at least SUSE does by default now), stick to reasonably sized
filesystems (not more than a few TB), and avoid a couple of specific
unconventional storage configurations below it, BTRFS works fine. The
whole issue with databases is often a non-issue for desktop users in my
experience, and if you think VM image performance is bad, you should
really be looking at using real block storage instead of a file
(seriously, this will usually get you a bigger performance boost than
using ext4 or XFS over BTRFS as an underlying filesystem will).

>> c. Take some risk and use 4.8 rc1 once it's out. Just make sure to
>> keep backups.
>
> We sure do - actually, the possibility to "run daily backups from a
> snapshot while write performance remains acceptable" is the one and
> only reason for me to use btrfs rather than xfs for those $HOME dirs.
> In every other aspect (stability, performance, suitability for
> storing VM-images or database-files) xfs wins for me.
> And the btrfs advantage "file system based snapshot being more
> performant than block device based snapshot" may fade away
> with the replacement of magnetic disks with SSDs in the long run.

I'm going to respond to the two parts of this separately:

1. As far as snapshot performance, you'd be surprised. I've got pretty
good consumer grade SSDs that can do a sustained 250MB/s write speed,
which means that to be as fast as a snapshot, the data set would have to
be less than 25MB (and that's being generous, snapshots usually take
less than 0.1s to create on my system). Where the turnover point occurs
varies of course based on storage bandwidth, but I don't see it being
very likely that SSDs will obsolete snapshotting any time soon. Even
if disks suddenly get the ability to run at full bandwidth of the link
they're on, a SAS3 disk (12Gbit/s signaling, practical bandwidth of
about 1GB/s) would have a turnover point of about 100MB, and an NVMe
device on a PCIe 4.0 X16 link (31.51GB/s theoretical bandwidth) would
have a turnover point of 3.1GB. In theory, a high-end NVDIMM might be
able to do better than a snapshot, but it probably couldn't get much
faster right now than twice the speed of a PCIe 4.0 X16 link, which
means that it would likely have a turnover point of about 6.2GB. In
comparison, it's not unusual to need a snapshot of a data set in excess
of a terabyte in size.

2. As far as snapshots being the only advantage of BTRFS, that's just
bogus. XFS does have metadata checksumming now, but that provides no
protection for data, just metadata. XFS also doesn't have transparent
compression support, filesystems can't be shrunk, and it stores no
backups of any metadata except super-blocks. While the compression and
filesystem shrinking may not be needed in your use case, the data
integrity features are almost certainly an advantage.
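The turnover figures in the message above follow from one multiplication: link bandwidth times the time a snapshot takes. The sketch below redoes that arithmetic. The function names are made up for illustration, the 0.1 s snapshot time is the figure from the message, and 31510 MB/s is used for a PCIe 4.0 x16 link on the assumption that this is the bandwidth meant, since it is the value that yields the roughly 3.1GB turnover point quoted.

```c
/*
 * Largest data set (in MB) that a plain copy can move in the time a
 * snapshot takes to create. Below this size a full copy is as fast as
 * a snapshot; above it the snapshot wins.
 */
double turnover_mb(double bandwidth_mb_per_s, double snapshot_seconds)
{
	return bandwidth_mb_per_s * snapshot_seconds;
}

/* Seconds needed to copy data_mb megabytes at the given bandwidth. */
double copy_seconds(double data_mb, double bandwidth_mb_per_s)
{
	return data_mb / bandwidth_mb_per_s;
}
```

With a 0.1 s snapshot, `turnover_mb(250, 0.1)` is about 25 MB for the consumer SSD, `turnover_mb(1000, 0.1)` about 100 MB for SAS3, and `turnover_mb(31510, 0.1)` about 3151 MB for the PCIe 4.0 x16 case. For contrast, `copy_seconds(1000000, 250)` shows that copying a 1 TB data set at 250 MB/s takes about 4000 seconds.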
Input/output error, nothing appended in dmesg
I'm getting occasional (every few weeks) input/output errors on a btrfs filesystem with compress-force=zlib, running on Amazon EC2, with 4.5.2 kernel:

# cat 2016-08-02/serverX/syslog.log
cat: 2016-08-02/serverX/syslog.log: Input/output error

Strangely, nothing gets appended in dmesg:

# dmesg -c
#

The filesystem stores mostly remote syslog files (so, all text files, appended to).

Expected?

# btrfs fi show /var/log/remote/
Label: none  uuid: 5cec93a8-7894-41f6-94a4-9d9b58216dd4
        Total devices 1 FS bytes used 146.55GiB
        devid    1 size 200.00GiB used 153.01GiB path /dev/xvdb

# btrfs fi df /var/log/remote/
Data, single: total=149.00GiB, used=144.50GiB
System, single: total=4.00MiB, used=48.00KiB
Metadata, single: total=4.01GiB, used=2.05GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Tomasz Chmielewski
https://lxadm.com

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to stress test raid6 on 122 disk array
On 2016-08-04 17:12, Chris Murphy wrote:
> On Thu, Aug 4, 2016 at 2:51 PM, Martin wrote:
>> Thanks for the benchmark tools and tips on where the issues might be.
>>
>> Is Fedora 24 rawhide preferred over ArchLinux?
>
> I'm not sure what Arch does any differently to their kernels from
> kernel.org kernels. But bugzilla.kernel.org offers a Mainline and
> Fedora drop down for identifying the kernel source tree.

IIRC, they're pretty close to mainline kernels. I don't think they have any patches in the filesystem or block layer code at least, but I may be wrong, it's been a long time since I looked at an Arch kernel.

>> If I want to compile a mainline kernel. Are there anything I need to
>> tune?
>
> Fedora kernels do not have these options set.
>
> # CONFIG_BTRFS_FS_CHECK_INTEGRITY is not set
> # CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set
> # CONFIG_BTRFS_DEBUG is not set
> # CONFIG_BTRFS_ASSERT is not set
>
> The sanity and integrity tests are both compile time and mount time
> options, i.e. it has to be compiled enabled for the mount option to do
> anything. I can't recall any thread where a developer asked a user to
> set any of these options for testing though.

FWIW, I actually have the integrity checking code built in on most kernels I build. I don't often use it, but it has near zero overhead when not enabled, and it's helped me track down lower-level storage configuration issues on occasion.

>> When I do the tests, how do I log the info you would like to see, if I
>> find a bug?
>
> bugzilla.kernel.org for tracking, and then reference the URL for the
> bug with a summary in an email to list is how I usually do it. The
> main thing is going to be the exact reproduce steps. It's also better,
> I think, to have complete dmesg (or journalctl -k) attached to the bug
> report because not all problems are directly related to Btrfs, they
> can have contributing factors elsewhere. And various MTAs, or more
> commonly MUAs, have a tendency to wrap such wide text as found in
> kernel or journald messages.
Aside from kernel messages, the other general stuff you want to have is:

1. Kernel version and userspace tools version (`uname -a` and `btrfs --version`)
2. Any underlying storage configuration if it's not just a plain SSD/HDD or partitions (for example, usage of dm-crypt, LVM, mdadm, and similar things).
3. Output from `btrfs filesystem show` (this can be trimmed to the filesystem that's having the issue).
4. If you can still mount the filesystem, `btrfs filesystem df` output can be helpful.
5. If you can't mount the filesystem, output from `btrfs check` run without any options will usually be asked for.
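The items above can be gathered in one go. The following is only a sketch, assuming the filesystem is still mounted; section and collect_btrfs_report are hypothetical helper names, and the mount point and output file in the usage comment are placeholders. Commands that fail or are missing are noted in the report instead of aborting the collection:

```shell
# Run one command and record its output under a heading.
section() {
    echo "== $* =="
    "$@" 2>&1 || echo "(command failed or not available: $*)"
}

# collect_btrfs_report MOUNTPOINT OUTFILE
# Writes kernel/tools versions, filesystem layout and kernel messages
# for a mounted btrfs filesystem into OUTFILE.
collect_btrfs_report() {
    mnt=$1
    out=$2
    {
        section uname -a
        section btrfs --version
        section btrfs filesystem show "$mnt"
        section btrfs filesystem df "$mnt"
        section dmesg
    } > "$out"
}

# Example (placeholder paths):
#   collect_btrfs_report /mnt/data btrfs-report.txt
```

Underlying storage details (dm-crypt, LVM, mdadm) and, for an unmountable filesystem, the `btrfs check` output still have to be added by hand, since they depend on the setup.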
Crash in btrfs_uuid_tree_iterate during mount
Hello,

Recently I started getting the following crashes on some servers, running btrfs:

[340435.480338] BTRFS info (device loop7): disk space caching is enabled
[340435.480509] BTRFS: has skinny extents
[340441.716174] BTRFS: checking UUID tree
[340441.912070] BUG: unable to handle kernel NULL pointer dereference at 0098
[340441.912463] IP: [] btrfs_uuid_tree_iterate+0xf4/0x2d0 [btrfs]
[340441.912823] PGD 0
[340441.913035] Oops: [#1] SMP
[340441.913302] Modules linked in:
[340441.916996] CPU: 10 PID: 24990 Comm: btrfs-uuid Tainted: PW O 4.4.14-clouder1 #55
[340441.917287] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.2 01/16/2015
[340441.917573] task: 8801b95c1b80 ti: 88034e504000 task.ti: 88034e504000
[340441.917859] RIP: 0010:[] [] btrfs_uuid_tree_iterate+0xf4/0x2d0 [btrfs]
[340441.918212] RSP: 0018:88034e507e20 EFLAGS: 00010246
[340441.918382] RAX: RBX: 1600 RCX: 8800
[340441.918665] RDX: 0001 RSI: 8801e3abd140 RDI: 88046f027f00
[340441.918952] RBP: 88034e507ea8 R08: 60fb80001760 R09: a07ac1de
[340441.919236] R10: e8d41760 R11: ea00078eaf40 R12: 8801b98ab750
[340441.919521] R13: fffe R14: 8801e3abd140 R15: 880049586000
[340441.919810] FS: () GS:88047fd4() knlGS:
[340441.920097] CS: 0010 DS: ES: CR0: 80050033
[340441.920267] CR2: 0098 CR3: 01c0a000 CR4: 000406e0
[340441.920554] Stack:
[340441.920717] 880049586000 8801b98ab750 3f7b00014fc0 8803711dec08
[340441.921186] a07d0c40 880332342000 0114 1b7088046d7612f8
[340441.921655] 8cfb42689378e508 70157e0ade97f5d6 8c42689378e5081b 15157e0ade97f5d6
[340441.922126] Call Trace:
[340441.922315] [] ? find_live_mirror.isra.18+0xc0/0xc0 [btrfs]
[340441.922614] [] ? btrfs_uuid_scan_kthread+0x3c0/0x3c0 [btrfs]
[340441.922917] [] btrfs_uuid_rescan_kthread+0x1b/0x60 [btrfs]
[340441.923197] [] kthread+0xef/0x110
[340441.923363] [] ? kthread_park+0x60/0x60
[340441.923531] [] ret_from_fork+0x3f/0x70
[340441.923697] [] ? kthread_park+0x60/0x60
[340441.923863] Code: 0f 86 a0 00 00 00 48 bb 00 00 00 00 00 16 00 00 41 8b 44 24 40 48 b9 00 00 00 00 00 88 ff ff 8d 50 01 49 8b 04 24 41 89 54 24 40 <48> 03 98 98 00 00 00 48 89 d8 48 c1 f8 06 48 c1 e0 0c 3b 54 08
[340441.927296] RIP [] btrfs_uuid_tree_iterate+0xf4/0x2d0 [btrfs]
[340441.927641] RSP
[340441.927806] CR2: 0098

a081f774 is in the heavily inlined btrfs_next_item. Here are the decoded instructions right before the crash, with annotations:

   0:   0f 86 a0 00 00 00       jbe    0xa6
   6:   48 bb 00 00 00 00 00    mov    $0x1600,%rbx
   d:   16 00 00
  10:   41 8b 44 24 40          mov    0x40(%r12),%eax      ; r12 is btrfs_path, eax points to first slot
  15:   48 b9 00 00 00 00 00    mov    $0x8800,%rcx
  1c:   88 ff ff
  1f:   8d 50 01                lea    0x1(%rax),%edx       ; incr slot
  22:   49 8b 04 24             mov    (%r12),%rax          ; load first extent_buffer in rax
  26:   41 89 54 24 40          mov    %edx,0x40(%r12)      ; save incremented slot
  2b:*  48 03 98 98 00 00 00    add    0x98(%rax),%rbx      <-- trapping instruction ; load the first page from the extent_buffer
  32:   48 89 d8                mov    %rbx,%rax
  35:   48 c1 f8 06             sar    $0x6,%rax
  39:   48 c1 e0 0c             shl    $0xc,%rax
  3d:   3b                      .byte 0x3b
  3e:   54                      push   %rsp
  3f:   08                      .byte 0x8

So as can be seen, rax is zero, and naturally dereferencing it faults. What's interesting is the content of the btrfs_path:

struct btrfs_path {
  nodes = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
  slots = {1, 0, 0, 0, 0, 0, 0, 0},
  locks = {0, 0, 0, 0, 0, 0, 0, 0},
  reada = 0,
  lowest_level = 0,
  search_for_split = 0,
  keep_locks = 0,
  skip_locking = 0,
  leave_spinning = 0,
  search_commit_root = 0,
  need_commit_sem = 0,
  skip_release_on_error = 0
}

Any ideas how btrfs_path can be all zero? The 1 in the first slot comes from the increment in btrfs_next_old_item.

Regards,
Nikolay
Re: 6TB partition, Data only 2TB - aka When you haven't hit the "usual" problem
On 08/04/2016 10:30 PM, Chris Murphy wrote:
> Keep in mind the list is rather self-selecting for problems. People who
> aren't having problems are unlikely to post their non-problems to the
> list.

True, but the number of people inclined to post a bug report to the list is also a lot smaller than the number of people who experienced problems.

Personally, I know at least 2 Linux users who happened to get a btrfs filesystem as part of upgrading to a newer Suse distribution on their PC, and both of them experienced trouble with their filesystems that caused them to re-install without using btrfs. They weren't interested in what filesystem they use enough to bother investigating what happened in detail or to issue bug-reports.

I'm afraid that btrfs' reputation has already taken damage from the combination of "early deployment as a root filesystem to unsuspecting users" and "being at a development stage where users are likely to experience trouble at some time".

> c. Take some risk and use 4.8 rc1 once it's out. Just make sure to keep
> backups.

We sure do - actually, the possibility to "run daily backups from a snapshot while write performance remains acceptable" is the one and only reason for me to use btrfs rather than xfs for those $HOME dirs. In every other aspect (stability, performance, suitability for storing VM-images or database-files) xfs wins for me. And the btrfs advantage "file system based snapshot being more performant than block device based snapshot" may fade away with the replacement of magnetic disks with SSDs in the long run.

Regards,

Lutz Vieweg
Re: [PATCH] Btrfs: check btree node's nritems
On 08/05/16 11:24, Holger Hoffstätte wrote:
> On Wed, 03 Aug 2016 12:57:28 -0700, Liu Bo wrote:
>
>> When btree node (level = 1) has nritems which equals to zero,
>> we can end up with panic due to insert_ptr()'s
>>
>> BUG_ON(slot > nritems);
>>
>> where slot is 1 and nritems is 0, as copy_for_split() calls
>> insert_ptr(.., path->slots[1] + 1, ...);
>>
>> A invalid value results in the whole mess, this adds the check
>> for btree's node nritems so that we stop reading block when
>> when something is wrong.
>>
>> Signed-off-by: Liu Bo
>> ---
>>  fs/btrfs/disk-io.c | 17 +
>>  1 file changed, 17 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 37d1780..a5a22be 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -612,6 +612,20 @@ static noinline int check_leaf(struct btrfs_root *root,
>>  	return 0;
>>  }
>>
>> +static noinline int check_node(struct btrfs_root *root,
>> +			       struct extent_buffer *node)
>> +{
>> +	unsigned long nr = btrfs_header_nritems(node);
>> +
>> +	if (nr <= 0 || nr >= BTRFS_NODEPTRS_PER_BLOCK(root)) {
>> +		btrfs_crit(root->fs_info,
>> +			   "corrupt node: block %llu root %llu nritems %lu\n",
>
> I think the trailing \n can be dropped here, btrfs_crit() already provides
> a proper newline.

On top of that I get a whole bunch of false positives with this patch. Files that are perfectly readable without it now error out, in which case the logged nritems is always 493 - regardless of file or containing subvolume.

Something is fishy here.

-h
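One possible reading of the constant 493, offered as a sketch rather than a diagnosis: assuming a 16KiB nodesize (not stated in the report), a 101-byte struct btrfs_header and a 33-byte struct btrfs_key_ptr, 493 is exactly the number of key pointers that fit in one node. If so, the `nr >= BTRFS_NODEPTRS_PER_BLOCK(root)` comparison would reject every completely full node. nodeptrs_per_block is a hypothetical helper name, and the struct sizes are my understanding of the on-disk format:

```shell
# nodeptrs_per_block NODESIZE
# Mirrors the idea behind BTRFS_NODEPTRS_PER_BLOCK: the space left
# after the node header, divided by the size of one key pointer.
nodeptrs_per_block() {
    nodesize=$1
    header_size=101     # assumed sizeof(struct btrfs_header)
    key_ptr_size=33     # assumed sizeof(struct btrfs_key_ptr)
    echo $(( (nodesize - header_size) / key_ptr_size ))
}

nodeptrs_per_block 16384    # prints 493
```

Under these assumptions a node with nritems equal to 493 is full but valid, which would fit the observation that the value is the same regardless of file or subvolume.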
Re: How to stress test raid6 on 122 disk array
Martin writes:
> The smallest disk of the 122 is 500GB. Is it possible to have btrfs
> see each disk as only e.g. 10GB? That way I can corrupt and resilver
> more disks over a month.

Well, at least you can easily partition the devices for that to happen.

However, I would also suggest that it might be a more useful use of the resources to run many arrays in parallel. I.e. one 6-device raid6, one 20-device raid6, and then perhaps use the rest of the devices for a very large btrfs filesystem? Or, if you have partitioned the devices, the large btrfs volume can also be composed of all the 122 devices; in fact you could even run multiple 122-device raid6s and use a different kind of testing on each. For performance testing you might only exercise one of the file systems at a time, though.

--
http://www.modeemi.fi/~flux/
Re: [PATCH] Btrfs: check btree node's nritems
On Wed, 03 Aug 2016 12:57:28 -0700, Liu Bo wrote:

> When btree node (level = 1) has nritems which equals to zero,
> we can end up with panic due to insert_ptr()'s
>
> BUG_ON(slot > nritems);
>
> where slot is 1 and nritems is 0, as copy_for_split() calls
> insert_ptr(.., path->slots[1] + 1, ...);
>
> A invalid value results in the whole mess, this adds the check
> for btree's node nritems so that we stop reading block when
> when something is wrong.
>
> Signed-off-by: Liu Bo
> ---
>  fs/btrfs/disk-io.c | 17 +
>  1 file changed, 17 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 37d1780..a5a22be 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -612,6 +612,20 @@ static noinline int check_leaf(struct btrfs_root *root,
>  	return 0;
>  }
>
> +static noinline int check_node(struct btrfs_root *root,
> +			       struct extent_buffer *node)
> +{
> +	unsigned long nr = btrfs_header_nritems(node);
> +
> +	if (nr <= 0 || nr >= BTRFS_NODEPTRS_PER_BLOCK(root)) {
> +		btrfs_crit(root->fs_info,
> +			   "corrupt node: block %llu root %llu nritems %lu\n",

I think the trailing \n can be dropped here, btrfs_crit() already provides a proper newline.

-h
Re: [PATCH] generic: test accurate shared extent reporting
On Fri, Aug 05, 2016 at 01:02:12AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 05, 2016 at 03:46:07PM +0800, Eryu Guan wrote:
> > On Fri, Aug 05, 2016 at 12:21:47AM -0700, Darrick J. Wong wrote:
> > > +_count_holes $testdir/file2
> > > +echo "file1 shared extents"
> > > +$XFS_IO_PROG -c 'fiemap -v' $testdir/file1 | awk '{print $5}' | grep '0x.*[2367aAbBfF]...$' -c
> >
> > Missing a command at the end?
>
> Nope, it echoes the number of shared extents (that's what that awk and grep
> globule does), which /should/ be exactly 2.
>
> (Unless I'm missing something?)

Ah, thanks! I saw "-c" at the end and thought it was part of xfs_io command without looking at it carefully.

Thanks,
Eryu
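To illustrate what the awk and grep pipeline counts, here is a small sketch. It assumes that the fifth column of `fiemap -v` output is the hex flags field and that the shared bit is FIEMAP_EXTENT_SHARED (0x2000), which is the bit the character class [2367aAbBfF] selects in the fourth-from-last hex digit. count_shared is a hypothetical helper, and the sample flag values are made up to resemble five extents of which two are shared and the last carries the "last extent" flag:

```shell
# Count lines whose hex flags value has the 0x2000 bit set, i.e. the
# fourth-from-last hex digit is one of 2, 3, 6, 7, a, b or f.
count_shared() {
    grep -c '0x.*[2367aAbBfF]...$'
}

# Made-up flags column: plain, shared, plain, shared, last
printf '%s\n' 0x0 0x2000 0x0 0x2000 0x1 | count_shared    # prints 2
```

The trailing `-c` in the original command is therefore an option to grep (print the number of matching lines), not an argument to xfs_io.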
[PATCH v2] generic: test accurate shared extent reporting
Ensure that we can create a file with a single extent, reflink two
blocks out of the middle of that extent, and the resulting fiemap
reports two shared extents, instead of lazily reporting the entire
huge extent as shared.

v2: add _supported_fs

Signed-off-by: Darrick J. Wong
---
 tests/generic/929     |   90 +
 tests/generic/929.out |   17 +
 tests/generic/group   |    1 +
 3 files changed, 108 insertions(+)
 create mode 100755 tests/generic/929
 create mode 100644 tests/generic/929.out

diff --git a/tests/generic/929 b/tests/generic/929
new file mode 100755
index 000..1871789
--- /dev/null
+++ b/tests/generic/929
@@ -0,0 +1,90 @@
+#! /bin/bash
+# FS QA Test No. 929
+#
+# Check that bmap/fiemap accurately report shared extents.
+#
+#---
+# Copyright (c) 2016 Oracle, Inc. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 7 15
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.*
+	wait
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_supported_os Linux
+_supported_fs generic
+_require_scratch_reflink
+_require_fiemap
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+testdir=$SCRATCH_MNT/test-$seq
+mkdir $testdir
+
+blocks=5
+blksz=65536
+sz=$((blocks * blksz))
+
+echo "Create the original files"
+$XFS_IO_PROG -f -c "falloc 0 $sz" $testdir/file1 >> $seqres.full
+_pwrite_byte 0x61 0 $sz $testdir/file1 >> $seqres.full
+_scratch_cycle_mount
+
+echo "file1 extents and holes"
+_count_extents $testdir/file1
+_count_holes $testdir/file1
+
+_reflink_range $testdir/file1 $blksz $testdir/file2 $((blksz * 3)) $blksz >> $seqres.full
+_reflink_range $testdir/file1 $((blksz * 3)) $testdir/file2 $blksz $blksz >> $seqres.full
+_scratch_cycle_mount
+
+echo "Compare files"
+md5sum $testdir/file1 | _filter_scratch
+md5sum $testdir/file2 | _filter_scratch
+
+echo "file1 extents and holes"
+_count_extents $testdir/file1
+_count_holes $testdir/file1
+echo "file2 extents and holes"
+_count_extents $testdir/file2
+_count_holes $testdir/file2
+echo "file1 shared extents"
+$XFS_IO_PROG -c 'fiemap -v' $testdir/file1 | awk '{print $5}' | grep -c '0x.*[2367aAbBfF]...$'
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/929.out b/tests/generic/929.out
new file mode 100644
index 000..e290f4c
--- /dev/null
+++ b/tests/generic/929.out
@@ -0,0 +1,17 @@
+QA output created by 929
+Format and mount
+Create the original files
+file1 extents and holes
+1
+0
+Compare files
+17af09af790a9b4c79cddf72f6b642cb SCRATCH_MNT/test-929/file1
+79418df9c55ab7f58781cb7b9e7d5d91 SCRATCH_MNT/test-929/file2
+file1 extents and holes
+5
+0
+file2 extents and holes
+2
+2
+file1 shared extents
+2
diff --git a/tests/generic/group b/tests/generic/group
index 18b9775..732f6f6 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -375,3 +375,4 @@
 370 auto quick richacl
 927 auto quick clone
 928 auto quick clone dedupe
+929 auto quick clone
Re: [PATCH] generic: test accurate shared extent reporting
On Fri, Aug 05, 2016 at 03:46:07PM +0800, Eryu Guan wrote:
> On Fri, Aug 05, 2016 at 12:21:47AM -0700, Darrick J. Wong wrote:
> > Ensure that we can create a file with a single extent, reflink two
> > blocks out of the middle of that extent, and the resulting fiemap
> > reports two shared extents, instead of lazily reporting the entire
> > huge extent as shared.
> >
> > Signed-off-by: Darrick J. Wong
> > ---
> >  tests/generic/929     |   89 +
> >  tests/generic/929.out |   17 +
> >  tests/generic/group   |    1 +
> >  3 files changed, 107 insertions(+)
> >  create mode 100755 tests/generic/929
> >  create mode 100644 tests/generic/929.out
> >
> > diff --git a/tests/generic/929 b/tests/generic/929
> > new file mode 100755
> > index 000..9793be0
> > --- /dev/null
> > +++ b/tests/generic/929
> > @@ -0,0 +1,89 @@
> > +#! /bin/bash
> > +# FS QA Test No. 929
> > +#
> > +# Check that bmap/fiemap accurately report shared extents.
> > +#
> > +#---
> > +# Copyright (c) 2016 Oracle, Inc. All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> > +#---
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1 # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 7 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -rf $tmp.*
> > +	wait
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/reflink
> > +
> > +# real QA test starts here
> > +_supported_os Linux
>
> Need "_supported_fs generic"

Ok.

> > +_require_scratch_reflink
> > +_require_fiemap
> > +
> > +echo "Format and mount"
> > +_scratch_mkfs > $seqres.full 2>&1
> > +_scratch_mount >> $seqres.full 2>&1
> > +
> > +testdir=$SCRATCH_MNT/test-$seq
> > +mkdir $testdir
> > +
> > +blocks=5
> > +blksz=65536
> > +sz=$((blocks * blksz))
> > +
> > +echo "Create the original files"
> > +$XFS_IO_PROG -f -c "falloc 0 $sz" $testdir/file1 >> $seqres.full
> > +_pwrite_byte 0x61 0 $sz $testdir/file1 >> $seqres.full
> > +_scratch_cycle_mount
> > +
> > +echo "file1 extents and holes"
> > +_count_extents $testdir/file1
> > +_count_holes $testdir/file1
> > +
> > +_reflink_range $testdir/file1 $blksz $testdir/file2 $((blksz * 3)) $blksz >> $seqres.full
> > +_reflink_range $testdir/file1 $((blksz * 3)) $testdir/file2 $blksz $blksz >> $seqres.full
> > +_scratch_cycle_mount
> > +
> > +echo "Compare files"
> > +md5sum $testdir/file1 | _filter_scratch
> > +md5sum $testdir/file2 | _filter_scratch
> > +
> > +echo "file1 extents and holes"
> > +_count_extents $testdir/file1
> > +_count_holes $testdir/file1
> > +echo "file2 extents and holes"
> > +_count_extents $testdir/file2
> > +_count_holes $testdir/file2
> > +echo "file1 shared extents"
> > +$XFS_IO_PROG -c 'fiemap -v' $testdir/file1 | awk '{print $5}' | grep '0x.*[2367aAbBfF]...$' -c
>
> Missing a command at the end?

Nope, it echoes the number of shared extents (that's what that awk and grep globule does), which /should/ be exactly 2.

(Unless I'm missing something?)

--D

>
> Thanks,
> Eryu
>
> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/generic/929.out b/tests/generic/929.out
> > new file mode 100644
> > index 000..e290f4c
> > --- /dev/null
> > +++ b/tests/generic/929.out
> > @@ -0,0 +1,17 @@
> > +QA output created by 929
> > +Format and mount
> > +Create the original files
> > +file1 extents and holes
> > +1
> > +0
> > +Compare files
> > +17af09af790a9b4c79cddf72f6b642cb SCRATCH_MNT/test-929/file1
> > +79418df9c55ab7f58781cb7b9e7d5d91 SCRATCH_MNT/test-929/file2
> > +file1 extents and holes
> > +5
> > +0
> > +file2 extents and holes
> > +2
> > +2
> > +file1 shared extents
> > +2
> > diff --git a/tests/generic/group b/tests/generic/group
> > index 18b9775..732f6f6 100644
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -375,3 +375,4 @@
> >  370 auto quick richacl
> >  927 auto quick clone
> >  928 auto quick clone dedupe
> > +929 auto quick clone
BTRFS: Transaction aborted (ENOSPC)
B.H.

Hello.

I have a setup with 4 RAID10 arrays, 4 drives each (using md). Device usage is as follows:

# btrfs device usage /storage/bkp1
/dev/md1, ID: 1
   Device size:            10.92TiB
   Device slack:              0.00B
   Data,single:            10.19TiB
   Metadata,RAID1:        199.00GiB
   System,RAID1:            8.00MiB
   Unallocated:           542.79GiB

/dev/md2, ID: 2
   Device size:            10.92TiB
   Device slack:              0.00B
   Data,single:            10.21TiB
   Metadata,RAID1:        181.00GiB
   System,RAID1:            8.00MiB
   Unallocated:           541.80GiB

/dev/md3, ID: 3
   Device size:            10.92TiB
   Device slack:              0.00B
   Data,single:            10.41TiB
   Metadata,RAID1:         65.00GiB
   Unallocated:           457.81GiB

/dev/md4, ID: 4
   Device size:            10.92TiB
   Device slack:              0.00B
   Data,single:             9.89TiB
   Metadata,RAID1:         89.00GiB
   Unallocated:           959.81GiB

Mount options: compress=zlib,commit=60,noatime

This setup is used to store regular backups from 2 different sites (each on a different subvolume with regular snapshots). The backup is done using rsync, as the source storage uses xfs, not btrfs.

This setup has been working excellently for about 7 months. Currently, it has about 100 snapshots in total.

Recently, I've started to face problems with "transaction aborted" messages and the volume going read-only. This happens unexpectedly, after several hours of rsync running. As a precursor, it throws several warnings about tasks blocked for 120 seconds; this is probably connected to the long time required to commit a transaction. After a transaction abort, I reboot the server, then restart the backup, and it seems to continue OK until the next crash.

Scrub didn't find any errors on the volume. I'm unable to run btrfs check as it consumes all of the RAM and crashes.

Any suggestions what's going wrong and how to fix this?

# uname -a
Linux yemot-4u 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
# btrfs --version
btrfs-progs v4.7

Thanks in advance!

--
Moshiach NOW! Moshiach is coming very soon, prepare yourself!
Long live our master, our teacher and our rabbi, King Moshiach, forever and ever!
Re: [PATCH] generic: test accurate shared extent reporting
On Fri, Aug 05, 2016 at 12:21:47AM -0700, Darrick J. Wong wrote:
> Ensure that we can create a file with a single extent, reflink two
> blocks out of the middle of that extent, and the resulting fiemap
> reports two shared extents, instead of lazily reporting the entire
> huge extent as shared.
>
> Signed-off-by: Darrick J. Wong
> ---
>  tests/generic/929     |   89 +
>  tests/generic/929.out |   17 +
>  tests/generic/group   |    1 +
>  3 files changed, 107 insertions(+)
>  create mode 100755 tests/generic/929
>  create mode 100644 tests/generic/929.out
>
> diff --git a/tests/generic/929 b/tests/generic/929
> new file mode 100755
> index 000..9793be0
> --- /dev/null
> +++ b/tests/generic/929
> @@ -0,0 +1,89 @@
> +#! /bin/bash
> +# FS QA Test No. 929
> +#
> +# Check that bmap/fiemap accurately report shared extents.
> +#
> +#---
> +# Copyright (c) 2016 Oracle, Inc. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 7 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -rf $tmp.*
> +	wait
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_os Linux

Need "_supported_fs generic"

> +_require_scratch_reflink
> +_require_fiemap
> +
> +echo "Format and mount"
> +_scratch_mkfs > $seqres.full 2>&1
> +_scratch_mount >> $seqres.full 2>&1
> +
> +testdir=$SCRATCH_MNT/test-$seq
> +mkdir $testdir
> +
> +blocks=5
> +blksz=65536
> +sz=$((blocks * blksz))
> +
> +echo "Create the original files"
> +$XFS_IO_PROG -f -c "falloc 0 $sz" $testdir/file1 >> $seqres.full
> +_pwrite_byte 0x61 0 $sz $testdir/file1 >> $seqres.full
> +_scratch_cycle_mount
> +
> +echo "file1 extents and holes"
> +_count_extents $testdir/file1
> +_count_holes $testdir/file1
> +
> +_reflink_range $testdir/file1 $blksz $testdir/file2 $((blksz * 3)) $blksz >> $seqres.full
> +_reflink_range $testdir/file1 $((blksz * 3)) $testdir/file2 $blksz $blksz >> $seqres.full
> +_scratch_cycle_mount
> +
> +echo "Compare files"
> +md5sum $testdir/file1 | _filter_scratch
> +md5sum $testdir/file2 | _filter_scratch
> +
> +echo "file1 extents and holes"
> +_count_extents $testdir/file1
> +_count_holes $testdir/file1
> +echo "file2 extents and holes"
> +_count_extents $testdir/file2
> +_count_holes $testdir/file2
> +echo "file1 shared extents"
> +$XFS_IO_PROG -c 'fiemap -v' $testdir/file1 | awk '{print $5}' | grep '0x.*[2367aAbBfF]...$' -c

Missing a command at the end?

Thanks,
Eryu

> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/generic/929.out b/tests/generic/929.out
> new file mode 100644
> index 000..e290f4c
> --- /dev/null
> +++ b/tests/generic/929.out
> @@ -0,0 +1,17 @@
> +QA output created by 929
> +Format and mount
> +Create the original files
> +file1 extents and holes
> +1
> +0
> +Compare files
> +17af09af790a9b4c79cddf72f6b642cb SCRATCH_MNT/test-929/file1
> +79418df9c55ab7f58781cb7b9e7d5d91 SCRATCH_MNT/test-929/file2
> +file1 extents and holes
> +5
> +0
> +file2 extents and holes
> +2
> +2
> +file1 shared extents
> +2
> diff --git a/tests/generic/group b/tests/generic/group
> index 18b9775..732f6f6 100644
> --- a/tests/generic/group
> +++ b/tests/generic/group
> @@ -375,3 +375,4 @@
>  370 auto quick richacl
>  927 auto quick clone
>  928 auto quick clone dedupe
> +929 auto quick clone
[PATCH] generic: test accurate shared extent reporting
Ensure that we can create a file with a single extent, reflink two
blocks out of the middle of that extent, and the resulting fiemap
reports two shared extents, instead of lazily reporting the entire
huge extent as shared.

Signed-off-by: Darrick J. Wong
---
 tests/generic/929     |   89 +
 tests/generic/929.out |   17 +
 tests/generic/group   |    1 +
 3 files changed, 107 insertions(+)
 create mode 100755 tests/generic/929
 create mode 100644 tests/generic/929.out

diff --git a/tests/generic/929 b/tests/generic/929
new file mode 100755
index 000..9793be0
--- /dev/null
+++ b/tests/generic/929
@@ -0,0 +1,89 @@
+#! /bin/bash
+# FS QA Test No. 929
+#
+# Check that bmap/fiemap accurately report shared extents.
+#
+#---
+# Copyright (c) 2016 Oracle, Inc. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1 # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 7 15
+
+_cleanup()
+{
+	cd /
+	rm -rf $tmp.*
+	wait
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_supported_os Linux
+_require_scratch_reflink
+_require_fiemap
+
+echo "Format and mount"
+_scratch_mkfs > $seqres.full 2>&1
+_scratch_mount >> $seqres.full 2>&1
+
+testdir=$SCRATCH_MNT/test-$seq
+mkdir $testdir
+
+blocks=5
+blksz=65536
+sz=$((blocks * blksz))
+
+echo "Create the original files"
+$XFS_IO_PROG -f -c "falloc 0 $sz" $testdir/file1 >> $seqres.full
+_pwrite_byte 0x61 0 $sz $testdir/file1 >> $seqres.full
+_scratch_cycle_mount
+
+echo "file1 extents and holes"
+_count_extents $testdir/file1
+_count_holes $testdir/file1
+
+_reflink_range $testdir/file1 $blksz $testdir/file2 $((blksz * 3)) $blksz >> $seqres.full
+_reflink_range $testdir/file1 $((blksz * 3)) $testdir/file2 $blksz $blksz >> $seqres.full
+_scratch_cycle_mount
+
+echo "Compare files"
+md5sum $testdir/file1 | _filter_scratch
+md5sum $testdir/file2 | _filter_scratch
+
+echo "file1 extents and holes"
+_count_extents $testdir/file1
+_count_holes $testdir/file1
+echo "file2 extents and holes"
+_count_extents $testdir/file2
+_count_holes $testdir/file2
+echo "file1 shared extents"
+$XFS_IO_PROG -c 'fiemap -v' $testdir/file1 | awk '{print $5}' | grep '0x.*[2367aAbBfF]...$' -c
+
+# success, all done
+status=0
+exit
diff --git a/tests/generic/929.out b/tests/generic/929.out
new file mode 100644
index 000..e290f4c
--- /dev/null
+++ b/tests/generic/929.out
@@ -0,0 +1,17 @@
+QA output created by 929
+Format and mount
+Create the original files
+file1 extents and holes
+1
+0
+Compare files
+17af09af790a9b4c79cddf72f6b642cb SCRATCH_MNT/test-929/file1
+79418df9c55ab7f58781cb7b9e7d5d91 SCRATCH_MNT/test-929/file2
+file1 extents and holes
+5
+0
+file2 extents and holes
+2
+2
+file1 shared extents
+2
diff --git a/tests/generic/group b/tests/generic/group
index 18b9775..732f6f6 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -375,3 +375,4 @@
 370 auto quick richacl
 927 auto quick clone
 928 auto quick clone dedupe
+929 auto quick clone