Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that, when updating the device tree, we need to allocate a new
chunk via do_chunk_alloc() (while we are holding the device tree root node
locked). This is a legitimate thing for find_free_extent() to do. And the
do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device tree.
This may happen due to the direct call to btrfs_create_pending_block_groups()
that exists in do_chunk_alloc(), or perhaps via the __btrfs_end_transaction()
that find_free_extent() does after it has completed chunk allocation (although
in this case it will use the transaction that already exists in
current->journal_info).

So the deadlock may still happen?

Thanks,
Alex.

> On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason wrote:
>>
>> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
>> > When executing generic/001 in a loop on a ppc64 machine (with both
>> > sectorsize and nodesize set to 64k), the following call trace is observed,
>>
>> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>>
>> -chris
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
Hi Filipe Manana,

My understanding of selecting delayed refs to run or merging them is
far from complete. Can you please explain what will happen in the
following scenario:

1) Ref1 is created, as you explain
2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
   up with an EXTENT_ITEM and an inline extent back ref
3) Ref2 and Ref3 are added
4) Somebody calls __btrfs_run_delayed_refs()

At this point, we cannot merge Ref2 and Ref3, because they might be
referencing tree blocks of completely different trees, thus
comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
run, because we prefer BTRFS_ADD_DELAYED_REF over
BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
now, because we already have Ref1 in the extent tree.

So something should prevent us from running Ref3 before running Ref2.
We should run Ref2 first, which should get rid of the EXTENT_ITEM and
the inline backref, and then run Ref3 to create a new backref with a
proper owner. What is that something?

Can you please point me at what I am missing?

Also, can such a scenario happen in the 3.18 kernel, which still has an
rbtree per ref-head? Looking at the code, I don't see anything
preventing that from happening.

Thanks,
Alex.

On Sun, Oct 25, 2015 at 8:51 PM, wrote:
> From: Filipe Manana
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW 4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 88010c4c8000
> [35908.065201] RIP: 0010:[] [] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:88010c4cbb08 EFLAGS: 00010293
> [35908.065201] RAX: RBX: 88008a661000 RCX:
> [35908.065201] RDX: a04dd58f RSI: 0001 RDI:
> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 88010c4cb9f8
> [35908.065201] R10: R11: 002c R12:
> [35908.065201] R13: 88020a74c578 R14: R15:
> [35908.065201] FS: () GS:88023edc() knlGS:
> [35908.065201] CS: 0010 DS: ES: CR0: 8005003b
> [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: 06e0
> [35908.065201] Stack:
> [35908.065201] 88010c4cbb18 0f37 88020a74c578 88015a408000
> [35908.065201] 880154a44000 0005 88010c4cbbd8
> [35908.065201] a0492b9a 0005
> [35908.065201] Call Trace:
> [35908.065201] [] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201] [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs]
> [35908.065201] [] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs]
> [35908.065201] [] ? join_transaction.isra.10+0x25/0x41f [btrfs]
> [35908.065201] [] ? join_transaction.isra.10+0xa8/0x41f [btrfs]
> [35908.065201] [] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201] [] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201] [] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201] [] btrfs_extent_refs_helper+0x12/0x14 [btrfs]
> [35908.065201] [] process_one_work+0x24a/0x4ac
> [35908.065201] [] worker_thread+0x206/0x2c2
> [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201] [] kthread+0xef/0xf7
> [35908.065201] [] ? kthread_parkme+0x24/0x24
> [35908.065201] [] ret_from_fork+0x3f/0x70
> [35908.065201] [] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP [] insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values.
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Hi Filipe Manana,

Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).

My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid this sort of deadlock.

But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.

Can you please point me at what I am missing?

Thanks,
Alex.

On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval wrote:
> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana
>>
>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>> finishing block group creation"), introduced in 4.2-rc1, the following
>> test was failing due to exhaustion of the system array in the superblock:
>>
>>   #!/bin/bash
>>
>>   truncate -s 100T big.img
>>   mkfs.btrfs big.img
>>   mount -o loop big.img /mnt/loop
>>
>>   num=5
>>   sz=10T
>>   for ((i = 0; i < $num; i++)); do
>>       echo fallocate $i $sz
>>       fallocate -l $sz /mnt/loop/testfile$i
>>   done
>>   btrfs filesystem sync /mnt/loop
>>
>>   for ((i = 0; i < $num; i++)); do
>>       echo rm $i
>>       rm /mnt/loop/testfile$i
>>       btrfs filesystem sync /mnt/loop
>>   done
>>   umount /mnt/loop
>>
>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>> allocation of system block groups. This happened because the test creates
>> a large number of data block groups per transaction and when committing
>> the transaction we start the writeout of the block group caches for all
>> the new (dirty) block groups, which results in pre-allocating space
>> for each block group's free space cache using the same transaction handle.
>> That in turn often leads to creation of more block groups, and all get
>> attached to the new_bgs list of the same transaction handle to the point
>> of getting a list with over 1500 elements, and creation of new block groups
>> leads to the need of reserving space in the chunk block reserve and often
>> creating a new system block group too.
>>
>> So that made us quickly exhaust the chunk block reserve/system space info,
>> because as of the commit mentioned before, we do reserve space for each
>> new block group in the chunk block reserve, unlike before where we would
>> not and would at most allocate one new system block group and therefore
>> would only ensure that there was enough space in the system space info to
>> allocate 1 new block group even if we ended up allocating thousands of
>> new block groups using the same transaction handle. That worked most of
>> the time because the computed required space at check_system_chunk() is
>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>> that all nodes/leaves in a path will be COWed and split) and since the
>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>> it is unlikely that a path needs to be COWed more than once (unless
>> writepages() for the btree inode is called by mm in between) and that
>> compensated for the need of creating any new nodes/leaves in the chunk
>> tree.
>>
>> So fix this by ensuring we don't accumulate a too large list of new block
>> groups in a transaction handle's new_bgs list, inserting/updating the
>> chunk tree for all accumulated new block groups and releasing the unused
>> space from the chunk block reserve whenever the list becomes sufficiently
>> large. This is a generic solution even though the problem currently can
>> only happen when starting the writeout of the free space caches for all
>> dirty block groups (btrfs_start_dirty_block_groups()).
>>
>> Reported-by: Omar Sandoval
>> Signed-off-by: Filipe Manana
>
> Thanks a lot for taking a look.
>
> Tested-by: Omar Sandoval
>
>> ---
>>  fs/btrfs/extent-tree.c | 18 ++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 171312d..07204bf 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4227,6 +4227,24 @@ out:
>>       space_info->chunk_alloc = 0;
>>       spin_unlock(&space_info->lock);
>>       mutex_unlock(&fs_info->chunk_mutex);
>> +     /*
>> +      * When we allocate a new chunk we reserve space in the chunk block
>> +      * reserve
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
On Wed, 2015-12-09 at 13:36 +, Duncan wrote:
> Answering the BTW first, not to my knowledge, and I'd be skeptical. In
> general, btrfs is cowed, and that's the focus. To the extent that nocow
> is necessary for fragmentation/performance reasons, etc, the idea is to
> try to make cow work better in those cases, for example by working on
> autodefrag to make it better at handling large files without the scaling
> issues it currently has above half a gig or so, and thus to confine nocow
> to a smaller and smaller niche use-case, rather than focusing on making
> nocow better.
> Of course it remains to be seen how much better they can do with
> autodefrag, etc, but at this point, there's way more project
> possibilities than people to develop them, so even if they do find they
> can't make cow work much better for these cases, actually working on nocow
> would still be rather far down the list, because there's so many other
> improvement and feature opportunities that will get the focus first.
> Which in practice probably puts it in "it'd be nice, but it's low enough
> priority that we're talking five years out or more, unless of course
> someone else qualified steps up and that's their personal itch they want
> to scratch", territory.

I guess I'll split out my answer on that, in a fresh thread about
checksums for nodatacow later, hoping to attract some more devs there :-)

I think however, again with my naive understanding of how CoW works and
what it inherently implies, that there cannot be a really good solution
to the fragmentation problem for DB/etc. files.

And as such, I'd think that having a checksumming feature for
nodatacow as well, even if it's not perfect, is definitely worth it.

> As for the updated checksum after modification, the problem with that is
> that in the mean time, the checksum wouldn't verify,

Well, one could implement some locking... but I don't see the
general problem here... if the block is still being written (and I count
updating the meta-data, including the checksum, as part of that) it cannot
be read anyway, can it? It may be only half written and the data returned
would be garbage.

> and while btrfs
> could of course keep status in memory during normal operations, that's
> not the problem, the problem is what happens if there's a crash and
> in-memory state vaporizes. In that case, when btrfs remounted, it'd have no
> way of knowing why the checksum didn't match, just that it didn't, and
> would then refuse access to that block in the file, because for all it
> knows, it /is/ a block error.

And this would only happen in the rare cases where something crashes,
where it's quite likely anyway that this no-CoWed block will be
garbage.
I'll talk about that more in the separate thread... so let's move
things there.

> Same here. In fact, my most anticipated feature is N-way-mirroring,

Hmm ... not totally sure about that...
AFAIU, N-way-mirroring is what the currently (wrongly) named RAID1
in btrfs is, i.e. having N replicas of everything on M devices, right?
In other words, not being an N-parity-RAID and not guaranteeing that
*any* N disks could fail, right?

Hmm, I guess that would definitely be nice to have, especially since
then we could have true RAID1, i.e. N=M.

But it's probably rather important for those scenarios where either
resilience matters a lot... and/or those where write speed doesn't but
read speed does, right?

Taking the example of our use case at the university, i.e. the LHC
Tier-2 we run... that would be rather uninteresting.
We typically have storage nodes (and many of them) of, say, 16-24
devices, and based on funding constraints, resilience concerns and IO
performance, we place them in RAID6 (yeah I know, RAID5 is faster, but
even with hot spares in place, in practice it has too often led to lost
RAIDs).

Especially for the bigger nodes, with more disks, we'd rather have an
N-parity RAID, where any N disks can fail... of course performance
considerations may kill that desire again ;)

> It is a big and basic feature, but turning it off isn't the end of the
> world, because then it's still the same level of reliability other
> solutions such as raid generally provide.

Sure... I never meant it as "loss compared to what we already have in
other systems"... but as "loss compared to how awesome[0] btrfs could be ;-)"

> But as it happens, both VM image management and databases tend to come
> with their own integrity management, in part precisely because the
> filesystem could never provide that sort of service.

Well, that's only partially true, to my knowledge.
a) I am not aware that hypervisors do that at all.
b) DBs of course have their journal, but that protects only against
   crashes... not against bad blocks, nor does it help you to decide
   which block is good when you have multiple copies.

> After all, you can always decide not to run it if you're worried
> about the space effects it's going to have
Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas wrote:
> Hi Filipe Manana,
>
> My understanding of selecting delayed refs to run or merging them is
> far from complete. Can you please explain what will happen in the
> following scenario:
>
> 1) Ref1 is created, as you explain
> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
> up with an EXTENT_ITEM and an inline extent back ref
> 3) Ref2 and Ref3 are added
> 4) Somebody calls __btrfs_run_delayed_refs()
>
> At this point, we cannot merge Ref2 and Ref3, because they might be
> referencing tree blocks of completely different trees, thus
> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
> run, because we prefer BTRFS_ADD_DELAYED_REF over
> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
> now, because we already have Ref1 in the extent tree.

No, that won't happen. If the ref (Ref3) is for a different tree, then
it has a different inline extent from Ref1
(lookup_inline_extent_backref returns -ENOENT and not 0).

If they are all for the same tree it means Ref3 is not merged with
Ref2 because they have different seq numbers and a seq value exists in
fs_info->tree_mod_seq_list, and we skip Ref3 through
btrfs_check_delayed_seq() until such seq number goes away from
tree_mod_seq_list. If no seq number exists in tree_mod_seq_list then
we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
running delayed refs, with Ref2 (which removes both refs since one is
"-1" and the other "+1").

IOW, after this regression fix, no behaviour changed from releases before 4.2.

>
> So something should prevent us from running Ref3 before running Ref2.
> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
> the inline backref, and then run Ref3 to create a new backref with a
> proper owner. What is that something?
>
> Can you please point me at what I am missing?
>
> Also, can such scenario happen in 3.18 kernel, which still has an
> rbtree per ref-head?
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas wrote:
> Hi Filipe Manana,
>
> Can't the call to btrfs_create_pending_block_groups() cause a
> deadlock, like in
> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
> call updates the device tree, and we may be calling do_chunk_alloc()
> from find_free_extent() when holding a lock on the device tree root
> (because we want to COW a block of the device tree).
>
> My understanding from Josef's chunk allocator rework
> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
> when allocating a new chunk we do not immediately update the
> device/chunk tree. We keep the new chunk in "pending_chunks" and in
> "new_bgs" on a transaction handle, and we actually update the
> chunk/device tree only when we are done with a particular transaction
> handle. This way we avoid that sort of deadlocks.
>
> But this patch breaks this rule, as it may make us update the
> device/chunk tree in the context of chunk allocation, which is the
> scenario that the rework was meant to avoid.
>
> Can you please point me at what I am missing?

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff

>
> Thanks,
> Alex.
Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
Hi Filipe,

Thank you for the explanation.

On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana wrote:
> On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas wrote:
>> Hi Filipe Manana,
>>
>> My understanding of selecting delayed refs to run or merging them is
>> far from complete. Can you please explain what will happen in the
>> following scenario:
>>
>> 1) Ref1 is created, as you explain
>> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
>> up with an EXTENT_ITEM and an inline extent back ref
>> 3) Ref2 and Ref3 are added
>> 4) Somebody calls __btrfs_run_delayed_refs()
>>
>> At this point, we cannot merge Ref2 and Ref3, because they might be
>> referencing tree blocks of completely different trees, thus
>> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
>> run, because we prefer BTRFS_ADD_DELAYED_REF over
>> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
>> now, because we already have Ref1 in the extent tree.
>
> No, that won't happen. If the ref (Ref3) is for a different tree, than
> it has a different inline extent from Ref1
> (lookup_inline_extent_backref returns -ENOENT and not 0).

Understood. So in this case, we will first add an inline ref for Ref3,
and later drop the Ref1 inline ref via update_inline_extent_backref()
by truncating the EXTENT_ITEM. All in the same transaction.

>
> If they are all for the same tree it means Ref3 is not merged with
> Ref2 because they have different seq numbers and a seq value exist in
> fs_info->tree_mod_seq_list, and we skip Ref3 through
> btrfs_check_delayed_seq() until such seq number goes away from
> tree_mod_seq_list.

Ok, so we won't process this ref-head at all, until the "seq problem"
disappears.

> If no seq number exists in tree_mod_seq_list then
> we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
> running delayed refs, with Ref2 (which removes both refs since one is
> "-1" and the other "+1").

So in this case we don't care that the inline ref we have in the
EXTENT_ITEM was actually inserted on behalf of Ref1, because it's for
the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully
equivalent. Interesting.

Thanks!
Alex.

>
> Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
>>
>> So something should prevent us from running Ref3 before running Ref2.
>> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
>> the inline backref, and then run Ref3 to create a new backref with a
>> proper owner. What is that something?
>>
>> Can you please point me at what am I missing?
>>
>> Also, can such scenario happen in 3.18 kernel, which still has an
>> rbtree per ref-head? Looking at the code, I don't see anything
>> preventing that from happening.
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Oct 25, 2015 at 8:51 PM, wrote:
>>> From: Filipe Manana
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Thank you, Filipe. Now it is clearer.

Fortunately, in my 3.18 kernel do_chunk_alloc() does not call
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But I can hit the issue that this call is meant to fix.

Thanks,
Alex.

On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana wrote:
> On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas wrote:
>> Hi Filipe Manana,
>>
>> Can't the call to btrfs_create_pending_block_groups() cause a
>> deadlock, like in
>> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
>> call updates the device tree, and we may be calling do_chunk_alloc()
>> from find_free_extent() when holding a lock on the device tree root
>> (because we want to COW a block of the device tree).
>>
>> My understanding from Josef's chunk allocator rework
>> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
>> when allocating a new chunk we do not immediately update the
>> device/chunk tree. We keep the new chunk in "pending_chunks" and in
>> "new_bgs" on a transaction handle, and we actually update the
>> chunk/device tree only when we are done with a particular transaction
>> handle. This way we avoid that sort of deadlocks.
>>
>> But this patch breaks this rule, as it may make us update the
>> device/chunk tree in the context of chunk allocation, which is the
>> scenario that the rework was meant to avoid.
>>
>> Can you please point me at what I am missing?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
>
>>
>> Thanks,
>> Alex.
Re: Very various speed of grep operation on btrfs partition
Ok, I ran another experiment. I bought a new HDD and formatted it with btrfs. I also increased the size of the grep data and wrote a bash script which automates the testing:

#!/bin/bash
#For testing on windows machine
#grep_path='/cygdrive/e/Sources/inside'
#For testing on new HDD
#grep_path='/run/media/mikhail/eaa531cd-25f4-4e00-b31f-22665faa9768/sources/inside'
#For testing in real life
grep_path='/home/mikhail/sources/inside'
command="grep -rn 'float:left;display: block;height: 24px;line-height: 1.2em;position: relative;text-align: center;white-space: nowrap;width: 80px;' '$grep_path'"
log_file='res.log'
exec 3>&1 1>>${log_file} 2>&1
while [ 1 = 1 ]
do
    (( count++ ))
    echo "PASS: $count" at `date +"%T"` | tee /dev/fd/3
    echo $command | tee /dev/fd/3
    eval "{ time $command > /dev/null; } |& tee /dev/fd/3"
done

And I got very interesting results:

Linux btrfs with NEW HDD: 6.441s (same result as in the synthetic tests)
Linux btrfs with real data HDD (94% used): 16m52.036s. Very bad, why??? The data is the same as in the first variant.
Windows ntfs NEW HDD: 1m27.643s

I am really disappointed that the results are so bad in real life (home folder). Is it possible to optimise an HDD that is 94% full so that it is as fast as an empty drive? Both hard disks are the same model, ST4000NM0033-9ZM170.

--
Best Regards,
Mike Gavrilov.
Determine is file a reflink or not
Is there a way to view the CoW structure, e.g. to know whether a file is still just a reflink or has been modified? I copied many files from a snapshot with --reflink=always and want to know which files have been modified since the copy. Calculating hash sums seems to take too long.

--
Ivan Sizov
Re: Still not production ready
On Sun, Dec 13, 2015 at 11:35:08PM +0100, Martin Steigerwald wrote: > Hi! > > For me it is still not production ready. Again I ran into: > > btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random > write into big file > https://bugzilla.kernel.org/show_bug.cgi?id=90401 Sorry you're having issues. I haven't seen this before myself. I couldn't find the kernel version you're using in your Email or the bug you filed (quick scan). That's kind of important :) Marc > No matter whether SLES 12 uses it as default for root, no matter whether > Fujitsu and Facebook use it: I will not let this onto any customer machine > without lots and lots of underprovisioning and rigorous free space > monitoring. > Actually I will renew my recommendations in my trainings to be careful with > BTRFS. > > From my experience the monitoring would check for: > > merkaba:~> btrfs fi show /home > Label: 'home' uuid: […] > Total devices 2 FS bytes used 156.31GiB > devid1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home > devid2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home > > If "used" is same as "size" then make big fat alarm. It is not sufficient for > it to happen. It can run for quite some time just fine without any issues, > but > I never have seen a kworker thread using 100% of one core for extended period > of time blocking everything else on the fs without this condition being met. > > > In addition to that last time I tried it aborts scrub any of my BTRFS > filesstems. Reported in another thread here that got completely ignored so > far. I think I could go back to 4.2 kernel to make this work. > > > I am not going to bother to go into more detail on any on this, as I get the > impression that my bug reports and feedback get ignored. So I spare myself > the > time to do this work for now. > > > Only thing I wonder now whether this all could be cause my /home is already > more than one and a half year old. 
Maybe newly created filesystems are > created > in a way that prevents these issues? But it already has a nice global reserve: > > merkaba:~> btrfs fi df / > Data, RAID1: total=27.98GiB, used=24.07GiB > System, RAID1: total=19.00MiB, used=16.00KiB > Metadata, RAID1: total=2.00GiB, used=536.80MiB > GlobalReserve, single: total=192.00MiB, used=0.00B > > > Actually when I see that this free space thing is still not fixed for good I > wonder whether it is fixable at all. Is this an inherent issue of BTRFS or > more generally COW filesystem design? > > I think it got somewhat better. It took much longer to come into that state > again than last time, but still, blocking like this is *no* option for a > *production ready* filesystem. > > > > I am seriously consider to switch to XFS for my production laptop again. > Cause > I never saw any of these free space issues with any of the XFS or Ext4 > filesystems I used in the last 10 years. > > Thanks, > -- > Martin > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: attacking btrfs filesystems via UUID collisions?
On Fri, 2015-12-11 at 16:06 -0700, Chris Murphy wrote: > For anything but a new and empty Btrfs volume What's the influence of the fs being new/empty? > this hypothetical > attack would be a ton easier to do on LVM and mdadm raid because they > have a tiny amount of metadata to spoof compared to a Btrfs volume > with even a little bit of data on it. Uhm I haven't said that other systems properly handle this kind of attack. ;-) Guess that would need to be evaluated... > I think this concern is overblown. I don't think so. Let me give you an example: There is an attack[0] against crypto, where the attacker listens via a smartphone's microphone, and based on the acoustics of a computer where gnupg runs. This is surely not an attack many people would have considered even remotely possible, but in fact it works, at least under lab conditions. I guess the same applies for possible attack vectors like this here. The stronger actual crypto and the strong software gets in terms of classical security holes (buffer overruns and so), the more attackers will try to go alternative ways. > I'm suggesting bitwise identical copies being created is not what is > wanted most of the time, except in edge cases. mhh,.. well there's the VM case, e.g. duplicating a template VM, booting it deploying software. Guess that's already common enough. There are people who want to use btrfs on top of LVM and using the snapshot functionality of that... another use case. Some people may want to use it on top of MD (for whatever reason)... at least in the mirroring RAID case, the kernel would see the same btrfs twice. Apart from that, btrfs should be a general purpose fs, and not just a desktop or server fs. So edge cases like forensics (where it's common that you create bitwise identical images) shouln't be forgotten either. > > >If your workflow requires making an exact copy (for the shelf or > > > for > > > an emergency) then dd might be OK. 
But most often it's used > > > because > > > it's been easy, not because it's a good practice. > > Ufff.. I wouldn't got that far to call something here bad or good > > practice. > > It's not just bad practice, it's sufficiently sloppy that it's very > nearly user sabotage. That this is due to innocent ignorance, and a > long standing practice that's bad advice being handed down from > previous generations doesn't absolve the practice and mean we should > invent esoteric work arounds for what is not a good practice. We have > all sorts of exhibits why it's not a good idea. Well if you don't give any real arguments or technical reasons (apart from "working around software that doesn't handle this well") I consider this just repetition of the baseless claim that long standing practise would be bad. > I disagree. It was due to the rudimentary nature of earlier > filesystems' metadata paradigm that it worked. That's no longer the > case. Well in the end it's of course up to the developers to decide whether this is acceptable or not, but being on the admin/end-user side, I can at least say that not everyone on there would accept "this is no longer the case" as valid explanation when their fs was corrupted or attacked. > Sure, the kernel code should get smarter about refusing to mount in > ambiguous cases, so that a file system isn't nerfed. That shouldn't > happen. But we also need to get away from this idea that dd is > actually an appropriate tool for making a file system copy. Uhm... your view is a bit narrow-sighted... again take the forensics example. But apart from that,... I never said that dd should be the regular tool for people to copy a btrfs image. Typically it would be simply slower than other means. But for some solutions, it may still be the better choice, or at least the only choice implemented right now (e.g. 
I wouldn't know of a hypervisor system that looks at an existing disk image, finds any btrfs in it (possibly "hidden" below further block layers), and cleanly copies the data into a freshly created btrfs image with the same structure). AFAIK, there's not even a solution right now that copies a complete btrfs, with snapshots etc., preserving all reflinks. At least nothing official that works in one command.

Long story short, I think we can agree that - dd or not - corruption or attack vectors shouldn't be possible. And be it just to protect against the btrfs on hardware RAID1 case, which is accidentally switched to JBOD mode...

Cheers,
Chris.

[0] http://www.tau.ac.il/~tromer/papers/acoustic-20131218.pdf
Still not production ready
Hi!

For me it is still not production ready. Again I ran into:

btrfs kworker thread uses up 100% of a Sandybridge core for minutes on random write into big file
https://bugzilla.kernel.org/show_bug.cgi?id=90401

No matter whether SLES 12 uses it as default for root, no matter whether Fujitsu and Facebook use it: I will not let this onto any customer machine without lots and lots of underprovisioning and rigorous free space monitoring. Actually I will renew my recommendations in my trainings to be careful with BTRFS.

From my experience the monitoring would check for:

merkaba:~> btrfs fi show /home
Label: 'home' uuid: […]
Total devices 2 FS bytes used 156.31GiB
devid 1 size 170.00GiB used 164.13GiB path /dev/mapper/msata-home
devid 2 size 170.00GiB used 164.13GiB path /dev/mapper/sata-home

If "used" is the same as "size" then raise a big fat alarm. That condition is not sufficient for the problem to happen. The filesystem can run for quite some time just fine without any issues, but I have never seen a kworker thread using 100% of one core for an extended period of time, blocking everything else on the fs, without this condition being met.

In addition to that, the last time I tried, scrub aborted on all of my BTRFS filesystems. I reported this in another thread here that has been completely ignored so far. I think I could go back to a 4.2 kernel to make this work.

I am not going to bother to go into more detail on any of this, as I get the impression that my bug reports and feedback get ignored. So I spare myself the time to do this work for now.

The only thing I wonder now is whether this all could be because my /home is already more than one and a half years old. Maybe newly created filesystems are created in a way that prevents these issues? But it already has a nice global reserve:

merkaba:~> btrfs fi df /
Data, RAID1: total=27.98GiB, used=24.07GiB
System, RAID1: total=19.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=536.80MiB
GlobalReserve, single: total=192.00MiB, used=0.00B

Actually, when I see that this free space thing is still not fixed for good, I wonder whether it is fixable at all. Is this an inherent issue of BTRFS or, more generally, of COW filesystem design?

I think it got somewhat better. It took much longer to get into that state again than last time, but still, blocking like this is *no* option for a *production ready* filesystem.

I am seriously considering switching to XFS for my production laptop again, because I never saw any of these free space issues with any of the XFS or Ext4 filesystems I used in the last 10 years.

Thanks,
--
Martin
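The monitoring check described above ("used" reaching "size" in the `btrfs fi show` output) could be sketched roughly as follows. This is only an illustration: the function name `check_alloc`, the 95% default threshold and the alarm text are assumptions, not something from the original mail, and the function only parses text on stdin rather than calling btrfs itself.

```shell
#!/bin/sh
# Sketch: flag devices whose allocated ("used") space is close to their size.
# Reads `btrfs fi show`-style text on stdin. The threshold in percent is the
# first argument; the default of 95 is an arbitrary assumption.
check_alloc() {
    awk -v thr="${1:-95}" '
        # Convert e.g. "170.00GiB" to bytes; a bare number is taken as bytes.
        function to_bytes(s,    n, u) {
            n = s + 0
            u = s
            sub(/^[0-9.]+/, "", u)
            if (u == "KiB") return n * 1024
            if (u == "MiB") return n * 1024 ^ 2
            if (u == "GiB") return n * 1024 ^ 3
            if (u == "TiB") return n * 1024 ^ 4
            return n
        }
        /devid/ {
            size = ""; used = ""; path = ""
            # Locate the values by keyword, so column positions do not matter.
            for (i = 1; i < NF; i++) {
                if ($i == "size") size = $(i + 1)
                if ($i == "used") used = $(i + 1)
                if ($i == "path") path = $(i + 1)
            }
            if (size == "" || used == "") next
            pct = 100 * to_bytes(used) / to_bytes(size)
            if (pct >= thr) {
                printf "ALARM %s allocated %.1f%%\n", path, pct
                alarm = 1
            }
        }
        # Exit status 1 if at least one device crossed the threshold.
        END { exit (alarm ? 1 : 0) }
    '
}
```

It would be fed with something like `btrfs fi show /home | check_alloc 95`. As the mail says, a full allocation is a necessary condition for the stall, not a sufficient one, so an alarm here is a warning sign rather than proof of trouble.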
Kernel lockup, might be helpful log.
I've finally finished deleting all those nasty unreliable Seagate drives from my array. During the process I crashed my server - over, and over, and over. Completely gone - screen blank, controls unresponsive, no network activity (no, I don't have root on btrfs - data only). Most annoying, but I think btrfs survived it all somehow - it's scrubbing now.

Meanwhile, I did get lucky: At one crash I happened to be logged in and was able to hit dmesg seconds before it went completely. So what I have here is information that looks like it'll help you track down a rarely-encountered and hard-to-reproduce bug which can cause the system to lock up completely in the event of certain types of hard drive failure. It might be nothing, but perhaps someone will find it of use - because it'd be a tricky one to both reproduce and get a good error report for if it did occur.

I see an 'invalid opcode' error in here, that's pretty unusual - and again it even gives a file name and line number to look at.

The root cause of all my issues is the NCQ issue with Seagate 8TB archive drives, which is Someone Else's Problem - but I think some good can come of this, as these exotic forms of corruption and weird drive semi-failures have revealed ways in which btrfs's error handling could be made more graceful. Meanwhile I remain impressed that btrfs appears to have kept all my data intact even through all these issues.
[11668.697976] BTRFS info (device sde1): relocating block group 5932520046592 flags 17
[11676.977183] BTRFS info (device sde1): found 20 extents
[11686.138376] BTRFS info (device sde1): found 20 extents
[11686.567242] BTRFS info (device sde1): relocating block group 5935741272064 flags 17
[11695.452025] BTRFS info (device sde1): found 17 extents
[11704.627191] BTRFS info (device sde1): found 17 extents
[11705.966792] BTRFS info (device sde1): relocating block group 5938962497536 flags 17
[11715.343790] BTRFS info (device sde1): found 15 extents
[11724.219660] BTRFS info (device sde1): found 15 extents
[11724.910970] BTRFS info (device sde1): relocating block group 5940036239360 flags 17
[11733.289804] BTRFS info (device sde1): found 22 extents
[11741.538676] BTRFS info (device sde1): found 22 extents
[11742.019752] BTRFS info (device sde1): relocating block group 5941109981184 flags 17
[11751.676514] BTRFS info (device sde1): found 14 extents
[11759.404371] ------------[ cut here ]------------
[11759.404439] kernel BUG at ../fs/btrfs/extent-tree.c:1832!
[11759.404514] invalid opcode: [#1] PREEMPT SMP
[11759.404600] Modules linked in: xt_nat nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_conntrack xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables af_packet bridge stp llc iscsi_ibft iscsi_boot_sysfs btrfs xor x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel raid6_pq aesni_intel aes_x86_64 lrw gf128mul iTCO_wdt glue_helper ablk_helper iTCO_vendor_support cryptd pcspkr i2c_i801 ib_mthca lpc_ich tpm_tis 8250_fintek ie31200_edac mfd_core shpchp battery edac_core thermal tpm video fan button processor hid_generic usbhid uas usb_storage amdkfd amd_iommu_v2 radeon igb dca i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
[11759.405914] fb_sys_fops ttm drm xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore usb_common e1000e ptp pps_core fjes vhost_net tun vhost macvtap macvlan sg rpcrdma sunrpc rdma_cm iw_cm ib_ipoib ib_cm ib_sa ib_umad ib_mad ib_core ib_addr
[11759.406328] CPU: 2 PID: 2060 Comm: btrfs Not tainted 4.3.0-2-default #1
[11759.406414] Hardware name: FUJITSU PRIMERGY TX100 S3P/D3009-B1, BIOS V4.6.5.3 R1.10.0 for D3009-B1x 12/18/2012
[11759.406555] task: 88042f832040 ti: 88041cae4000 task.ti: 88041cae4000
[11759.406659] RIP: 0010:[] [] insert_inline_extent_backref+0xc6/0xd0 [btrfs]
[11759.406815] RSP: 0018:88041cae7830 EFLAGS: 00010293
[11759.406889] RAX: RBX: RCX: 0001
[11759.406986] RDX: 8800 RSI: 0001 RDI:
[11759.407085] RBP: 88041cae7890 R08: 4000 R09: 88041cae7748
[11759.407184] R10: R11: 0003 R12: 880412615800
[11759.407283] R13: R14: R15: 8800c92aef50
[11759.407383] FS: 7f2e3b1678c0() GS:88042fd0() knlGS:
[11759.407497] CS: 0010 DS: ES: CR0: 80050033
[11759.407576] CR2: 55f473f59f28 CR3: 0004180be000 CR4: 001406e0
[11759.407675] Stack:
[11759.407706] 0102
[11759.407831] 0001 88041170d800 32b6 88041170d800
[11759.407949] 88030f0203b0 8800c92aef50 0102 88040b22e000
[11759.408069] Call Trace:
[11759.408127]
Re: attacking btrfs filesystems via UUID collisions?
On Sat, 2015-12-12 at 02:34 +0100, S.J. wrote:
> A bit more about the dd-is-bad-topic:
>
> IMHO it doesn't matter at all.

Yes, fully agree.

> a) For this specific problem here, fixing a security problem automatically
> fixes the risk of data corruption because careless cloning+mounting
> (without UUID adjustments) too.
> So, if the user likes to use dd with its disadvantages, like waiting hours to
> copy lots of free space, and bad practice, etc.etc., why should it concern
> the Btrfs developers and/or us here?
>
> b) At wider scope; while Btrfs is more complex than Xfs etc., currently
> there is no other reason why things could go bad when dd'ing something.
> As long as this holds, is there really a place in the official Btrfs
> documentation for telling the users "dd is bad [practice]"?
>

... fully agree as well. :-)

Cheers,
Chris.
Re: bad extent [5993525264384, 5993525280768), type mismatch with chunk
Laurent Bonnaud wrote on 2015/12/11 15:21 +0100:
> On 04/12/2015 01:47, Qu Wenruo wrote:
> > [run btrfsck]
>
> I did that, too, with an old btrfsck version (4.0) and it found the
> following errors. Then I did a btrfsck --repair, and I have been able to
> complete my "du -s" test. The next step will be to run a "btrfs scrub" to
> check if data loss did happen...

Glad to hear that btrfsck --repair can fix it. It seems to be a space cache problem, and normally mounting with -o clear_cache should handle it. But btrfsck --repair should also handle it well.

Thanks,
Qu

> # btrfsck /dev/sdb1
> Checking filesystem on /dev/sdb1
> UUID: f6d4db2e-962b-42db-87b1-35064a4d38e0
> checking extents
> checking free space cache
> block group 314635714560 has wrong amount of free space
> failed to load free space cache for block group 314635714560
> There is no free space entry for 353290420224-353290764288
> There is no free space entry for 353290420224-353827291136
> cache appears valid but isnt 353290420224
> There is no free space entry for 541732175872-541732208640
> There is no free space entry for 541732175872-542268981248
> cache appears valid but isnt 541732110336
> Wanted bytes 32768, found 262144 for off 1008273178624
> Wanted bytes 536625152, found 262144 for off 1008273178624
> cache appears valid but isnt 1008272932864
> block group 1475887497216 has wrong amount of free space
> failed to load free space cache for block group 1475887497216
> block group 1823242977280 has wrong amount of free space
> failed to load free space cache for block group 1823242977280
> There is no free space entry for 1827001073664-1827002810368
> There is no free space entry for 1827001073664-1827537944576
> cache appears valid but isnt 1827001073664
> There is no free space entry for 1969305501696-1969305518080
> There is no free space entry for 1969305501696-1969842290688
> cache appears valid but isnt 1969305419776
> There is no free space entry for 2021381947392-2021381963776
> There is no free space entry for 2021381947392-2021918769152
> cache appears valid but isnt 2021381898240
> There is no free space entry for 2027287478272-2027287724032
> There is no free space entry for 2027287478272-2027824349184
> cache appears valid but isnt 2027287478272
> There is no free space entry for 2143889227776-2143889244160
> There is no free space entry for 2143889227776-2144426000384
> cache appears valid but isnt 2143889129472
> found 1977224107644 bytes used err is -22
> total csum bytes: 1925245108
> total tree bytes: 5773115392
> total fs tree bytes: 3504685056
> total extent tree bytes: 156975104
> btree space waste bytes: 780048699
> file data blocks allocated: 1971884707840
>  referenced 1971875930112
> btrfs-progs v4.0
Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)
Two more on these:

On Thu, 2015-11-26 at 00:33 +, Hugo Mills wrote:
> > 3) When I would actually disable datacow for e.g. a subvolume that
> > holds VMs or DBs... what are all the implications?
> > Obviously no checksumming, but what happens if I snapshot such a
> > subvolume or if I send/receive it?
>
> After snapshotting, modifications are CoWed precisely once, and
> then it reverts to nodatacow again. This means that making a snapshot
> of a nodatacow object will cause it to fragment as writes are made to
> it.

AFAIU, the one that gets fragmented then is the snapshot, right, and the "original" will stay in place where it was? (Which is of course good, because one probably marked it nodatacow to avoid that fragmentation problem on internal writes.)

I'd assume the same happens when I do a reflink cp.

Can one make a copy where one still has atomicity (which I guess implies CoW) but where the destination file isn't heavily fragmented afterwards, i.e. there's some pre-allocation, and then cp really does copy each block (just with everything in the state it had when I started cp, not including any other internal changes made on the source in between)?

And one more: You both said auto-defrag is generally recommended. Does that also apply to SSDs (where we want to avoid unnecessary writes)? It does seem to get enabled when SSD mode is detected. What would it actually do on an SSD?

Cheers,
Chris.
dear developers, can we have notdatacow + checksumming, plz?
(consider that question being asked with that face on: http://goo.gl/LQaOuA) Hey. I've had some discussions on the list these days about not having checksumming with nodatacow (mostly with Hugo and Duncan). They both basically told me it wouldn't be straight possible with CoW, and Duncan thinks it may not be so much necessary, but none of them could give me really hard arguments, why it cannot work (or perhaps I was just too stupid to understand them ^^)... while at the same time I think that it would be generally utmost important to have checksumming (real world examples below). Also, I remember that in 2014, Ted Ts'o told me that there are some plans ongoing to get data checksumming into ext4, with possibly even some guy at RH actually doing it sooner or later. Since these threads were rather admin-work-centric, developers may have skipped it, therefore, I decided to write down some thoughts label them with a more attracting subject and give it some bigger attention. O:-) 1) Motivation why, it makes sense to have checksumming (especially also in the nodatacow case) I think of all major btrfs features I know of (apart from the CoW itself and having things like reflinks), checksumming is perhaps the one that distinguishes it the most from traditional filesystems. Sure we have snapshots, multi-device support and compression - but we could have had that as well with LVM and software/hardware RAID... (and ntfs supported compression IIRC ;) ). Of course, btrfs does all that in a much smarter way, I know, but it's nothing generally new. The *data* checksumming at filesystem level, to my knowledge, is however. Especially that it's always verified. Awesome. :-) When one starts to get a bit deeper into btrfs (from the admin/end-user side) one sooner or later stumbles across the recommendation/need to use nodatacow for certain types of data (DBs, VM images, etc.) 
and the reason, AFAIU, being the inherent fragmentation that comes along with the CoW, which is especially noticeable for those types of files with lots of random internal writes. Now duncan implied, that this could improve in the future, with the auto-defragmentation getting (even) better, defrag becoming usable again for those that do snapshots or reflinked copies and btrfs itself generally maturing more and more. But I kinda wonder to what extent one will be really able to solve that, what seems to me a CoW-inherent "problem",... Even *if* one can make the auto-defrag much smarter, it would still mean that such files, like big DBs, VMs, or scientific datasets that are internally rewritten, may get more or less constantly defragmented. That may be quite undesired... a) for performance reasons (when I consider our research software which often has IO as the limiting factor and where we want as much IO being used by actual programs as possible)... b) SSDs... Not really sure about that; btrfs seems to enable the autodefrag even when an SSD is detected,... what is it doing? Placing the block in a smart way on different chips so that accesses can be better parallelised by the controller? Anyway, (a) is could be already argument enough, not to run solve the problem by a smart-[auto-]defrag, should that actually be implemented. So I think having notdatacow is great and not just a workaround till everything else gets better to handle these cases. Thus, checksumming, which is such a vital feature, should also be possible for that. Duncan also mention that in some of those cases, the integrity is already protected by the application layer, making it less important to have it at the fs-layer. Well, this may be true for file-sharing protocols, but I wouldn't know that relational DBs really do cheksuming of the data. They have journals, of course, but these protect against crashes, not against silent block errors and that like. 
And I wouldn't know that VM hypervisors do checksuming (but perhaps I've just missed that). Here I can give a real-world example, from the Tier-2 that I run for LHC at work/university. We have large amounts of storage (perhaps not as large as what Google and Facebook have, or what the NSA stores about us)... but it's still some ~ 2PiB, or a bit more. That's managed with some special storage management software called dCache. dCache even stores checksums, but per file, so that means for normal reads, these cannot be verified (well technically it's supported, but with our usual file sizes, this is not working) so what remains are scrubs. For The two PiB, we have some... roughly 50-60 nodes, each with something between 12 and 24 disks, usually in either one or two RAID6 volumes, all different kinds of hard disks. And we do run these scrubs quite rarely, since it costs IO that could be used for actual computing jobs (a problem that wouldn't be there with how btrfs calculates the sums on read, the data is then read anyway)... so likely there are even more errors that are just never noticed, because the datasets are removed
[PATCH] btrfs-progs: Format change for btrfs fi df
The GlobalReserve space in 'btrfs fi df' output is confusing for a lot of users. It is not a special chunk type like DATA or METADATA; it is in fact a sub-type of METADATA.

So change the output to skip GlobalReserve by default and add its total to the metadata 'used' value.

Also add a new option '-r|--reserve' to show GlobalReserve, but skip the profile of GlobalReserve.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-filesystem.asciidoc | 8 ++
 cmds-filesystem.c | 51 ++---
 2 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-filesystem.asciidoc b/Documentation/btrfs-filesystem.asciidoc
index 31cd51b..510c23f 100644
--- a/Documentation/btrfs-filesystem.asciidoc
+++ b/Documentation/btrfs-filesystem.asciidoc
@@ -22,6 +22,14 @@ Show space usage information for a mount point.
 +
 `Options`
 +
+-r|--reserve
+also show Global Reserve space info.
++
+Global Reserve space is reserved space from metadata. It's reserved for Btrfs
+metadata COW.
++
+It will be counted as 'used' space in metadata space info.
++
 -b|--raw
 raw numbers in bytes, without the 'B' suffix
 -h|--human-readable
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 25317fa..26e62e0 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -123,6 +123,8 @@ static const char * const filesystem_cmd_group_usage[] = {
 static const char * const cmd_filesystem_df_usage[] = {
 	"btrfs filesystem df [options] <path>",
 	"Show space usage information for a mount point",
+	"",
+	"-r|--reserve show global reserve space info"
 	HELPINFO_UNITS_SHORT_LONG,
 	NULL
 };
@@ -175,12 +177,32 @@ static int get_df(int fd, struct btrfs_ioctl_space_args **sargs_ret)
 	return 0;
 }
 
-static void print_df(struct btrfs_ioctl_space_args *sargs, unsigned unit_mode)
+static void print_df(struct btrfs_ioctl_space_args *sargs, unsigned unit_mode,
+		     int show_reserve)
 {
 	u64 i;
+	u64 global_reserve = 0;
 	struct btrfs_ioctl_space_info *sp = sargs->spaces;
 
+	/* First iterate to get global reserve space size */
 	for (i = 0; i < sargs->total_spaces; i++, sp++) {
+		if (sp->flags & BTRFS_SPACE_INFO_GLOBAL_RSV)
+			global_reserve = sp->total_bytes;
+	}
+
+	for (i = 0, sp = sargs->spaces; i < sargs->total_spaces; i++, sp++) {
+		if (sp->flags & BTRFS_SPACE_INFO_GLOBAL_RSV) {
+			if (!show_reserve)
+				continue;
+			printf(" \\- %s: reserved=%s, used=%s\n",
+				btrfs_group_type_str(sp->flags),
+				pretty_size_mode(sp->total_bytes, unit_mode),
+				pretty_size_mode(sp->used_bytes, unit_mode));
+			continue;
+		}
+
+		if (sp->flags & BTRFS_BLOCK_GROUP_METADATA)
+			sp->used_bytes += global_reserve;
 		printf("%s, %s: total=%s, used=%s\n",
 			btrfs_group_type_str(sp->flags),
 			btrfs_group_profile_str(sp->flags),
@@ -196,14 +218,35 @@ static int cmd_filesystem_df(int argc, char **argv)
 	int fd;
 	char *path;
 	DIR *dirstream = NULL;
+	int show_reserve = 0;
 	unsigned unit_mode;
 
 	unit_mode = get_unit_mode_from_arg(&argc, argv, 1);
 
-	if (argc != 2 || argv[1][0] == '-')
+	while (1) {
+		int c;
+		static const struct option long_options[] = {
+			{ "reserve", no_argument, NULL, 'r'},
+			{ NULL, 0, NULL, 0}
+		};
+
+		c = getopt_long(argc, argv, "r", long_options, NULL);
+		if (c < 0)
+			break;
+		switch (c) {
+		case 'r':
+			show_reserve = 1;
+			break;
+		default:
+			usage(cmd_filesystem_df_usage);
+		}
+	}
+
+	argc = argc - optind;
+	if (check_argc_exact(argc, 1))
 		usage(cmd_filesystem_df_usage);
 
-	path = argv[1];
+	path = argv[optind];
 
 	fd = btrfs_open_dir(path, &dirstream, 1);
 	if (fd < 0)
@@ -212,7 +255,7 @@ static int cmd_filesystem_df(int argc, char **argv)
 	ret = get_df(fd, &sargs);
 
 	if (ret == 0) {
-		print_df(sargs, unit_mode);
+		print_df(sargs, unit_mode, show_reserve);
 		free(sargs);
 	} else {
 		fprintf(stderr, "ERROR: get_df failed %s\n", strerror(-ret));
-- 
2.6.3
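To make the proposed default behaviour of the patch concrete, here is a rough sketch that post-processes existing `btrfs fi df -b` (raw bytes) output in the same way: the GlobalReserve line is dropped and its total is added to the metadata "used" value. The helper name `fold_global_reserve` is made up for illustration, the sketch only handles a plain "Metadata" line (the patch itself tests the metadata flag, which would also cover mixed block groups), and it does not implement the '-r' output.

```shell
#!/bin/sh
# Sketch: mimic the proposed default output of the patch on raw-byte
# `btrfs fi df -b` text read from stdin.
fold_global_reserve() {
    awk '
        { line[NR] = $0 }
        /^GlobalReserve/ {
            # Extract the number after "total=".
            t = $0
            sub(/.*total=/, "", t)
            sub(/,.*/, "", t)
            reserve = t + 0
        }
        END {
            for (n = 1; n <= NR; n++) {
                l = line[n]
                # GlobalReserve is hidden by default in the proposed format.
                if (l ~ /^GlobalReserve/) continue
                if (l ~ /^Metadata/) {
                    u = l
                    sub(/.*used=/, "", u)
                    head = l
                    sub(/used=.*/, "", head)
                    # Count the reserve as used metadata space.
                    l = head "used=" sprintf("%.0f", u + reserve)
                }
                print l
            }
        }
    '
}
```

With a 192MiB reserve (201326592 bytes) and 562876416 bytes of used metadata, the Metadata line would then report used=764203008.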
Re: btrfs check inconsistency with raid1, part 1
Chris Murphy wrote on 2015/12/13 21:16 -0700:
> Part 1= What to do about it? This post.
> Part 2 = How I got here? I'm still working on the write up, so it's
> not yet posted.
>
> Summary:
>
> 2 dev (spinning rust) raid1 for data and metadata.
> kernel 4.2.6, btrfs-progs 4.2.2
>
> btrfs check with devid 1 and 2 present produces thousands of scary
> messages, e.g.
> checksum verify failed on 714189357056 found E4E3BDB6 wanted

Checked the full output.
The interesting part is, the calculated result is always E4E3BDB6, and
wanted is always all 0.

I assume E4E3BDB6 is crc32 of all 0 data.

If there is a full disk dump, it will be much easier to find where the
problem is.
But I'm afraid it won't be possible.

At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong with
the bytenr in the warning.

The good news is, the fs seems to be OK without major problem.
As except the csum error, btrfsck doesn't give other error/warning.

> btrfs check with devid 1 or devid2 separate (the other is missing)
> produces no such scary messages at all, but instead messages e.g.
> failed to load free space cache for block group 357585387520
>
> a. This inconsistency is unexpected.
> b. the 'btrfs check' with combined devices gives no insight to the
> seriousness of "checksum verify failed" messages, or what the solution
> is.

I guess btrfsck did the wrong device assemble, but that's just my personal
guess.
And since I can't reproduce in my test environment, it won't be easy to
find the root cause.

> c. combined or separate+degraded, read-only mounts succeed with no
> errors in user space or dmesg; only normal mount messages happen. With
> both devs ro mounted, I was able to completely btrfs send/receive the
> most recent two ro snapshots comprising 100% (minus stale historical)
> data on the drive, with zero errors reported.
> d. no read-write mount attempt has happened since "the incident" which
> will be detailed in part 2.
>
> Details:
>
> The full devid1&2 btrfs check is long and not very interesting, so
> I've put that here:
> https://drive.google.com/open?id=0B_2Asp8DGjJ9Vjd0VlNYb09LVFU
>
> btrfs-show-super shows some differences, values denoted as
> devid1/devid2. If there's no split, those values are the same for both
> devids.
>
> generation              4924/4923
> root                    714189258752/714188554240
> sys_array_size          129
> chunk_root_generation   4918
> root_level              1
> chunk_root              715141414912
> chunk_root_level        1
> log_root                0
> log_root_transid        0
> log_root_level          0
> total_bytes             1500312748032
> bytes_used              537228206080
> sectorsize              4096
> nodesize                16384
> [snip]
> cache_generation        4924/4923
> uuid_tree_generation    4924/4923
> [snip]
> dev_item.total_bytes    750156374016
> dev_item.bytes_used     541199433728
>
> Perhaps useful, is at the time of "the incident" this volume was rw
> mounted, but was being used by a single process only: btrfs send. So
> it was used as a source. No writes, other than btrfs's own generation
> increment, were happening.
>
> So in theory, this should perhaps be the simplest case of "what do I
> do now?" and even makes me wonder if a normal rw mount should just fix
> this up: either btrfs uses generation 4924 and updates all changes
> from 4923 and 4924 automatically to devid2 so they are now in sync, or
> it automatically discards generation 4924 from devid1, so both devices
> are in sync.
>
> The workload, circumstances of "the incident", the general purpose of
> btrfs, and the likelihood a typical user would never have even become
> aware of "the incident" until much later than I did, makes me strongly
> feel like Btrfs should be able to completely recover from this, with
> just a rw mount and eventually the missync'd generations will
> autocorrect. But I don't know that. And I get essentially no advice
> from btrfs check results.
>
> So. What's the theory in this case? And then does it differ from reality?

Personally speaking, it may be a false alert from btrfsck.
So in this case, I can't provide much help.

If you're brave enough, mount it rw to see what will happen (although it
may mount just OK).

Thanks,
Qu
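Qu's guess that E4E3BDB6 is the checksum of an all-zero block can be checked offline. What follows is a minimal sketch, not btrfs-progs code. It assumes that the metadata checksum is standard CRC-32C (initial value and final XOR of 0xFFFFFFFF), that it is computed over the tree block minus a leading 32-byte checksum field, and that the nodesize is 16384 as in the super dump quoted above. The function names are made up for illustration, and the script deliberately makes no claim about what value it prints.

```python
# Pure-Python CRC-32C (Castagnoli), table driven, reflected polynomial.

def _make_crc32c_table():
    """Build the 256-entry lookup table for CRC-32C."""
    poly = 0x82F63B78  # reflected form of the Castagnoli polynomial
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Standard CRC-32C: init 0xFFFFFFFF, final XOR 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def zeroed_block_csum(nodesize: int = 16384, csum_field: int = 32) -> str:
    """Checksum of an all-zero tree block, skipping the checksum field.

    Both defaults are assumptions taken from the thread (nodesize) and
    from the assumed on-disk layout (32-byte checksum field).
    """
    return "%08X" % crc32c(bytes(nodesize - csum_field))


if __name__ == "__main__":
    # Compare this by eye with the "found" value reported by btrfs check.
    print(zeroed_block_csum())
```

If the printed value matches the "found" value, the blocks being read are most likely all zeroes, which would fit the "wanted is always all 0" observation.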
Re: Will "btrfs check --repair" fix the mounting problem?
Chris Murphy wrote on 2015/12/11 11:24 -0700:
> On Fri, Dec 11, 2015 at 10:50 AM, Ivan Sizov wrote:
>> Btrfs crashes in few seconds after mounting RW.
>> If it's important: the volume was converted from ext4. "ext2_saved"
>> subvolume still presents.
>>
>> dmesg:
>> [ 625.998387] BTRFS info (device sda1): disk space caching is enabled
>> [ 625.998392] BTRFS: has skinny extents
>> [ 627.727708] BTRFS: checking UUID tree
>> [ 708.514128] [ cut here ]
>> [ 708.514161] WARNING: CPU: 1 PID: 2263 at fs/btrfs/extent-tree.c:6255 __btrfs_free_extent.isra.68+0x8c8/0xd70 [btrfs]()
>> [ 708.514164] Modules linked in: bnep bluetooth rfkill ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute bridge ebtable_filter ebtable_nat ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_security ip6table_mangle ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_security iptable_mangle gpio_ich coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device lpc_ich snd_pcm snd_timer ppdev snd i2c_i801 mei_me mei soundcore parport_pc parport shpchp tpm_infineon tpm_tis tpm acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace isofs squashfs btrfs xor raid6_pq i915 hid_logitech_hidpp
>> [ 708.514277] 8021q garp stp video llc mrp i2c_algo_bit drm_kms_helper r8169 uas crc32c_intel drm serio_raw mii hid_logitech_dj usb_storage scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc loop
>> [ 708.514311] CPU: 1 PID: 2263 Comm: btrfs-transacti Not tainted 4.2.3-300.fc23.x86_64 #1
>> [ 708.514315] Hardware name: MSI MS-7636/H55M-P31(MS-7636) , BIOS V1.9 09/14/2010
>> [ 708.514319] f50458a6 880066b03ad8 81771fca
>> [ 708.514326] 880066b03b18 8109e4a6
>> [ 708.514332] 0002 00252f595000 fffe
>> [ 708.514338] Call Trace:
>> [ 708.514349] [] dump_stack+0x45/0x57
>> [ 708.514359] [] warn_slowpath_common+0x86/0xc0
>> [ 708.514365] [] warn_slowpath_null+0x1a/0x20
>> [ 708.514391] [] __btrfs_free_extent.isra.68+0x8c8/0xd70 [btrfs]
>> [ 708.514429] [] ? find_ref_head+0x5a/0x80 [btrfs]
>> [ 708.514456] [] __btrfs_run_delayed_refs+0x998/0x1080 [btrfs]

Not completely sure, but it may be related to a regression in 4.2.
The regression itself is already fixed, but is not backported to 4.2 as
far as I know.

So, I'd recommend to revert to 4.1 and see if things get better.
Fortunately, btrfs already aborted the transaction before things got worse.

>> [ 708.514477] [] btrfs_run_delayed_refs.part.73+0x74/0x270 [btrfs]
>> [ 708.514496] [] btrfs_run_delayed_refs+0x15/0x20 [btrfs]
>> [ 708.514518] [] btrfs_commit_transaction+0x56/0xad0 [btrfs]
>> [ 708.514541] [] transaction_kthread+0x214/0x230 [btrfs]
>> [ 708.514564] [] ? btrfs_cleanup_transaction+0x500/0x500 [btrfs]
>> [ 708.514569] [] kthread+0xd8/0xf0
>> [ 708.514574] [] ? kthread_worker_fn+0x160/0x160
>> [ 708.514581] [] ret_from_fork+0x3f/0x70
>> [ 708.514585] [] ? kthread_worker_fn+0x160/0x160
>> [ 708.514588] ---[ end trace 673f3bf2295a ]---
>> [ 708.514594] BTRFS info (device sda1): leaf 535035904 total ptrs 204 free space 4451
>> [ 708.514598] item 0 key (159696797696 169 0) itemoff 16250 itemsize 33
>> [ 708.514601] extent refs 1 gen 21134 flags 2
>> [ 708.514604] tree block backref root 2
>> [ 708.514609] item 1 key (159696830464 169 1) itemoff 16217 itemsize 33
>> [ 708.514612] extent refs 1 gen 21134 flags 2
>> [ 708.514615] tree block backref root 2
>> [ 708.514619] item 2 key (159696846848 169 0) itemoff 16184 itemsize 33
>> *** a lot of similar messages ***
>> [ 708.516923] item 203 key (159711268864 169 0) itemoff 9551 itemsize 33
>> [ 708.516927] extent refs 1 gen 21082 flags 2
>> [ 708.516930] tree block backref root 384
>> [ 708.516937] BTRFS error (device sda1): unable to find ref byte nr 159708172288 parent 0 root 385 owner 2 offset 0
>> [ 708.516944] [ cut here ]
>> [ 708.516975] WARNING: CPU: 1 PID: 2263 at fs/btrfs/extent-tree.c:6261 __btrfs_free_extent.isra.68+0x92f/0xd70 [btrfs]()
>> [ 708.516979] BTRFS: Transaction aborted (error -2)
>> [ 708.516982] Modules linked in: bnep bluetooth rfkill ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_broute bridge ebtable_filter ebtable_nat ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_security ip6table_mangle ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_security iptable_mangle gpio_ich coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device lpc_ich
btrfs check inconsistency with raid1, part 1
Part 1= What to do about it? This post.
Part 2 = How I got here? I'm still working on the write up, so it's
not yet posted.

Summary:

2 dev (spinning rust) raid1 for data and metadata.
kernel 4.2.6, btrfs-progs 4.2.2

btrfs check with devid 1 and 2 present produces thousands of scary
messages, e.g.
checksum verify failed on 714189357056 found E4E3BDB6 wanted

btrfs check with devid 1 or devid2 separate (the other is missing)
produces no such scary messages at all, but instead messages e.g.
failed to load free space cache for block group 357585387520

a. This inconsistency is unexpected.
b. the 'btrfs check' with combined devices gives no insight to the
seriousness of "checksum verify failed" messages, or what the solution
is.
c. combined or separate+degraded, read-only mounts succeed with no
errors in user space or dmesg; only normal mount messages happen. With
both devs ro mounted, I was able to completely btrfs send/receive the
most recent two ro snapshots comprising 100% (minus stale historical)
data on the drive, with zero errors reported.
d. no read-write mount attempt has happened since "the incident" which
will be detailed in part 2.

Details:

The full devid1&2 btrfs check is long and not very interesting, so
I've put that here:
https://drive.google.com/open?id=0B_2Asp8DGjJ9Vjd0VlNYb09LVFU

btrfs-show-super shows some differences, values denoted as
devid1/devid2. If there's no split, those values are the same for both
devids.

generation              4924/4923
root                    714189258752/714188554240
sys_array_size          129
chunk_root_generation   4918
root_level              1
chunk_root              715141414912
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             1500312748032
bytes_used              537228206080
sectorsize              4096
nodesize                16384
[snip]
cache_generation        4924/4923
uuid_tree_generation    4924/4923
[snip]
dev_item.total_bytes    750156374016
dev_item.bytes_used     541199433728

Perhaps useful, is at the time of "the incident" this volume was rw
mounted, but was being used by a single process only: btrfs send. So
it was used as a source. No writes, other than btrfs's own generation
increment, were happening.

So in theory, this should perhaps be the simplest case of "what do I
do now?" and even makes me wonder if a normal rw mount should just fix
this up: either btrfs uses generation 4924 and updates all changes
from 4923 and 4924 automatically to devid2 so they are now in sync, or
it automatically discards generation 4924 from devid1, so both devices
are in sync.

The workload, circumstances of "the incident", the general purpose of
btrfs, and the likelihood a typical user would never have even become
aware of "the incident" until much later than I did, makes me strongly
feel like Btrfs should be able to completely recover from this, with
just a rw mount and eventually the missync'd generations will
autocorrect. But I don't know that. And I get essentially no advice
from btrfs check results.

So. What's the theory in this case? And then does it differ from reality?

-- 
Chris Murphy
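The devid1/devid2 comparison above was done by eye. A rough sketch of doing the same thing mechanically is given here, assuming the superblock dumps are saved as plain "field value" lines, one field per line. The helper names parse_super and diff_supers are made up for this example, and the sample text contains only values quoted in this message.

```python
def parse_super(text):
    """Turn 'field value' lines from a superblock dump into a dict."""
    fields = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def diff_supers(a, b):
    """Return {field: (value_in_a, value_in_b)} for fields that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Values as reported in this message for devid1 and devid2.
DEVID1 = """\
generation 4924
root 714189258752
chunk_root_generation 4918
cache_generation 4924
uuid_tree_generation 4924
"""

DEVID2 = """\
generation 4923
root 714188554240
chunk_root_generation 4918
cache_generation 4923
uuid_tree_generation 4923
"""

if __name__ == "__main__":
    for field, (v1, v2) in diff_supers(parse_super(DEVID1),
                                       parse_super(DEVID2)).items():
        print("%-24s devid1=%s devid2=%s" % (field, v1, v2))
```

Only the fields that differ are reported, so a one-generation gap between the two devices stands out immediately.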
Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
On Sunday 13 Dec 2015 12:18:55 Alex Lyakas wrote:
> [Resending in plain text, apologies.]
>
> Hi Chandan, Josef, Chris,
>
> I am not sure I understand the fix to the problem.
>
> It may happen that when updating the device tree, we need to allocate a new
> chunk via do_chunk_alloc (while we are holding the device tree root node
> locked). This is a legitimate thing for find_free_extent() to do. And
> do_chunk_alloc() call may lead to call to
> btrfs_create_pending_block_groups(), which will try to update the device
> tree. This may happen due to direct call to
> btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
> perhaps by __btrfs_end_transaction() that find_free_extent() does after it
> completed chunk allocation (although in this case it will use the
> transaction that already exists in current->journal_info).
> So the deadlock still may happen?

Hello Alex,

The "global block reservation" (see btrfs_fs_info->global_block_rsv) aims to
solve this problem. I don't claim to have understood the behaviour of
global_block_rsv completely. However, Global block reservation makes sure
that we have enough free space reserved (see update_global_block_rsv()) for
future operations on,
- Extent tree
- Checksum tree
- Device tree
- Tree root tree and
- Quota tree.

Tasks changing the device tree should get their space requirements satisfied
from the global block reservation. Hence such changes to the device tree
should not end up forcing find_free_extent() to allocate a new chunk.

-- 
chandan
Re: Still not production ready
Qu Wenruo posted on Mon, 14 Dec 2015 10:08:16 +0800 as excerpted:

> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>> Hi!
>>
>> For me it is still not production ready.
>
> Yes, this is the *FACT* and not everyone has a good reason to deny it.

In the above sentence, I /think/ you (Qu) agree with Martin (and I) that
btrfs shouldn't be considered production ready... yet, and the first part
of the sentence makes it very clear that you feel strongly about the
*FACT*, but the second half of the sentence (after *FACT*) doesn't parse
well in English, thus leaving the entire sentence open to interpretation,
tho it's obvious either way that you feel strongly about it. =:^\

At the risk of getting it completely wrong, what I /think/ you meant to
say is (as expanded in typically Duncan fashion =:^)...

Yes, this is the *FACT*, though some people have reasons to deny it.

Presumably, said reasons would include the fact that various distros are
trying to sell enterprise support contracts to customers very eager to
have the features that btrfs provides, and said customers are willing to
pay for assurances that the solutions they're buying are "production
ready", whether that's actually the case or not, presumably because said
payment is (in practice) simply ensuring there's someone else to pin the
blame on if things go bad.

And the demonstration of that would be the continued fact that people
otherwise unnecessarily continue to pay rather large sums of money for
that very assurance, when in practice, they'd get equal or better support
not worrying about that payment, but instead actually making use of
free-of-cost resources such as this list.

[Linguistic analysis, see frequent discussion of this topic at Language
Log, which I happen to subscribe to as I find this sort of thing
interesting, for more commentary and examples of the same general issue:
http://languagelog.net ]

The problem with the sentence as originally written, is that English
doesn't deal well with multi-negation, sometimes considering each negation
an inversion of the previous (as do most programming languages and thus
programmers), while other times or as read/heard/interpreted by others
repeated negation may be considered a strengthening of the original
negation.

Regardless, mis-negation due to speaker/writer confusion is quite common
even among native English speakers/writers.

The negating words in question here are "not" and "deny". If you will
note, my rewrite kept "deny", but rewrote the "not" out of the sentence,
so there's only one negative to worry about, making the meaning much
clearer as the reader's mind isn't left trying to figure out what the
speaker meant with the double-negative (mistake? deliberate canceling out
of the first negative with the second? deliberate intensifier?) and thus
unable to be sure one way or the other what was meant.

And just in case there would have been doubt, the explanation then makes
doubly obvious what I think your intent was by expanding on it. Of course
that's easy to do as I entirely agree.

OTOH if I'm mistaken as to your intent and you meant it the other way...
well then you'll need to do the explaining as then the implication is that
some people have good reasons to deny it and you agree with them, but
without further expansion, I wouldn't know where you're trying to go with
that claim.

Just in case there's any doubt left of my own opinion on the original
claim of not production ready in the above discussion, let me be explicit:

I (too) agree with Martin (and I think with Qu) that btrfs isn't yet
production ready. But I don't believe you'll find many on the list taking
issue with that, as I think everybody on-list agrees, btrfs /isn't/
production ready. Certainly pretty much just that has been repeatedly
stated in individualized style by many posters including myself, and I've
yet to see anyone take serious issue with it.

>> No matter whether SLES 12 uses it as default for root, no matter
>> whether Fujitsu and Facebook use it: I will not let this onto any
>> customer machine without lots and lots of underprovisioning and
>> rigorous free space monitoring.
>> Actually I will renew my recommendations in my trainings to be careful
>> with BTRFS.

... And were I to put money on it, my money would be on every regular
on-list poster 100% agreeing with that. =:^)

>>
>> From my experience the monitoring would check for:
>>
>> merkaba:~> btrfs fi show /home
>> Label: 'home' uuid: […]
>>   Total devices 2 FS bytes used 156.31GiB
>>   devid    1 size 170.00GiB used 164.13GiB path /dev/[path1]
>>   devid    2 size 170.00GiB used 164.13GiB path /dev/[path2]
>>
>> If "used" is same as "size" then make big fat alarm. It is not
>> sufficient for it to happen. It can run for quite some time just fine
>> without any issues, but I never have seen a kworker thread using 100%
>> of one core for extended period of time
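Martin's monitoring rule ('If "used" is same as "size" then make big fat alarm') is easy to automate. The sketch below is only an illustration: it assumes the `btrfs fi show` text has already been captured as a string, it assumes the binary unit suffixes shown in the quoted output, and the function names are invented here. As Martin notes, a fully allocated device is a warning sign rather than proof of trouble.

```python
import re

# Binary unit suffixes as they appear in the quoted output.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

_DEVID_RE = re.compile(
    r"devid\s*(\d+)\s+size\s+(\S+)\s+used\s+(\S+)\s+path\s+(\S+)"
)


def to_bytes(text):
    """Convert a string such as '170.00GiB' to a byte count."""
    m = re.fullmatch(r"([0-9.]+)([KMGT]iB|B)", text)
    if m is None:
        raise ValueError("unrecognised size: %r" % text)
    return int(float(m.group(1)) * _UNITS[m.group(2)])


def fully_allocated_devices(fi_show_output):
    """Return (devid, path) for every device whose 'used' reaches 'size'."""
    alarms = []
    for m in _DEVID_RE.finditer(fi_show_output):
        devid, size, used, path = m.groups()
        if to_bytes(used) >= to_bytes(size):
            alarms.append((int(devid), path))
    return alarms


# Output as quoted in the message above (paths were redacted there).
SAMPLE = """\
Label: 'home' uuid: [...]
  Total devices 2 FS bytes used 156.31GiB
  devid    1 size 170.00GiB used 164.13GiB path /dev/[path1]
  devid    2 size 170.00GiB used 164.13GiB path /dev/[path2]
"""

if __name__ == "__main__":
    for devid, path in fully_allocated_devices(SAMPLE):
        print("ALARM: devid %d (%s) is fully allocated" % (devid, path))
```

With the sample above nothing is printed, since 164.13GiB is still below 170.00GiB on both devices.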
Re: Kernel lockup, might be helpful log.
Birdsarenice posted on Sun, 13 Dec 2015 22:55:19 + as excerpted:

> Meanwhile, I did get lucky: At one crash I happened to be logged in and
> was able to hit dmesg seconds before it went completely. So what I have
> here is information that looks like it'll help you track down a
> rarely-encountered and hard-to-reproduce bug which can cause the system
> to lock up completely in event of certain types of hard drive failure.
> It might be nothing, but perhaps someone will find it of use - because
> it'd be a tricky one to both reproduce and get a good error report if it
> did occur.
>
> I see an 'invalid opcode' error in here, that's pretty unusual

Disclaimer: I'm a list regular and (small-scale) sysadmin, not a dev, and
most certainly not a btrfs dev. Take what I say with that in mind, tho
I've been active on-list for over a year and thus now have a reasonable
level of practical sysadmin configuration and crisis recovery level btrfs
experience.

You could well be quite correct with the unusual crash log and its value,
I'll leave that up to the devs to decide, but that "invalid opcode: " bit
is in fact not at all unusual on btrfs. Tho I can say it fooled me
originally as well, because it certainly /looks/ both suspicious and in
general unusual.

Based on how a dev explained it to me, I believe btrfs actually
deliberately uses that opcode to trigger a semi-controlled crash in
instances where code that "should never happen" actually gets executed for
some reason, leaving the kernel in an unknown state and thus not
trustworthy enough to reliably write to storage devices and do a
controlled shutdown. That's of course why the tracebacks are there, to
help the devs figure out where it was and what triggered it, but the
opcode itself is actually quite frequently found in these tracebacks,
because it's the method chosen to deliberately trigger them.

I'd guess the same technique is actually used in various other (non-btrfs)
kernel code as well, but in fully stable code it actually is very rarely
seen, precisely because it /does/ mean the kernel reached code that it is
never expected to reach, meaning something specific went wrong to get to
that point, and in fully stable code, it's rare that any code paths
actually leading to that sort of execution point remain, as they've all
been found over the years.

But of course btrfs, while no longer experimental, remains "still
stabilizing and maturing, not yet fully stable or mature", so there's
still code paths left that do still occasionally reach these intended to
be unreachable code points, and when that happens, triggering a crash and
hopefully getting a traceback that helps the devs figure out which code
path has the bug and why, is a good thing to do, and this is apparently
the way it's done.

(BTW, compliments on the nick and email address. =:^)

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
btrfs check inconsistency with raid1, backstory, part 2
(sdb are a raid0, btrfs receive destination, they don't come up in this
dmesg at all)
sdd are a raid1, btrfs send source

All four devices are in USB enclosures on a new Intel NUC that's had
~32 hours burnin with memtest. Both file systems had received many
scrubs before with zero errors all time, including most recently within
the past few days. What's new about today's setup is the NUC, and the
fact all four drives are directly connected, not to a USB hub. So
right off the bat I'm going to suspect hardware problems are due to
insufficient USB bus power.

kernel messages:
https://drive.google.com/open?id=0B_2Asp8DGjJ9Z0hhbVUwakF5Y2c

Lines 6-28
I'm pretty sure what happens first is sdd is producing spurious data
(bad tree block start). I can't tell if the 'read error correct'
messages are fixing sde or sdd? In any case, since the last thing that
happened before this was a passing scrub, none of these corrections
written to disk are warranted and are suspect. Maybe what happens is
the reads are bad but the writes back to the device (corrections) are
OK.

Next, line 29 to 48, looks like there is a USB bus reset, sde vanishes
off the bus, and reappears as sdf, at which point thousands of write
errors ensue.

And then I became aware of all of this, did an abrupt shutdown
(poweroff -f) at which point the journal ends. And then we go to part
1 for what I did next to try to recover.

-- 
Chris Murphy
Re: dear developers, can we have notdatacow + checksumming, plz?
On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
>
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has
the checksums for those blocks, then the blocks which link to that metadata
(EG directory entries referencing file metadata) have checksums of those.
For each metadata block there is a new version that is eventually linked
from a new version of the tree root.

This means that the regular checksum mechanisms can't work with nocow data.
A filesystem can have checksums just pointing to data blocks but you need to
cater for the case where a corrupt metadata block points to an old version
of a data block and matching checksum. The way that BTRFS works with an
entire checksummed tree means that there's no possibility of pointing to an
old version of a data block.

The NetApp published research into hard drive errors indicates that they are
usually in small numbers and located in small areas of the disk. So if BTRFS
had a nocow file with any storage method other than dup you would have
metadata and file data far enough apart that they are not likely to be hit
by the same corruption (and the same thing would apply with most Ext4 Inode
tables and data blocks). I think that a file mode where there were checksums
on data blocks with no checksums on the metadata tree would be useful. But
it would require a moderate amount of coding and there's lots of other
things that the developers are working on.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/
Re: btrfs check inconsistency with raid1, part 1
Thanks for the reply.

On Sun, Dec 13, 2015 at 10:48 PM, Qu Wenruo wrote:
>
>
> Chris Murphy wrote on 2015/12/13 21:16 -0700:
>> btrfs check with devid 1 and 2 present produces thousands of scary
>> messages, e.g.
>> checksum verify failed on 714189357056 found E4E3BDB6 wanted
>
>
> Checked the full output.
> The interesting part is, the calculated result is always E4E3BDB6, and
> wanted is always all 0.
>
> I assume E4E3BDB6 is crc32 of all 0 data.
>
>
> If there is a full disk dump, it will be much easier to find where the
> problem is.
> But I'm afraid it won't be possible.

What is a full disk dump? I can try to see if it's possible. Main
thing though is only if it can make Btrfs overall better, because I
don't need this volume repaired, there's no data loss (backups!) so
this volume's purpose now is for study.

> At least, 'btrfs-debug-tree -t 2' should help to locate what's wrong with
> the bytenr in the warning.

Both devs attached (not mounted).

[root@f23a ~]# btrfs-debug-tree -t 2 /dev/sdb > btrfsdebugtreet2_verb.txt
checksum verify failed on 714189570048 found E4E3BDB6 wanted
checksum verify failed on 714189570048 found E4E3BDB6 wanted
checksum verify failed on 714189471744 found E4E3BDB6 wanted
checksum verify failed on 714189471744 found E4E3BDB6 wanted
checksum verify failed on 714189357056 found E4E3BDB6 wanted
checksum verify failed on 714189357056 found E4E3BDB6 wanted
checksum verify failed on 714189750272 found E4E3BDB6 wanted
checksum verify failed on 714189750272 found E4E3BDB6 wanted

https://drive.google.com/open?id=0B_2Asp8DGjJ9NUdmdXZFQ1Myek0

>
>
> The good news is, the fs seems to be OK without major problem.
> As except the csum error, btrfsck doesn't give other error/warning.

Yes, I think so. Main issue here seems to be the scary warnings and
uncertainty what the user should do next, if anything at all.

> I guess btrfsck did the wrong device assemble, but that's just my personal
> guess.
> And since I can't reproduce in my test environment, it won't be easy to find
> the root cause.

It might be reproducible. More on that in the next email. Easy to get
you remote access if useful.

>> So. What's the theory in this case? And then does it differ from reality?
>
>
> Personally speaking, it may be a false alert from btrfsck.
> So in this case, I can't provide much help.
>
> If you're brave enough, mount it rw to see what will happen(although it may
> mount just OK).

I'm brave enough. I'll give it a try tomorrow unless there's another
request for more info before then.

-- 
Chris Murphy
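With thousands of "checksum verify failed" lines it helps to reduce them to the distinct block addresses and the distinct "found" values, which is what backs up Qu's observation that the computed value is always the same. The following is a small sketch of that reduction, assuming the stderr text is available as a string. The function name summarize_csum_failures is invented for this example, and the sample is the eight lines pasted above.

```python
import re
from collections import Counter

_CSUM_RE = re.compile(
    r"checksum verify failed on (\d+) found ([0-9A-Fa-f]{8})"
)


def summarize_csum_failures(log):
    """Return (Counter of failures per bytenr, set of 'found' checksums)."""
    per_block = Counter()
    found_values = set()
    for bytenr, found in _CSUM_RE.findall(log):
        per_block[int(bytenr)] += 1
        found_values.add(found.upper())
    return per_block, found_values


# The eight lines printed by btrfs-debug-tree in this message.
SAMPLE = """\
checksum verify failed on 714189570048 found E4E3BDB6 wanted
checksum verify failed on 714189570048 found E4E3BDB6 wanted
checksum verify failed on 714189471744 found E4E3BDB6 wanted
checksum verify failed on 714189471744 found E4E3BDB6 wanted
checksum verify failed on 714189357056 found E4E3BDB6 wanted
checksum verify failed on 714189357056 found E4E3BDB6 wanted
checksum verify failed on 714189750272 found E4E3BDB6 wanted
checksum verify failed on 714189750272 found E4E3BDB6 wanted
"""

if __name__ == "__main__":
    blocks, found = summarize_csum_failures(SAMPLE)
    for bytenr in sorted(blocks):
        print("%d: %d failure(s)" % (bytenr, blocks[bytenr]))
    print("distinct found values:", ", ".join(sorted(found)))
```

A single distinct "found" value across many different blocks suggests the same content is being read everywhere, rather than random corruption.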
Re: Still not production ready
Duncan wrote on 2015/12/14 06:21 +:
> Qu Wenruo posted on Mon, 14 Dec 2015 10:08:16 +0800 as excerpted:
>
>> Martin Steigerwald wrote on 2015/12/13 23:35 +0100:
>>> Hi!
>>>
>>> For me it is still not production ready.
>>
>> Yes, this is the *FACT* and not everyone has a good reason to deny it.
>
> In the above sentence, I /think/ you (Qu) agree with Martin (and I) that
> btrfs shouldn't be considered production ready... yet, and the first part
> of the sentence makes it very clear that you feel strongly about the
> *FACT*, but the second half of the sentence (after *FACT*) doesn't parse
> well in English, thus leaving the entire sentence open to interpretation,
> tho it's obvious either way that you feel strongly about it. =:^\

Oh, my poor English... :(

The latter half is just in case someone considers btrfs stable in some
respects.

> At the risk of getting it completely wrong, what I /think/ you meant to
> say is (as expanded in typically Duncan fashion =:^)...
>
> Yes, this is the *FACT*, though some people have reasons to deny it.

Right! That's what I want to say!!

> Presumably, said reasons would include the fact that various distros are
> trying to sell enterprise support contracts to customers very eager to
> have the features that btrfs provides, and said customers are willing to
> pay for assurances that the solutions they're buying are "production
> ready", whether that's actually the case or not, presumably because said
> payment is (in practice) simply ensuring there's someone else to pin the
> blame on if things go bad.
>
> And the demonstration of that would be the continued fact that people
> otherwise unnecessarily continue to pay rather large sums of money for
> that very assurance, when in practice, they'd get equal or better support
> not worrying about that payment, but instead actually making use of
> free-of-cost resources such as this list.
[Linguistic analysis, see frequent discussion of this topic at Language Log, which I happen to subscribe to as I find this sort of thing interesting, for more commentary and examples of the same general issue: http://languagelog.net ] The problem with the sentence as originally written, is that English doesn't deal well with multi-negation, sometimes considering each negation an inversion of the previous (as do most programming languages and thus programmers), while other times or as read/heard/interpreted by others repeated negation may be considered a strengthening of the original negation. Regardless, mis-negation due to speaker/writer confusion is quite common even among native English speakers/writers. The negating words in question here are "not" and "deny". If you will note, my rewrite kept "deny", but rewrote the "not" out of the sentence, so there's only one negative to worry about, making the meaning much clearer as the reader's mind isn't left trying to figure out what the speaker meant with the double-negative (mistake? deliberate canceling out of the first negative with the second? deliberate intensifier?) and thus unable to be sure one way or the other what was meant. And just in case there would have been doubt, the explanation then makes doubly obvious what I think your intent was by expanding on it. Of course that's easy to do as I entirely agree. OTOH if I'm mistaken as to your intent and you meant it the other way... well then you'll need to do the explaining as then the implication is that some people have good reasons to deny it and you agree with them, but without further expansion, I wouldn't know where you're trying to go with that claim. Just in case there's any doubt left of my own opinion on the original claim of not production ready in the above discussion, let me be explicit: I (too) agree with Martin (and I think with Qu) that btrfs isn't yet production ready. 
But I don't believe you'll find many on the list taking issue with that, as I think everybody on-list agrees, btrfs /isn't/ production ready. Certainly pretty much just that has been repeatedly stated in individualized style by many posters including myself, and I've yet to see anyone take serious issue with it. No matter whether SLES 12 uses it as default for root, no matter whether Fujitsu and Facebook use it: I will not let this onto any customer machine without lots and lots of underprovisioning and rigorous free space monitoring. Actually I will renew my recommendations in my trainings to be careful with BTRFS. ... And were I to put money on it, my money would be on every regular on-list poster 100% agreeing with that. =:^) From my experience the monitoring would check for:
merkaba:~> btrfs fi show /home
Label: 'home' uuid: […]
        Total devices 2 FS bytes used 156.31GiB
        devid 1 size 170.00GiB used 164.13GiB path /dev/[path1]
        devid 2 size 170.00GiB used 164.13GiB path /dev/[path2]
If "used" is same as "size" then make big fat alarm. It is not sufficient for it to happen. It can run for quite some time just fine without any
Re: Kernel lockup, might be helpful log.
I can't help with the call traces. But several (not all) of the hard resetting link messages are hallmark cases where the SCSI command timer default of 30 seconds looks like it's being hit while the drive itself is hung up doing a sector read recovery (multiple attempts). It's worth seeing if 'smartctl -l scterc <device>' will report back that SCT ERC is supported and that it's just disabled, meaning you can change this to something sane with 'smartctl -l scterc,70,70 <device>' (7 seconds for both reads and writes), which will make the drive time out before the Linux kernel command timer. That'll let Btrfs do the right thing, rather than constantly getting poked in both eyes by link resets. Chris Murphy
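The timing argument above can be sketched in a few lines of Python. This is only an illustration, not something from the original post: the function names `erc_seconds` and `drive_gives_up_first` are hypothetical, and the sketch assumes the SCT ERC values are given in units of 100 ms (as smartctl takes them) while the kernel command timer is in seconds, with the 30 second default mentioned above.

```python
def erc_seconds(deciseconds: int) -> float:
    """Convert an SCT ERC value (units of 100 ms) to seconds."""
    return deciseconds / 10.0


def drive_gives_up_first(read_ds: int, write_ds: int,
                         kernel_timeout_s: float = 30.0) -> bool:
    """Return True if the drive abandons error recovery before the
    kernel's SCSI command timer would fire.

    A value of 0 means SCT ERC is disabled, so the drive may keep
    retrying for an unbounded time; that case is treated as unsafe.
    """
    if read_ds <= 0 or write_ds <= 0:
        return False
    slowest = max(erc_seconds(read_ds), erc_seconds(write_ds))
    return slowest < kernel_timeout_s


# The 70,70 setting suggested above: 7 s recovery limit vs. a 30 s timer.
print(drive_gives_up_first(70, 70))
# SCT ERC disabled: the drive can outlast the kernel timer.
print(drive_gives_up_first(0, 0))
```

On a real system the kernel side of the comparison can be read from /sys/block/<device>/device/timeout; the sketch simply takes it as a parameter so it stays runnable without hardware.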