On 28.05.2018 14:51, Qu Wenruo wrote:
>
>
> On 2018年05月28日 16:51, Nikolay Borisov wrote:
>>
>>
>> On 28.05.2018 11:35, Nikolay Borisov wrote:
>>>
>>>
>>> On 28.05.2018 11:21, Misono Tomohiro wrote:
>>>> Hello,
>>>>
>>>> I found that current misc-next sometimes fails btrfs/152 when the
>>>> number of CPUs is >= 4 in my VM, and git bisect points to this commit.
>>>> (btrfs/152 performs parallel send/receive.)
>>>>
>>>> The failure is caught by _check_dmesg and sometimes also leads to an
>>>> inconsistent fs.
>>>>
>>>> dmesg looks like:
>>>> [    6.649213] WARNING: CPU: 0 PID: 2838 at fs/btrfs/transaction.c:303 record_root_in_trans+0x38/0xd0
>>>> [    6.650807] CPU: 0 PID: 2838 Comm: btrfs Tainted: G        W         4.17.0-rc6+ #113
>>>> [    6.651998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
>>>> [    6.653325] RIP: 0010:record_root_in_trans+0x38/0xd0
>>>> [    6.654090] RSP: 0018:ffff90b781e23a68 EFLAGS: 00010206
>>>> [    6.654882] RAX: ffff8f346a48b720 RBX: ffff8f3479ce86c0 RCX: 0000000000000000
>>>> [    6.655965] RDX: 0000000000000000 RSI: ffff8f347c8ec000 RDI: ffff8f346a03a958
>>>> [    6.657059] RBP: ffff90b781e23b80 R08: 0000000000025870 R09: ffffffffa8c112bb
>>>> [    6.658162] R10: ffff90b781e23ba0 R11: 0000000000000002 R12: ffff8f346a03a958
>>>> [    6.659290] R13: ffff8f347c8d9000 R14: ffff8f347ba2a0b8 R15: 0000000000000000
>>>> [    6.660374] FS:  00007f3b62ebc8c0(0000) GS:ffff8f347fc00000(0000) knlGS:0000000000000000
>>>> [    6.661349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [    6.661928] CR2: 00007f0a641995e0 CR3: 000000006af5c000 CR4: 00000000000006f0
>>>> [    6.662651] Call Trace:
>>>> [    6.662909]  create_pending_snapshot+0x1ab/0xd00
>>>> [    6.663391]  ? btrfs_delayed_inode_release_metadata+0xe0/0xf0
>>>> [    6.663972]  ? __btrfs_update_delayed_inode+0x1aa/0x210
>>>> [    6.664526]  ? __btrfs_release_delayed_node.part.18+0x8f/0x1d0
>>>> [    6.665116]  ? create_pending_snapshots+0x81/0xa0
>>>> [    6.665597]  create_pending_snapshots+0x81/0xa0
>>>> [    6.666068]  btrfs_commit_transaction+0x279/0x860
>>>> [    6.666553]  ? start_transaction+0x98/0x3c0
>>>> [    6.666989]  btrfs_mksubvol+0x589/0x5a0
>>>> [    6.667390]  ? btrfs_opendir+0x39/0x70
>>>> [    6.667778]  btrfs_ioctl_snap_create_transid+0x16a/0x1a0
>>>> [    6.668327]  btrfs_ioctl_snap_create_v2+0x121/0x180
>>>> [    6.668827]  btrfs_ioctl+0x56d/0x25a0
>>>> [    6.669205]  ? do_filp_open+0xaa/0x110
>>>> [    6.669591]  ? do_vfs_ioctl+0x9f/0x610
>>>> [    6.669980]  ? btrfs_ioctl_get_supported_features+0x20/0x20
>>>> [    6.670550]  do_vfs_ioctl+0x9f/0x610
>>>> [    6.670922]  ksys_ioctl+0x6b/0x80
>>>> [    6.671261]  __x64_sys_ioctl+0x11/0x20
>>>> [    6.671650]  do_syscall_64+0x49/0x100
>>>> [    6.672040]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>> [    6.672545] RIP: 0033:0x7f3b61cbddd7
>>>> [    6.672899] RSP: 002b:00007ffcdbe7b318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
>>>> [    6.673666] RAX: ffffffffffffffda RBX: 00007ffcdbe90610 RCX: 00007f3b61cbddd7
>>>> [    6.674430] RDX: 00007ffcdbe7b360 RSI: 0000000050009417 RDI: 0000000000000003
>>>> [    6.675154] RBP: 00005634403ec020 R08: 000000000000ffff R09: 0000000000001010
>>>> [    6.675878] R10: 0000000000000000 R11: 0000000000000246 R12: 00005634403ed670
>>>> [    6.676604] R13: 00007ffcdbe7c470 R14: 0000000000000053 R15: 00007ffcdbe7b360
>>>> [    6.677330] Code: f0 01 00 00 a8 02 74 27 48 8b 07 48 39 86 40 03 00 00 73 1b 49 39 75 28 0f 84 91 00 00 00 85 d2 75 17 48 8b 06 48 39 46 08 74 0e <0f> 0b eb 0a 85 d2 74 65 49 3b 75 28 74 76 41 89 d4 48 89 f3 48
>>>> [    6.679318] ---[ end trace b2378f91e69026c3 ]---
>>>>
>>>> transaction.c:303:
>>>>     WARN_ON(!force && root->commit_root != root->node);
>>>>
>>>> And 152.out.full looks like:
>>>> *** fsck.btrfs output ***
>>>> root item for root 261, current bytenr 53870592, current gen 149, current level 0, new bytenr 54067200, new gen 149, new level 0
>>>> Found 1 roots with an outdated root item.
>>>> Please run a filesystem check with the option --repair to fix them.
>>>>
>>>> =====
>>>> (My observation)
>>>>
>>>> The call chain is:
>>>>   btrfs_commit_transaction()
>>>>     create_pending_snapshots()
>>>>       create_pending_snapshot()
>>>>         run_delayed_item()       <- removed
>>>>         qgroup_account_snapshot()
>>>>     run_delayed_item()           <- added
>>>>
>>>> This commit removes run_delayed_item() in create_pending_snapshot() and
>>>> instead calls it after create_pending_snapshots().
>>>>
>>>> However, since qgroup_account_snapshot() updates the commit_root, if
>>>> there is more than one pending snapshot of the same source subvolume,
>>>> the root item may not be updated correctly.
>>>>
>>>> So I think we cannot remove run_delayed_item() here.
>>>
>>> So btrfs_run_delayed_items() is run once before going into
>>> create_pending_snapshots(). This means that when we go into
>>> create_pending_snapshots() the filesystem should be consistent from the
>>> POV of delayed items (as per the patch's changelog).
>>>
>>> I somehow fail to see how removing the run of delayed items (which
>>> really pertains to inode state) has anything to do with an outdated
>>> root item. IMHO you are hitting some race which my patch just made more
>>> visible by removing code from the critical path.
>
> Not entirely sure about delayed items.
> But if delayed items can cause delayed refs, then we need to run them
> before qgroup accounting.
>
>>
>> I think my claim can easily be tested by re-instating the
>> btrfs_run_delayed_items() call in create_pending_snapshot() and hacking
>> the function to return the number of delayed items processed. My
>> expectation (on which my patch is based) is for this function to always
>> return 0 in create_pending_snapshot(). Will try to test this soon and
>> report back.
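A simpler variant of the same check, which doesn't even require hacking
btrfs_run_delayed_items(), would be something like the following, dropped into
create_pending_snapshot() at the spot where the removed flush used to live.
Untested sketch; it assumes fs_info->delayed_root->items reflects the number of
delayed items currently queued:

	/*
	 * Debug only: if the single flush in btrfs_commit_transaction() is
	 * really sufficient, no delayed items should be queued by the time
	 * each pending snapshot is created.
	 *
	 * ASSUMPTION: fs_info->delayed_root->items counts the delayed items
	 * that are currently queued.
	 */
	WARN(atomic_read(&trans->fs_info->delayed_root->items),
	     "delayed items still pending in create_pending_snapshot\n");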
>>
>> In the meantime, I'd like to hear Qu's opinion on your reported
>> inconsistency, since it seems to be the canonical problem he initially
>> tried to fix in:
>>
>> 4d31778aa2fa ("btrfs: qgroup: Fix root item corruption when multiple
>> same source snapshots are created with quota enabled")
>
> Since it's mostly related to delayed items, I need some time to get
> familiar with them before I can give any really useful advice.
You can use https://github.com/btrfs/btrfs-dev-docs/blob/master/delayed-inode.txt
as a starting point; it should greatly reduce the learning curve.

On a quick look it seems the delayed-inode infrastructure can indeed cause
delayed refs, for example:

__btrfs_run_delayed_items
 __btrfs_commit_inode_delayed_items
  btrfs_insert_delayed_items
   btrfs_insert_delayed_item
    btrfs_insert_empty_item <- goes into the allocator, which creates delayed refs

or:

__btrfs_commit_inode_delayed_items
 btrfs_update_delayed_inode
  __btrfs_update_delayed_inode
   btrfs_lookup_inode <- calls btrfs_search_slot with cow, so it will create delayed refs

So the answer to your question is "yes", in which case this patch will
indeed have to be reverted.
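Concretely, reverting means putting the per-snapshot flush back where it was
removed from create_pending_snapshot(), i.e. before record_root_in_trans() and
qgroup_account_snapshot(), so that whatever delayed refs it generates are
processed before the qgroup accounting for that snapshot. A sketch, essentially
the removed hunk below (not compile-tested; the comment wording is mine):

	/*
	 * Pull in the delayed directory updates and the delayed inode items.
	 * Running these can generate new delayed refs, so it has to happen
	 * before qgroup_account_snapshot() accounts this snapshot.
	 */
	ret = btrfs_run_delayed_items(trans);
	if (ret) {	/* Transaction aborted */
		btrfs_abort_transaction(trans, ret);
		goto fail;
	}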
>
> Thanks,
> Qu
>
>>
>>
>>>
>>>>
>>>> Thanks,
>>>> Tomohiro Misono
>>>>
>>>> On 2018/02/13 23:16, Nikolay Borisov wrote:
>>>>> Currently, calls to btrfs_run_delayed_items have been scattered around
>>>>> the transaction commit code with no real design argument for when they
>>>>> should be executed.
>>>>>
>>>>> We have one call after the transaction writers count goes to 0. Then the
>>>>> delayed items are run as part of creating a snapshot (once per pending
>>>>> snapshot). Finally, delayed items are run once more _after_ snapshots
>>>>> have been created. All in all this amounts to 2 + N executions
>>>>> (N = number of snapshots slated for creation). In reality we need to
>>>>> flush the delayed items once before create_pending_snapshots is called,
>>>>> to ensure snapshots are consistent with inode data, and once after
>>>>> snapshots are created, so that inode items newly introduced during the
>>>>> snapshot creation process are correctly persisted on disk. This patch
>>>>> brings the total number of executions of run_delayed_items down to
>>>>> just 2.
>>>>>
>>>>> This survived multiple xfstests runs.
>>>>>
>>>>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>>>>> ---
>>>>>  fs/btrfs/transaction.c | 31 ++++++++++---------------------
>>>>>  1 file changed, 10 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>> index 02bc1e6212e6..b32d3136f36c 100644
>>>>> --- a/fs/btrfs/transaction.c
>>>>> +++ b/fs/btrfs/transaction.c
>>>>> @@ -1524,18 +1524,6 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>>>>  	}
>>>>>  	btrfs_release_path(path);
>>>>>  
>>>>> -	/*
>>>>> -	 * pull in the delayed directory update
>>>>> -	 * and the delayed inode item
>>>>> -	 * otherwise we corrupt the FS during
>>>>> -	 * snapshot
>>>>> -	 */
>>>>> -	ret = btrfs_run_delayed_items(trans);
>>>>> -	if (ret) {	/* Transaction aborted */
>>>>> -		btrfs_abort_transaction(trans, ret);
>>>>> -		goto fail;
>>>>> -	}
>>>>> -
>>>>>  	record_root_in_trans(trans, root, 0);
>>>>>  	btrfs_set_root_last_snapshot(&root->root_item, trans->transid);
>>>>>  	memcpy(new_root_item, &root->root_item, sizeof(*new_root_item));
>>>>> @@ -2069,10 +2057,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>>  	wait_event(cur_trans->writer_wait,
>>>>>  		   extwriter_counter_read(cur_trans) == 0);
>>>>>  
>>>>> -	/* some pending stuffs might be added after the previous flush. */
>>>>> -	ret = btrfs_run_delayed_items(trans);
>>>>> -	if (ret)
>>>>> -		goto cleanup_transaction;
>>>>>  
>>>>>  	btrfs_wait_delalloc_flush(fs_info);
>>>>>  
>>>>> @@ -2095,6 +2079,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>>  		ret = cur_trans->aborted;
>>>>>  		goto scrub_continue;
>>>>>  	}
>>>>> +
>>>>> +	/*
>>>>> +	 * Run delayed items because we need them to be consistent on-disk
>>>>> +	 * so that snapshots created in create_pending_snapshots don't corrupt
>>>>> +	 * the filesystem. At this point we are the one doing transaction
>>>>> +	 * commit and no one can come and introduce new delayed inode items
>>>>> +	 */
>>>>> +	ret = btrfs_run_delayed_items(trans);
>>>>> +	if (ret)
>>>>> +		goto scrub_continue;
>>>>>  	/*
>>>>>  	 * the reloc mutex makes sure that we stop
>>>>>  	 * the balancing code from coming in and moving
>>>>> @@ -2102,11 +2096,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>>>>>  	 */
>>>>>  	mutex_lock(&fs_info->reloc_mutex);
>>>>>  
>>>>> -	/*
>>>>> -	 * We needn't worry about the delayed items because we will
>>>>> -	 * deal with them in create_pending_snapshot(), which is the
>>>>> -	 * core function of the snapshot creation.
>>>>> -	 */
>>>>>  	ret = create_pending_snapshots(trans);
>>>>>  	if (ret) {
>>>>>  		mutex_unlock(&fs_info->reloc_mutex);
>>>>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html