Hi Filipe,

Thank you for the explanation.

On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana <fdman...@kernel.org> wrote:
> On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Hi Filipe Manana,
>>
>> My understanding of selecting delayed refs to run or merging them is
>> far from complete. Can you please explain what will happen in the
>> following scenario:
>>
>> 1) Ref1 is created, as you explain
>> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
>> up with an EXTENT_ITEM and an inline extent back ref
>> 3) Ref2 and Ref3 are added
>> 4) Somebody calls __btrfs_run_delayed_refs()
>>
>> At this point, we cannot merge Ref2 and Ref3, because they might be
>> referencing tree blocks of completely different trees, thus
>> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
>> run, because we prefer BTRFS_ADD_DELAYED_REF over
>> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
>> now, because we already have Ref1 in the extent tree.
>
> No, that won't happen. If the ref (Ref3) is for a different tree, than
> it has a different inline extent from Ref1
> (lookup_inline_extent_backref returns -ENOENT and not 0).
Understood. So in this case, we will first add inline ref for Ref3,
and later drop the Ref1 inline ref via update_inline_extent_backref()
by truncating the EXTENT_ITEM. All in the same transaction.


>
> If they are all for the same tree it means Ref3 is not merged with
> Ref2 because they have different seq numbers and a seq value exist in
> fs_info->tree_mod_seq_list, and we skip Ref3 through
> btrfs_check_delayed_seq() until such seq number goes away from
> tree_mod_seq_list.
Ok, so we won't process this ref-head at all, until the "seq problem"
disappears.

> If no seq number exists in tree_mod_seq_list then
> we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
> running delayed refs, with Ref2 (which removes both refs since one is
> "-1" and the other "+1").
So in this case we don't care that the inline ref we have in the
EXTENT_ITEM was actually inserted on behalf of Ref1. Because it's for
the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully
equivalent. Interesting.

Thanks!
Alex.

>
> Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
>>
>> So something should prevent us from running Ref3 before running Ref2.
>> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
>> the inline backref, and then run Ref3 to create a new backref with a
>> proper owner. What is that something?
>>
>> Can you please point me at what am I missing?
>>
>> Also, can such scenario happen in 3.18 kernel, which still has an
>> rbtree per ref-head? Looking at the code, I don't see anything
>> preventing that from happening.
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Oct 25, 2015 at 8:51 PM,  <fdman...@kernel.org> wrote:
>>> From: Filipe Manana <fdman...@suse.com>
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
>>> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
>>> sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G        W   
>>>     4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>> [35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti: 
>>> ffff88010c4c8000
>>> [35908.065201] RIP: 0010:[<ffffffffa04928b5>]  [<ffffffffa04928b5>] 
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:ffff88010c4cbb08  EFLAGS: 00010293
>>> [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 
>>> 0000000000000000
>>> [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 
>>> 0000000000000000
>>> [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: 
>>> ffff88010c4cb9f8
>>> [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 
>>> 0000000000000000
>>> [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 
>>> 0000000000000000
>>> [35908.065201] FS:  0000000000000000(0000) GS:ffff88023edc0000(0000) 
>>> knlGS:0000000000000000
>>> [35908.065201] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>> [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 
>>> 00000000000006e0
>>> [35908.065201] Stack:
>>> [35908.065201]  ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 
>>> ffff88015a408000
>>> [35908.065201]  ffff880154a44000 0000000000000000 0000000000000005 
>>> ffff88010c4cbbd8
>>> [35908.065201]  ffffffffa0492b9a 0000000000000005 0000000000000000 
>>> 0000000000000000
>>> [35908.065201] Call Trace:
>>> [35908.065201]  [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b 
>>> [btrfs]
>>> [35908.065201]  [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a [btrfs]
>>> [35908.065201]  [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14 
>>> [btrfs]
>>> [35908.065201]  [<ffffffff81063b23>] process_one_work+0x24a/0x4ac
>>> [35908.065201]  [<ffffffff81064285>] worker_thread+0x206/0x2c2
>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201]  [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb
>>> [35908.065201]  [<ffffffff8106904d>] kthread+0xef/0xf7
>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201]  [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70
>>> [35908.065201]  [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24
>>> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 
>>> 8d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 
>>> <0f> 0b 4c 8b 45 30 8b 4d 28 45 31
>>> [35908.065201] RIP  [<ffffffffa04928b5>] 
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201]  RSP <ffff88010c4cbb08>
>>> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>>>
>>> This happens because the new delayed references code no longer merges
>>> delayed references that have different sequence values. The following
>>> steps are an example sequence leading to this issue:
>>>
>>> 1) Transaction N starts, fs_info->tree_mod_seq has value 0;
>>>
>>> 2) Extent buffer (btree node) A is allocated, delayed reference Ref1 for
>>>    bytenr A is created, with a value of 1 and a seq value of 0;
>>
>> What happens if Ref1 is run at this point?
>>
>>>
>>> 3) fs_info->tree_mod_seq is incremented to 1;
>>>
>>> 4) Extent buffer A is deleted through btrfs_del_items(), which calls
>>>    btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The
>>>    later returns the metadata extent associated to extent buffer A to
>>>    the free space cache (the range is not pinned), because the extent
>>>    buffer was created in the current transaction (N) and writeback never
>>>    happened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN not set
>>>    in the extent buffer).
>>>    This creates the delayed reference Ref2 for bytenr A, with a value
>>>    of -1 and a seq value of 1;
>>>
>>> 5) Delayed reference Ref2 is not merged with Ref1 when we create it,
>>>    because they have different sequence numbers (decided at
>>>    add_delayed_ref_tail_merge());
>>>
>>> 6) fs_info->tree_mod_seq is incremented to 2;
>>>
>>> 7) Some task attempts to allocate a new extent buffer (done at
>>>    extent-tree.c:find_free_extent()), but due to heavy fragmentation
>>>    and running low on metadata space the clustered allocation fails
>>>    and we fall back to unclustered allocation, which finds the
>>>    extent at offset A, so a new extent buffer at offset A is allocated.
>>>    This creates delayed reference Ref3 for bytenr A, with a value of -1
>>>    and a seq value of 2;
>>>
>>> 8) Ref3 is not merged neither with Ref2 nor Ref1, again because they
>>>    all have different seq values;
>> Can Ref3 be merged with Ref1? They might be referencing blocks of
>> completely different trees (although using the same bytenr). So they
>> will have different root/parent and thus cannot be merged, I think.
>>
>>>
>>> 9) We start running the delayed references (__btrfs_run_delayed_refs());
>>>
>>> 10) The delayed Ref1 is the first one being applied, which ends up
>>>     creating an inline extent backref in the extent tree;
>>>
>>> 10) Next the delayed reference Ref3 is selected for execution, and not
>>>     Ref2, because select_delayed_ref() always gives a preference for
>>>     positive references (that have an action of BTRFS_ADD_DELAYED_REF);
>>>
>>> 11) When running Ref3 we encounter alreay the inline extent backref
>>>     in the extent tree at insert_inline_extent_backref(), which makes
>>>     us hit the following BUG_ON:
>>>
>>>         BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);
>>>
>>>     This is always true because owner corresponds to the level of the
>>>     extent buffer/btree node in the btree.
>>>
>>> For the scenario described above we hit the BUG_ON because we never merge
>>> references that have different seq values.
>>>
>>> We used to do the merging before the 4.2 kernel, more specifically, before
>>> the commmits:
>>>
>>>   c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in 
>>> ref_head.")
>>>   c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")
>>>
>>> This issue became more exposed after the following change that was added
>>> to 4.2 as well:
>>>
>>>   cffc3374e567 ("Btrfs: fix order by which delayed references are run")
>>>
>>> Which in turn fixed another regression by the two commits previously
>>> mentioned.
>>>
>>> So fix this by bringing back the delayed reference merge code, with the
>>> proper adaptations so that it operates against the new data structure
>>> (linked list vs old red black tree implementation).
>>>
>>> This issue was hit running fstest btrfs/063 in a loop. Several people have
>>> reported this issue in the mailing list when running on kernels 4.2+.
>>>
>>> Very special thanks to Stéphane Lesimple for helping debugging this issue
>>> and testing this fix on his multi terabyte filesystem (which took more
>>> than one day to balance alone, plus fsck, etc).
>>>
>>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root 
>>> in ref_head.")
>>> Reported-by: Peter Becker <floyd....@gmail.com>
>>> Reported-by: Stéphane Lesimple <stephane_bt...@lesimple.fr>
>>> Tested-by: Stéphane Lesimple <stephane_bt...@lesimple.fr>
>>> Reported-by: Malte Schröder <ma...@tnxip.de>
>>> Reported-by: Derek Dongray <de...@valedon.co.uk>
>>> Reported-by: Erkki Seppala <flux-bt...@inside.org>
>>> Cc: sta...@vger.kernel.org  # 4.2+
>>> Signed-off-by: Filipe Manana <fdman...@suse.com>
>>> ---
>>>
>>> V2: No code changes. Added Stéphane's Tested-by tag and made patch
>>>     part of a 2 fixes series, to reflect the other remaining fix
>>>     depends on this one.
>>> V3: No changes. Updated only the 2nd patch in the series to v3 (change log
>>>     updated).
>>>
>>>  fs/btrfs/delayed-ref.c | 113 
>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>  fs/btrfs/extent-tree.c |  14 ++++++
>>>  2 files changed, 127 insertions(+)
>>>
>>> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
>>> index bd9b63b..2f41580 100644
>>> --- a/fs/btrfs/delayed-ref.c
>>> +++ b/fs/btrfs/delayed-ref.c
>>> @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct 
>>> btrfs_trans_handle *trans,
>>>                 trans->delayed_ref_updates--;
>>>  }
>>>
>>> +static bool merge_ref(struct btrfs_trans_handle *trans,
>>> +                     struct btrfs_delayed_ref_root *delayed_refs,
>>> +                     struct btrfs_delayed_ref_head *head,
>>> +                     struct btrfs_delayed_ref_node *ref,
>>> +                     u64 seq)
>>> +{
>>> +       struct btrfs_delayed_ref_node *next;
>>> +       bool done = false;
>>> +
>>> +       next = list_first_entry(&head->ref_list, struct 
>>> btrfs_delayed_ref_node,
>>> +                               list);
>>> +       while (!done && &next->list != &head->ref_list) {
>>> +               int mod;
>>> +               struct btrfs_delayed_ref_node *next2;
>>> +
>>> +               next2 = list_next_entry(next, list);
>>> +
>>> +               if (next == ref)
>>> +                       goto next;
>>> +
>>> +               if (seq && next->seq >= seq)
>>> +                       goto next;
>>> +
>>> +               if (next->type != ref->type || next->no_quota != 
>>> ref->no_quota)
>>> +                       goto next;
>>> +
>>> +               if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> +                    ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&
>>> +                   comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),
>>> +                                  btrfs_delayed_node_to_tree_ref(next),
>>> +                                  ref->type))
>>> +                       goto next;
>>> +               if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY ||
>>> +                    ref->type == BTRFS_SHARED_DATA_REF_KEY) &&
>>> +                   comp_data_refs(btrfs_delayed_node_to_data_ref(ref),
>>> +                                  btrfs_delayed_node_to_data_ref(next)))
>>> +                       goto next;
>>> +
>>> +               if (ref->action == next->action) {
>>> +                       mod = next->ref_mod;
>>> +               } else {
>>> +                       if (ref->ref_mod < next->ref_mod) {
>>> +                               swap(ref, next);
>>> +                               done = true;
>>> +                       }
>>> +                       mod = -next->ref_mod;
>>> +               }
>>> +
>>> +               drop_delayed_ref(trans, delayed_refs, head, next);
>>> +               ref->ref_mod += mod;
>>> +               if (ref->ref_mod == 0) {
>>> +                       drop_delayed_ref(trans, delayed_refs, head, ref);
>>> +                       done = true;
>>> +               } else {
>>> +                       /*
>>> +                        * Can't have multiples of the same ref on a tree 
>>> block.
>>> +                        */
>>> +                       WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY ||
>>> +                               ref->type == BTRFS_SHARED_BLOCK_REF_KEY);
>>> +               }
>>> +next:
>>> +               next = next2;
>>> +       }
>>> +
>>> +       return done;
>>> +}
>>> +
>>> +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
>>> +                             struct btrfs_fs_info *fs_info,
>>> +                             struct btrfs_delayed_ref_root *delayed_refs,
>>> +                             struct btrfs_delayed_ref_head *head)
>>> +{
>>> +       struct btrfs_delayed_ref_node *ref;
>>> +       u64 seq = 0;
>>> +
>>> +       assert_spin_locked(&head->lock);
>>> +
>>> +       if (list_empty(&head->ref_list))
>>> +               return;
>>> +
>>> +       /* We don't have too many refs to merge for data. */
>>> +       if (head->is_data)
>>> +               return;
>>> +
>>> +       spin_lock(&fs_info->tree_mod_seq_lock);
>>> +       if (!list_empty(&fs_info->tree_mod_seq_list)) {
>>> +               struct seq_list *elem;
>>> +
>>> +               elem = list_first_entry(&fs_info->tree_mod_seq_list,
>>> +                                       struct seq_list, list);
>>> +               seq = elem->seq;
>>> +       }
>>> +       spin_unlock(&fs_info->tree_mod_seq_lock);
>>> +
>>> +       ref = list_first_entry(&head->ref_list, struct 
>>> btrfs_delayed_ref_node,
>>> +                              list);
>>> +       while (&ref->list != &head->ref_list) {
>>> +               if (seq && ref->seq >= seq)
>>> +                       goto next;
>>> +
>>> +               if (merge_ref(trans, delayed_refs, head, ref, seq)) {
>>> +                       if (list_empty(&head->ref_list))
>>> +                               break;
>>> +                       ref = list_first_entry(&head->ref_list,
>>> +                                              struct 
>>> btrfs_delayed_ref_node,
>>> +                                              list);
>>> +                       continue;
>>> +               }
>>> +next:
>>> +               ref = list_next_entry(ref, list);
>>> +       }
>>> +}
>>> +
>>>  int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
>>>                             struct btrfs_delayed_ref_root *delayed_refs,
>>>                             u64 seq)
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 92fdbc6..c0f30f5 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -2502,7 +2502,21 @@ static noinline int __btrfs_run_delayed_refs(struct 
>>> btrfs_trans_handle *trans,
>>>                         }
>>>                 }
>>>
>>> +               /*
>>> +                * We need to try and merge add/drops of the same ref since 
>>> we
>>> +                * can run into issues with relocate dropping the implicit 
>>> ref
>>> +                * and then it being added back again before the drop can
>>> +                * finish.  If we merged anything we need to re-loop so we 
>>> can
>>> +                * get a good ref.
>>> +                * Or we can get node references of the same type that 
>>> weren't
>>> +                * merged when created due to bumps in the tree mod seq, and
>>> +                * we need to merge them to prevent adding an inline extent
>>> +                * backref before dropping it (triggering a BUG_ON at
>>> +                * insert_inline_extent_backref()).
>>> +                */
>>>                 spin_lock(&locked_ref->lock);
>>> +               btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,
>>> +                                        locked_ref);
>>>
>>>                 /*
>>>                  * locked_ref is the head node, so we have to go one
>>> --
>>> 2.1.3
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to