On Tue, Jul 26, 2016 at 11:04 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> At 07/26/2016 05:28 PM, Filipe Manana wrote:
>>
>> On Tue, Jul 26, 2016 at 1:57 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>>
>>>
>>> At 07/25/2016 09:48 PM, Filipe Manana wrote:
>>>>
>>>>
>>>> On Mon, Jul 25, 2016 at 8:19 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> This patch will disable clone detection of send.
>>>>>
>>>>> The main problem of clone detetion in send is its memory usage and long
>>>>> execution time.
>>>>>
>>>>> The clone detection is done by iterating all backrefs and adding
>>>>> backref
>>>>> whose root is the source.
>>>>>
>>>>> However iterating all backrefs is already quite a bad idea, we should
>>>>> never try to do it in a loop, and unfortunately in-band/out-of-band and
>>>>> reflink can easily create a file whose file extents are point to the
>>>>> same extent.
>>>>>
>>>>> In that case, btrfs will do backref walk for the same extent again and
>>>>> again, until either OOM or soft lockup is triggered.
>>>>>
>>>>> So disabling clone detection until we find a method that iterates
>>>>> backrefs of one extent only once, just like what balance/qgroup is
>>>>> doing.
>>>>
>>>>
>>>>
>>>> Is this really a common scenario?
>>>
>>>
>>>
>>> If any user can create such file without root privilege, and takes kernel
>>> CPU for a long time, no matter if it's a common scenario, it's a big
>>> problem.
>>
>>
>> He can also cause large cpu consumption by making an extent shared
>> many many times. There's a recent generic fstests case that exercises
>> that (keeps doing reflinks until ENOSPC happens).
>
>
> Which one?
> And if you're talking about generic/175, then btrfs is not that slow.
> Btrfsck is the problem.

generic/333 and generic/334.

>
>> Shall we disable reflink support too? (clone and extent_same ioctls)
>
>
> No, you didn't ever understand the problem.
> Unlike reflink, which won't ever do backref walk on the same extent.

I'm giving you an example where too many references cause a slowdown.
I'm not telling you that it's caused by backref walking. It does not
matter what causes it,
the point is if should we completely disable some feature if it has
performance problems for some edge case(s)?
Being the example shared extents/reflinks here.

> Its execution time is only affected by the size of extent and fs tree.
>
> Completely different from the cubic growth which should be avoided.
>
>>
>>>
>>>> I don't recall ever seeing a report of such slow down or oom due to
>>>> the same file pointing to the same extent thousands of times.
>>>
>>>
>>> Then, check the test case I submitted days ago. No dedupe involved at
>>> all.
>>> IIRC, I have Cced you.
>>>
>>> https://patchwork.kernel.org/patch/9235825/
>>>
>>> And, FIEMAP ioctl has the same problem, test case is already merged into
>>> fstests, check generic/352.
>>
>>
>> Yes, I saw that.
>>
>>>
>>>> Sure it's easy to create such scenario with artificial test data for
>>>> our inband deduplication feature (and we should really strive to make
>>>> what we have stable and reliable rather than add yet more features).
>>>
>>>
>>>
>>> This can be triggered even without the dedupe. Just check the test case.
>>
>>
>> I have been around long enough to realize we can get shared extents
>> without the dedupe ioctl...
>> What I'm telling you is that instead of adding yet more ways for a
>> user to shoot himself/herself in the foot, we should fix first the
>> core problems.
>
>
> Dedupe is completely unrelated with this bug.
> The problem will exist no matter there is dedupe or not.


Yes, I know and I told you that before, I'm still aware that we can
get shared extents without using dedupe....

>
> I'm surprised that some one would blame unrelated code just because it can
> cause some existing bugs.
>
> And without the test of dedupe(inband or out-of-band), we know how long the
> problem will still exist?
>
> And for your so called "core" fix, of course you can try your best to do a
> total rework of backref walk code.
> Adding cache or whatever you can thing of to speedup the backref walk.
>
> But it will mostly end up as a failure.
>
> As long as btrfs does snapshot by cowing its tree root, cache is not
> possible at all.
>
> Just face the fact, you SHOULD NEVER call find_parent_nodes() on each file
> extent you find.
> This multiples O(n) to the existing find_parent_nodes().
>
> Even we find a way to reduce execution time of find_parent_nodes() from
> O(n^2) to O(n) (Already impossible though), calling it unconditionally on
> every extent will still leads to O(n^2).
>
> The total execution time of send is consist of:
> Send execution time = Number of file extents * find_parent_nodes() time
>
> The main problem is the "Number of file extents".

Qu, I got that in your patch's change log.
You are still missing the point in my replies, see below.

> You can reduce find_parent_nodes(). But even you reduce it from 1.6s to
> 0.16s.
> The total time just reduce from 5000+ sec to 500+sec.
> Still no really helped.
>
> You should reduce the "number of file extents" to O(1), or at least O(logN).
> (Just like qgroup, build a rb tree to record which extent has been iterated,
> then it could be fixed much easier)
> This part is not backref walk code, and send is completely the cause.
>
>
> Instead of complaining the slow backref walk, why not do the current best
> fix to avoid calling unneeded find_parent_nodes() in send?

You got a whole different interpretation of my reply.
What I've been trying to tell you is that trying to disable things
because of performance problems for edge cases is not always a good
idea,
and that different variants of the problem exist.

>>
>>
>>
>>>
>>> And you should yell at out-of-band dedupe first than in-band, because
>>> with
>>> out-of-band dedupe, all these bugs can be triggered for a long time, but
>>> no
>>> one really cares until it's exposed by in-band dedupe.
>>
>>
>> Just because we already have ways to get into a problem, then it's ok
>> to add yet one more way to get into the problem?
>> Doesn't make a lot of sense to me.
>
>
> It also doesn't make sense to blame a unrelated function just because there
> is existing bugs.
>
>>
>>>
>>> The only thing dedupe get involved is, it makes us know there is still a
>>> lot
>>> of bugs that no one really cares before.
>>
>>
>> Or you think no one cares about.
>
>
> So other ones cares it so much that no one even send out a test case?

We've had features added and bug fixes without test cases for a long time.
Does everyone who cares about a particular feature also keeps adding
tests for the cases where the feature works correctly atm?
And people might have not had the time yet to do it...

>
>
>>
>>>
>>>>
>>>> What needs to be fixed here is the common backref walking code, called
>>>> by send plus a few other places (fiemap and some ioctls at least), for
>>>> which IIRC there was
>>>> some recent patch from one of your colleagues.
>>>
>>>
>>>
>>> Yes, Lu's fix for fiemap is OK, but that won't really fix the problem.
>>> The fix is only for fiemap, as it just do a early exit, instead of always
>>> do
>>> a backref walk.
>>>
>>> There are several backref callers, including:
>>> 1) balance
>>> 2) qgroup
>>> 3) fiemap (check_shared)
>>> 4) send
>>>
>>> But for the same reflinked file, 1) and 2) won't cause such long
>>> execution
>>> time, just because balance and qgroup will at most call
>>> find_parent_nodes()
>>> on the same *EXTENT* twice.
>>>
>>> However for old fiemap, or current send, they call find_parent_nodes() on
>>> every *FILE* *EXTENT*, which points to the same *EXTENT*.
>>>
>>> And according to my ftrace, for one extent shared by 4096 times, it will
>>> takes about 1.6sec to execute find_parent_nodes().
>>>
>>> So for balance and qgroup, extra 3.2 seconds won't be a big problem.
>>> (Although mix balance and qgroup is another problem)
>>>
>>> But for fiemap or send, it will be 4096 * 1.6 = 6553 secs, definitely not
>>> the right thing.
>>>
>>>>
>>>> Disabling this code makes the problem go away for the scenario of the
>>>> same file pointing to the same extent thousands of times (or tens or
>>>> hundreds of thousands, whatever).
>>>> But what happens if a file points to an extent once but the extent is
>>>> referenced by other 100k files (possibly in different snapshots or
>>>> subvolumes), isn't it the same problem?
>>>
>>>
>>>
>>> No, not the same problem.
>>> In that case, the extent will only be iterated once, just 1.6sec.
>>
>>
>> And you have many different extents for which it takes 1.6 seconds,
>> the problem ends up being the same - long execution time.
>> What I'm trying to tell you is that, with shared extents, you can get
>> into long execution times in many different ways. Be it with
>> the same file pointing to the same extent thousands of time or
>> pointing to a lot of different extents that happen to be shared with
>> many other files/snapshots.
>
>
> Yes, shared extents will slow backref, but that's not the excuse to make
> things even worse.
>
> One extent shared by 1024 times, and all belongs to one file extent.
> And on the other hand, it's shared by 2 files, 512 times respectively.
>
> The former will make send much much slower than the latter.
> While balance/qgroup/new fiemap can handle it without problem.
>
> Just face it.
> Send clone detection has problem.

Never said it hadn't.

>>
>>
>>>
>>> Nothing will be wrong at all.
>>>
>>> The only wrong thing is, send lacks the method to avoid calling
>>> find_parent_nodes() again and again on the same extent.
>>>
>>> And yes, we can fix it in backref code, but now send is the only caller
>>> which may call find_parent_nodes() on the same extent again and again.
>>>
>>> Qgroup has been changed from call find_parent_nodes() every time to only
>>> call it twice.
>>> Balance itself is already extent based, nothing to do with the bug.
>>> And fiemap adds early quit.
>>>
>>> Only send is left here.
>>>
>>>> The backref code has to find
>>>> all such inodes/offsets/roots and take again the same long time...
>>>>
>>>> Following your logic, it seems like everything that uses the backref
>>>> walking code should be disabled.
>>>
>>>
>>> Just check my previous statement.
>>>
>>> It's not about execution time of find_parent_nodes() and its variants,
>>> but
>>> the caller behavior.
>>>
>>> If caller only call find_parent_nodes() once for one extent, it's just
>>> less
>>> than 2 seconds.(almost linear growth with the number of reference)
>>>
>>> While caller call find_parent_nodes() on every file extent it finds, then
>>> there is the problem. (squared to cubic growth)
>>
>>
>> Even if those extent items point to different extents that are shared
>> between many files/snapshots.
>
>
> But in that case, find_parent_nodes() will just be called once.

Just called once for each different physical extent.
If it has to be called once for too many different physical extents,
it will also be slow.
If some one keeps calling the ioctls that end up doing backref walking
in a loop, or doing thousands of them in parallel, you also see a
negative impact...

Some of these problems, which you only found out recently, have been
known for quite some time and reported by Zygo Blaxwell on IRC for
example.

> Just extra several seconds.
> (Unless the other file extents inside the send source root also refers to
> it)
>
> I'm so surprised that one put so much effort to make btrfs stable would
> refuse to see the existing bugs of send, even other backref walk callers
> find their method to solve the problem.

Where did I say that send has no problems?

All I have been saying to you is that there are many similar problems
using shared extents and backref walking.
And just disabling cloning for all cases is extreme when only a few
rare cases cause problems.


>
> Thanks,
> Qu
>
>
>>
>>>
>>>
>>> And, I know this is a quite bad idea to disable clone detection, so I
>>> send
>>> this patch as RFC.
>>> Just to raise the concern of any find_paren_nodes() caller, and to find a
>>> good method to fix the bug.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>>>
>>>>> Cc: Filipe Manana <fdman...@gmail.com>
>>>>> Reported-by: Tsutomu Itoh <t-i...@jp.fujitsu.com>
>>>>> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
>>>>> ---
>>>>>  fs/btrfs/send.c | 22 ++++++++++++++++++++--
>>>>>  1 file changed, 20 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
>>>>> index 2db8dc8..eed3f1c 100644
>>>>> --- a/fs/btrfs/send.c
>>>>> +++ b/fs/btrfs/send.c
>>>>> @@ -1166,6 +1166,7 @@ struct backref_ctx {
>>>>>         int found_itself;
>>>>>  };
>>>>>
>>>>> +#if 0
>>>>>  static int __clone_root_cmp_bsearch(const void *key, const void *elt)
>>>>>  {
>>>>>         u64 root = (u64)(uintptr_t)key;
>>>>> @@ -1177,6 +1178,7 @@ static int __clone_root_cmp_bsearch(const void
>>>>> *key, const void *elt)
>>>>>                 return 1;
>>>>>         return 0;
>>>>>  }
>>>>> +#endif
>>>>>
>>>>>  static int __clone_root_cmp_sort(const void *e1, const void *e2)
>>>>>  {
>>>>> @@ -1190,6 +1192,7 @@ static int __clone_root_cmp_sort(const void *e1,
>>>>> const void *e2)
>>>>>         return 0;
>>>>>  }
>>>>>
>>>>> +#if 0
>>>>>  /*
>>>>>   * Called for every backref that is found for the current extent.
>>>>>   * Results are collected in sctx->clone_roots->ino/offset/found_refs
>>>>> @@ -1445,6 +1448,7 @@ out:
>>>>>         kfree(backref_ctx);
>>>>>         return ret;
>>>>>  }
>>>>> +#endif
>>>>>
>>>>>  static int read_symlink(struct btrfs_root *root,
>>>>>                         u64 ino,
>>>>> @@ -5291,7 +5295,6 @@ static int process_extent(struct send_ctx *sctx,
>>>>>                           struct btrfs_path *path,
>>>>>                           struct btrfs_key *key)
>>>>>  {
>>>>> -       struct clone_root *found_clone = NULL;
>>>>>         int ret = 0;
>>>>>
>>>>>         if (S_ISLNK(sctx->cur_inode_mode))
>>>>> @@ -5333,12 +5336,27 @@ static int process_extent(struct send_ctx
>>>>> *sctx,
>>>>>                 }
>>>>>         }
>>>>>
>>>>> +       /*
>>>>> +        * Current clone detection is both time and memory consuming.
>>>>> +        *
>>>>> +        * Time consuming is caused by iterating all backref of extent.
>>>>> +        * Memory consuming is caused by allocating "found_clone" every
>>>>> +        * time for a backref.
>>>>> +        *
>>>>> +        * XXX: Disabling it is never the best method, but at least it
>>>>> +        * won't cause OOM nor super long execution time.
>>>>> +        * The root fix needs to change the iteration basis, from
>>>>> iterating
>>>>> +        * file extents to iterating extents, so find_parent_nodes()
>>>>> and
>>>>> +        * backref walk should be called only once for one extent.
>>>>> +        */
>>>>> +#if 0
>>>>>         ret = find_extent_clone(sctx, path, key->objectid, key->offset,
>>>>>                         sctx->cur_inode_size, &found_clone);
>>>>>         if (ret != -ENOENT && ret < 0)
>>>>>                 goto out;
>>>>> +#endif
>>>>>
>>>>> -       ret = send_write_or_clone(sctx, path, key, found_clone);
>>>>> +       ret = send_write_or_clone(sctx, path, key, NULL);
>>>>>         if (ret)
>>>>>                 goto out;
>>>>>  out_hole:
>>>>> --
>>>>> 2.9.0
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to