On Fri, Aug 21, 2015 at 08:30:06AM +0800, Qu Wenruo wrote:
> Hi Mark,
>
> Any further comment on the reason why we should mark all
> nodes/leaves dirty during a subvolume deletion?
No, I see what you're talking about, so we'll just have to mark all nodes
on the way down the tree; not really a huge change from marking the shared
ones IMHO.

I'm currently trying to figure out how to get our deleted roots into the
list gathered by btrfs_qgroup_prepare_account_extents(). From what I can
tell, at some point btrfs_find_all_roots() will no longer be able to find
the subvolume being dropped, which results in an old_roots count that is
off by at least one. I'm pretty confident that if we can get the right
number of roots on our old_roots pointer, the math in
qgroup_update_counters() will work out for us, for all levels of qgroups.
	--Mark

>
> Thanks,
> Qu
>
> Qu Wenruo wrote on 2015/08/18 16:03 +0800:
> >
> >
> >Qu Wenruo wrote on 2015/08/18 09:42 +0800:
> >>
> >>
> >>Mark Fasheh wrote on 2015/08/17 14:13 -0700:
> >>>Hi Qu,
> >>>
> >>>Firstly, thanks for the response. I have a few new questions and
> >>>comments below.
> >>>
> >>>On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
> >>>>Thanks for pointing out the problem.
> >>>>But it's already known and we don't have a good idea to solve it yet.
> >>>>
> >>>>BTW, the old framework won't handle it well either.
> >>>>Liu Bo's test case (btrfs/017) can easily trigger it.
> >>>>So it's not a regression.
> >>>
> >>>I don't understand your meaning here. btrfs/017 doesn't have anything
> >>>to do with subvolume deletion. The regression I am talking about is
> >>>that deleting a subvolume in btrfs on 4.2 breaks qgroup numbers,
> >>>whereas it did not in 4.1. Btw, I do acknowledge that 017 *was* broken
> >>>before and fixed by your patches (thanks), but that has of course come
> >>>at the expense of reintroducing the subvolume regression :(
> >>
> >>Oh, it seems that my memory went wrong about the test case.
> >>
> >>But I did remember the old implementation had something wrong with
> >>subvolume deletion too.
> >>(Or maybe I'm wrong again?)
> >>>
> >>>
> >>>>Let me explain the problem a little and then introduce the fixing plan.
> >>>>
> >>>>[Qgroup for subvolume delete]
> >>>>Here are two subvolumes, sharing the whole tree.
> >>>>The whole tree consists of 8 tree blocks.
> >>>>A and B are the tree roots of subv 257 and 258 respectively.
> >>>>C~H are all shared by the two subvolumes.
> >>>>And I is one data extent which is recorded in tree block H.
> >>>>
> >>>>
> >>>>  subv 257(A)   subv 258(B)
> >>>>       |    \     /    |
> >>>>          C          D
> >>>>         / \        / \
> >>>>        E   F      G   H
> >>>>                       :
> >>>>                       :
> >>>>                       I
> >>>>
> >>>>Let's assume the tree blocks are all 4K in size (just for easy
> >>>>calculation), and I is 16K in size.
> >>>>
> >>>>Now the qgroup info should be:
> >>>>257: excl: 4K, rfer: 44K (6 shared tree blocks + 1 data extent + 1 tree root)
> >>>>258: excl: 4K, rfer: 44K
> >>>>
> >>>>If we are going to delete subvolume 258, then the expected qgroup
> >>>>numbers should be:
> >>>>257: excl 44K, rfer 44K
> >>>
> >>>Ok, this is basically explaining how we expect the numbers from a
> >>>subvolume delete to work out. Everything there looks normal to me.
> >>>
> >>>
> >>>>Which means we need to recalculate all the reference relations of
> >>>>tree blocks C~H and data extent I.
> >>>>
> >>>>[[The real problem]]
> >>>>Now the real problem is, we need to mark *ALL* metadata and data
> >>>>extents of a subvolume dirty to make subvolume deletion work.
> >>>
> >>>I don't understand why we have to mark all nodes of a subvol. The
> >>>previous implementation was able to get this correct without
> >>>recalculating every block in the tree.
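The arithmetic in the example quoted above is easy to sanity-check with a
small standalone model. To be clear, this is only an illustrative sketch and
not btrfs code: the struct, the bitmask names and the extent table are
invented for the example; it just applies the usual qgroup definitions
(rfer = bytes referenced by the subvolume, excl = bytes referenced by no
other subvolume) to the eight tree blocks and the data extent from the
diagram.

```c
/* Toy model of the qgroup numbers in the example above -- not btrfs code.
 * Each extent is described by its size and a bitmask of the subvolume
 * roots that currently reference it. */
#include <stdio.h>

#define SUBV_257 0x1u
#define SUBV_258 0x2u

struct extent {
	const char *name;
	unsigned int size_kb;
	unsigned int roots;	/* which subvolumes reference this extent */
};

static void account(const struct extent *e, int n, unsigned int subv,
		    unsigned int *rfer, unsigned int *excl)
{
	*rfer = *excl = 0;
	for (int i = 0; i < n; i++) {
		if (!(e[i].roots & subv))
			continue;		/* not referenced by subv */
		*rfer += e[i].size_kb;
		if (!(e[i].roots & ~subv))
			*excl += e[i].size_kb;	/* nobody else references it */
	}
}

int main(void)
{
	/* A..H are 4K tree blocks, I is the 16K data extent. */
	struct extent ext[] = {
		{ "A", 4, SUBV_257 },		  { "B", 4, SUBV_258 },
		{ "C", 4, SUBV_257 | SUBV_258 },  { "D", 4, SUBV_257 | SUBV_258 },
		{ "E", 4, SUBV_257 | SUBV_258 },  { "F", 4, SUBV_257 | SUBV_258 },
		{ "G", 4, SUBV_257 | SUBV_258 },  { "H", 4, SUBV_257 | SUBV_258 },
		{ "I", 16, SUBV_257 | SUBV_258 },
	};
	int n = sizeof(ext) / sizeof(ext[0]);
	unsigned int rfer, excl;

	account(ext, n, SUBV_257, &rfer, &excl);
	printf("before dropping 258: 257 rfer=%uK excl=%uK\n", rfer, excl);

	/* Drop subvolume 258: its reference disappears from every extent,
	 * which is why every shared extent has to be re-accounted. */
	for (int i = 0; i < n; i++)
		ext[i].roots &= ~SUBV_258;

	account(ext, n, SUBV_257, &rfer, &excl);
	printf("after dropping 258:  257 rfer=%uK excl=%uK\n", rfer, excl);
	return 0;
}
```

Run, it prints rfer=44K, excl=4K before the drop and rfer=44K, excl=44K
after, matching the expected numbers in the quoted mail.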
> >>>It only passed shared nodes and items through to the qgroup accounting
> >>>code. The accounting code in turn had very simple logic for working it
> >>>out: if the extent is now exclusively owned, we adjust qgroup->excl for
> >>>the (now only remaining) owner.
> >>
> >>Oh, you're right here.
> >>I'm stupid on this, as exclusive extents don't matter in this case.
> >>
> >>So no need to iterate the whole tree, but only shared extents.
> >>That's already a lot of extents we can skip.
> >
> >Oh, I forgot one case involving a higher-level qgroup.
> >
> >              qg 1/0
> >             /      \
> >   subv 257(A)      subv 258(B)
> >      |     \       /----   |
> >      C      D              J
> >     / \    / \      /------ \
> >    E   F  G   H              L
> >               :
> >               :
> >               I
> >
> >The picture is hard to see BTW...
> >Let me explain a little.
> >qgroup 257 and qgroup 258 are in the same higher-level qgroup pool, 1/0.
> >And this time, subvol 257 and subvol 258 only share part of their trees.
> >
> >       A (subvol 257)
> >          /    \
> >         C      D
> >        / \    / \
> >       E   F  G   H ..... I
> >
> >       B (subvol 258)
> >          /    \
> >         C      J
> >        / \    / \
> >       E   F  G   L
> >
> >Yes, one COW caused the A->D->H route to become the new B->J->L route in
> >subvol 258.
> >
> >And let's take a look at the qgroup numbers:
> >Still using 4K tree blocks, and I is 16K.
> >qg 257: excl: 28K  rfer: 44K
> >  meta excl: A, D, H: 3 * 4K = 12K
> >  data excl: I = 16K
> >  meta rfer: all 7 tree blocks = 28K
> >  data rfer: I = 16K
> >
> >qg 258: excl: 12K  rfer: 28K
> >  meta excl: B, J, L: 3 * 4K = 12K
> >  data excl: 0
> >  meta rfer: all 7 tree blocks = 28K
> >  data rfer: 0
> >
> >qg 1/0: excl: 56K  rfer: 56K
> >  meta excl: A~H, J, L: 10 * 4K = 40K
> >  data excl: I = 16K
> >
> >Now, if we only account shared nodes, that means only accounting the
> >C, E, F, G tree blocks.
> >These 4 tree blocks will turn from shared to excl for subvol 257.
> >That's OK, as it won't cause anything wrong in qg 257.
> >
> >But that will make qg 1/0 wrong.
> >B, J, L are now owned by nobody, so the excl of qg 1/0 should be reduced
> >by 12K.
> >But as they are not marked dirty, since they are not shared, the result
> >will be:
> >
> >qg 257: excl: 28 + 16 = 44K  rfer: 44K
> >qg 1/0: excl: 56K  rfer: 56K
> >As the related 4 tree blocks are still excl for qg 1/0, their change
> >won't cause any excl/rfer change.
> >
> >So the result for qg 1/0 is wrong. It should be excl: 44K, rfer: 44K,
> >just the same as qg 257.
> >
> >So we still need to account all the nodes/leaves to ensure higher-level
> >qgroups get a correct reading.
> >
> >
> >Sorry for not pointing this out in the previous reply, as qgroup is quite
> >complicated and I forgot this case.
> >
> >So I'm afraid we need to change the fix to mark all extents in the
> >subvolume to resolve the bug.
> >
> >Thanks,
> >Qu
> >
> >>
> >>>
> >>>If you go back to qgroup_subtree_accounting(), you will see that it
> >>>*never* changes ->rfer on a qgroup. That is because the rfer counts on
> >>>any related subvolumes don't change during subvol delete. Your example
> >>>above in fact illustrates this.
> >>>
> >>>So to be clear,
> >>>
> >>>- Why do we have to visit all tree nodes with the qgroup code given
> >>>that we only cared about shared ones before?
> >>
> >>My fault, as I didn't jump out of the box and was still using the idea
> >>of rescanning the whole tree to do it.
> >>
> >>>
> >>>
> >>>>Forgot to say, that's at transaction commit time. If we really do
> >>>>that, the commit time consumption is not acceptable if the subvolume
> >>>>is huge.
> >>>
> >>>This too is confusing to me.
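The higher-level qgroup case quoted above can be replayed with the same kind
of toy model (again, invented names, not btrfs code). A higher-level qgroup
such as 1/0 covers a set of member subvolumes, so its excl counts the bytes
referenced only from inside that set. The sketch below shows the failure Qu
describes: if only the shared blocks C, E, F, G are re-accounted when 258 is
dropped, the blocks B, J, L never leave 1/0's counters and it stays at 56K
instead of dropping to 44K.

```c
/* Toy model of the qgroup 1/0 example above -- not btrfs code. */
#include <stdio.h>

#define R257 0x1u
#define R258 0x2u

struct extent { const char *name; unsigned int size_kb; unsigned int roots; };

/* "members" is the set of subvolume roots covered by the qgroup:
 * {257} for qgroup 257, {257, 258} for the higher-level qgroup 1/0. */
static void account(const struct extent *e, int n, unsigned int members,
		    unsigned int *rfer, unsigned int *excl)
{
	*rfer = *excl = 0;
	for (int i = 0; i < n; i++) {
		if (!(e[i].roots & members))
			continue;
		*rfer += e[i].size_kb;
		if (!(e[i].roots & ~members))
			*excl += e[i].size_kb;
	}
}

static void show(const char *when, const struct extent *e, int n)
{
	unsigned int rfer, excl;

	account(e, n, R257 | R258, &rfer, &excl);
	printf("qg 1/0 %s: rfer=%uK excl=%uK\n", when, rfer, excl);
}

int main(void)
{
	struct extent ext[] = {
		{ "A", 4, R257 }, { "D", 4, R257 }, { "H", 4, R257 },
		{ "B", 4, R258 }, { "J", 4, R258 }, { "L", 4, R258 },
		{ "C", 4, R257 | R258 }, { "E", 4, R257 | R258 },
		{ "F", 4, R257 | R258 }, { "G", 4, R257 | R258 },
		{ "I", 16, R257 },	/* data extent, only reachable from 257 */
	};
	int n = sizeof(ext) / sizeof(ext[0]);

	show("before drop", ext, n);			/* 56K / 56K */

	/* Buggy drop of 258: only the shared blocks are re-accounted, so the
	 * exclusive blocks B, J, L keep a stale reference in the bookkeeping
	 * and never leave 1/0's counters. */
	for (int i = 0; i < n; i++)
		if (ext[i].roots == (R257 | R258))
			ext[i].roots &= ~R258;
	show("shared-only accounting", ext, n);		/* still 56K / 56K */

	/* Correct drop: every extent that 258 referenced is re-accounted. */
	for (int i = 0; i < n; i++)
		ext[i].roots &= ~R258;
	show("full accounting", ext, n);		/* 44K / 44K */
	return 0;
}
```

However the stale bytes are modeled, the conclusion is the same as Qu's: the
non-shared blocks of the dropped subvolume have to be visited too, or any
qgroup above level 0 can end up with a wrong excl.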
> >>>Even if we assume that qgroups will get every node now instead of just
> >>>the shared ones:
> >>>
> >>>- Drop subvol already visits every node (though it doesn't pass them
> >>>all to the qgroup code). We were doing this before just fine to my
> >>>knowledge - what changed? The commit-time qgroup accounting?
> >>
> >>Nothing, as even the old implementation does qgroup accounting at
> >>run_delayed_refs time, which is also done at commit_transaction.
> >>
> >>So that's just my meaningless worry.
> >>
> >>>
> >>>- Also, btrfs_drop_snapshot() throttles transaction commits (via
> >>>btrfs_end_transaction_throttle()), so anything done at commit time
> >>>looks like it would be broken up into smaller chunks of work for us.
> >>>
> >>>Am I mistaken in how I understand these functions?
> >>
> >>No, you're right.
> >>Overall my concern about the performance impact of commit-time qgroup
> >>accounting is meaningless now.
> >>
> >>>
> >>>
> >>>Also I must ask... How did you discover this performance issue? Are we
> >>>talking about something theoretical here or was your implementation
> >>>slow on subvolume delete?
> >>Mainly theoretical.
> >>And theoretically, the new implementation should be faster than the old
> >>one.
> >>
> >>But I'm not quite sure the old implementation was fast enough for
> >>subvolume deletion.
> >>Anyway, your fix seems no slower than the old implementation.
> >>So I'm completely OK with it now.
> >>
> >>>
> >>>
> >>>>[[Possible fix plan]]
> >>>>As you can see, the biggest problem is that the number of child
> >>>>metadata/data extents can be quite large; it's not possible to mark
> >>>>them all and account them at transaction commit time.
> >>>
> >>>... right, so handling qgroups at commit time is performance sensitive
> >>>in that too many qgroup updates will cause the commit code to do a lot
> >>>of work.
> >>>I think that actually answers one of my questions above, though my
> >>>followup would be:
> >>>
> >>>Would I be correct if I said this is a problem for any workload that
> >>>creates a large number of qgroup updates in a short period of time?
> >>I'm afraid that's true.
> >>
> >>The typical one should be file clone.
> >>And that may be one of the worst cases.
> >>
> >>As a file clone will increase a data extent's backref count, it will
> >>cause qgroup accounting even inside the same subvolume.
> >>
> >>And it won't cause the transaction to be committed, so in one
> >>transaction one can do file clone a lot of times.
> >>If all clones happen on one extent, that's OK, as qgroup only needs to
> >>do the accounting once.
> >>(BTW, the old implementation will do a lot of accounting even when
> >>cloning the same extent.)
> >>
> >>But if someone is doing file clones on different extents, that will be
> >>a disaster.
> >>
> >>Unlike normal writes, file clones won't create many dirty pages, so a
> >>transaction commit won't be triggered by the dirty page threshold.
> >>
> >>So after a lot of file cloning on different extents, the next
> >>transaction commit will take a lot of time.
> >>
> >>But that's my theoretical assumption; it may be totally wrong, just
> >>like what I thought about subvolume deletion.
> >>
> >>>
> >>>
> >>>>But the good news is, we can keep the snapshot and account it over
> >>>>several commits.
> >>>>The new extent-oriented accounting is quite accurate as it happens at
> >>>>commit time.
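To make the clone workload Qu describes above a bit more concrete, here is a
minimal userspace sketch of that kind of load: many reflink clones of
different extents issued back to back, each of which adds another extent
whose qgroup numbers the next transaction commit has to re-account. The file
paths and sizes are made up; the ioctl is FICLONERANGE, the later VFS-generic
spelling of the BTRFS_IOC_CLONE_RANGE call that existed when this thread was
written.

```c
/* Sketch of a clone-heavy workload on a qgroup-enabled btrfs filesystem.
 * Paths and sizes are invented for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int src = open("/mnt/btrfs/src-file", O_RDONLY);
	int dst = open("/mnt/btrfs/dst-file", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Clone 4096 distinct 128K ranges.  Each clone bumps the backref
	 * count of a different data extent, so each one is another qgroup
	 * update queued for the next commit -- and reflinks dirty almost no
	 * pages, so nothing forces an earlier commit. */
	for (unsigned long long i = 0; i < 4096; i++) {
		struct file_clone_range arg = {
			.src_fd      = src,
			.src_offset  = i * 128 * 1024,
			.src_length  = 128 * 1024,
			.dest_offset = i * 128 * 1024,
		};

		if (ioctl(dst, FICLONERANGE, &arg) < 0) {
			perror("FICLONERANGE");
			break;
		}
	}

	close(src);
	close(dst);
	return 0;
}
```

Per the discussion above, cloning the same extent 4096 times would be cheap
to account, but 4096 different extents means 4096 separate accounting records
queued for a single commit.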
> >>>
> >>>Btw, one thing I should say is that in general I really like your
> >>>rewrite, grats on that: the code is far easier to read through and I
> >>>like very much that we've taken some of the open-coded qgroup code out
> >>>of our extent inc/dec code.
> >>
> >>It's my pleasure. :)
> >>
> >>>
> >>>
> >>>>So one possible fix is, if we want to delete a subvolume, we put the
> >>>>subvolume into an orphan state, let qgroup accounting ignore that
> >>>>qgroup from then on, and mark some extents dirty in one transaction,
> >>>>then continue marking in the next transaction until we have marked
> >>>>all extents in that subvolume dirty.
> >>>>Then free the subvolume.
> >>>>
> >>>>That's the ideal method for both snapshot creation and deletion, but
> >>>>it takes the most time.
> >>>
> >>>It also sounds like a disk format change, which would really suck to
> >>>tell users to do just so they can get accurate qgroup numbers.
> >>
> >>With all your explanation, I think your current fix is good enough, at
> >>least no worse than the old implementation.
> >>
> >>So I'm OK with it, as it fixes a regression without a major performance
> >>drop.
> >>
> >>We can improve performance later if it's really needed.
> >>
> >>>
> >>>
> >>>>[Hot fix idea]
> >>>>So far, the best and easiest method I have come up with is: if we
> >>>>find qgroup is enabled and the snapshot to delete is above level 1
> >>>>(levels start from 0), then set the QGROUP_INCONSISTENT flag to
> >>>>inform the user to do a rescan.
> >>>
> >>>This is exactly the kind of band-aid solution we wanted to avoid the
> >>>first time qgroups and subvolume handling were fixed.
> >>>	--Mark
> >>With your fix, the quick and dirty one is not needed anyway.
> >>
> >>Good job.
> >>
> >>Thanks,
> >>Qu
> >>>
> >>>--
> >>>Mark Fasheh
> >>>
-- 
Mark Fasheh
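The incremental "orphan the subvolume, then account it bit by bit" plan Qu
sketches above has a simple overall shape. The sketch below is only a model
of that shape, with invented names and a made-up per-commit budget; it is not
a proposal for real btrfs code and glosses over the hard part, persisting the
orphan state and cursor across mounts, which is presumably why it reads like
a disk format change.

```c
/* Toy model of the incremental subvolume-drop idea -- invented names,
 * not btrfs code. */
#include <stdbool.h>
#include <stdio.h>

#define EXTENTS_PER_COMMIT 256	/* made-up per-transaction budget */

struct pending_drop {
	unsigned long long next_extent;	/* cursor into the dead subvolume */
	unsigned long long nr_extents;	/* extents it still references */
};

/* Run once per transaction commit; returns true when the subvolume has
 * been fully accounted and its root can finally be freed. */
static bool drop_one_step(struct pending_drop *drop)
{
	unsigned int budget = EXTENTS_PER_COMMIT;

	while (budget-- && drop->next_extent < drop->nr_extents) {
		/* here the real thing would mark extent `next_extent`
		 * dirty so this commit's qgroup accounting sees it */
		drop->next_extent++;
	}
	return drop->next_extent >= drop->nr_extents;
}

int main(void)
{
	struct pending_drop drop = { .next_extent = 0, .nr_extents = 100000 };
	unsigned int commits = 1;

	while (!drop_one_step(&drop))
		commits++;
	printf("subvolume fully accounted after %u commits\n", commits);
	return 0;
}
```

The work per commit stays bounded, at the cost of the deleted subvolume's
space and qgroup numbers settling only after many commits.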