On Fri, Aug 21, 2015 at 08:30:06AM +0800, Qu Wenruo wrote:
> Hi Mark,
>
> Any further comment on the reason why we should mark all
> nodes/leaves dirty during a subvolume deletion?
No, I see what you're talking about, so we'll just have to mark all nodes
on the way down the tree; not really a huge change from marking the shared
ones IMHO.

I'm currently trying to figure out how to get our deleted roots into the
list gathered by btrfs_qgroup_prepare_account_extents(). From what I can
tell, at some point btrfs_find_all_roots() will no longer be able to find
the subvolume being dropped, which results in an old_roots count that is
off by at least one. I'm pretty confident that if we can get the right
number of roots on our old_roots pointer, the math in
qgroup_update_counters() will work out for us, for all levels of qgroups.
	--Mark

>
> Thanks,
> Qu
>
> Qu Wenruo wrote on 2015/08/18 16:03 +0800:
> >
> >
> >Qu Wenruo wrote on 2015/08/18 09:42 +0800:
> >>
> >>
> >>Mark Fasheh wrote on 2015/08/17 14:13 -0700:
> >>>Hi Qu,
> >>>
> >>>Firstly, thanks for the response. I have a few new questions and
> >>>comments below.
> >>>
> >>>On Mon, Aug 17, 2015 at 09:33:54AM +0800, Qu Wenruo wrote:
> >>>>Thanks for pointing out the problem.
> >>>>But it's already known and we don't have a good idea to solve it yet.
> >>>>
> >>>>BTW, the old framework won't handle it well either.
> >>>>Liu Bo's test case (btrfs/017) can easily trigger it.
> >>>>So it's not a regression.
> >>>
> >>>I don't understand your meaning here. btrfs/017 doesn't have anything
> >>>to do with subvolume deletion. The regression I am talking about is
> >>>that deleting a subvolume in btrfs on 4.2 breaks qgroup numbers,
> >>>whereas it did not in 4.1. Btw, I do acknowledge that 017 *was* broken
> >>>before and fixed by your patches (thanks), but that has of course come
> >>>at the expense of reintroducing the subvolume regression :(
> >>
> >>Oh, it seems that my memory went wrong about the test case.
> >>
> >>But I did remember the old implementation had something wrong with
> >>subvolume deletion too.
> >>(Or maybe I'm wrong again?)
> >>>
> >>>
> >>>>Let me explain the problem a little and then introduce the fixing plan.
> >>>>
> >>>>[Qgroup for subvolume delete]
> >>>>Here are two subvolumes, sharing the whole tree.
> >>>>The whole tree consists of 8 tree blocks.
> >>>>A and B are the tree roots of subv 257 and 258 respectively.
> >>>>C~H are all shared by the two subvolumes.
> >>>>And I is one data extent which is recorded in tree block H.
> >>>>
> >>>>
> >>>>  subv 257(A)   subv 258(B)
> >>>>       |    \     /    |
> >>>>          C          D
> >>>>         / \        / \
> >>>>        E   F      G   H
> >>>>                       :
> >>>>                       :
> >>>>                       I
> >>>>
> >>>>Let's assume the tree blocks are all 4K in size (just for easy
> >>>>calculation), and I is 16K in size.
> >>>>
> >>>>Now the qgroup info should be:
> >>>>257: excl: 4K, rfer: 44K (6 shared tree blocks + 1 data extent + 1 tree root)
> >>>>258: excl: 4K, rfer: 44K
> >>>>
> >>>>If we are going to delete subvolume 258, then the expected qgroup
> >>>>numbers should be:
> >>>>257: excl 44K, rfer 44K
> >>>
> >>>Ok, this is basically explaining how we expect the numbers from a
> >>>subvolume delete to work out. Everything there looks normal to me.
> >>>
> >>>
> >>>>Which means we need to recalculate all the reference relations of
> >>>>tree blocks C~H and data extent I.
> >>>>
> >>>>[[The real problem]]
> >>>>Now the real problem is, we need to mark *ALL* metadata and data
> >>>>extents of a subvolume dirty to make subvolume deletion work.
> >>>
> >>>I don't understand why we have to mark all nodes of a subvol. The
> >>>previous implementation was able to get this correct without
> >>>recalculating every block in the tree.
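The arithmetic in the example quoted above is easy to sanity-check with a
small standalone model. To be clear, this is only an illustrative sketch and
not btrfs code: the struct, the bitmask names and the extent table are
invented for the example; it just applies the usual qgroup definitions
(rfer = bytes referenced by the subvolume, excl = bytes referenced by no
other subvolume) to the eight tree blocks and the data extent from the
diagram.

```c
/* Toy model of the qgroup numbers in the example above -- not btrfs code.
 * Each extent is described by its size and a bitmask of the subvolume
 * roots that currently reference it. */
#include <stdio.h>

#define SUBV_257 0x1u
#define SUBV_258 0x2u

struct extent {
	const char *name;
	unsigned int size_kb;
	unsigned int roots;	/* which subvolumes reference this extent */
};

static void account(const struct extent *e, int n, unsigned int subv,
		    unsigned int *rfer, unsigned int *excl)
{
	*rfer = *excl = 0;
	for (int i = 0; i < n; i++) {
		if (!(e[i].roots & subv))
			continue;		/* not referenced by subv */
		*rfer += e[i].size_kb;
		if (!(e[i].roots & ~subv))
			*excl += e[i].size_kb;	/* nobody else references it */
	}
}

int main(void)
{
	/* A..H are 4K tree blocks, I is the 16K data extent. */
	struct extent ext[] = {
		{ "A", 4, SUBV_257 },		  { "B", 4, SUBV_258 },
		{ "C", 4, SUBV_257 | SUBV_258 },  { "D", 4, SUBV_257 | SUBV_258 },
		{ "E", 4, SUBV_257 | SUBV_258 },  { "F", 4, SUBV_257 | SUBV_258 },
		{ "G", 4, SUBV_257 | SUBV_258 },  { "H", 4, SUBV_257 | SUBV_258 },
		{ "I", 16, SUBV_257 | SUBV_258 },
	};
	int n = sizeof(ext) / sizeof(ext[0]);
	unsigned int rfer, excl;

	account(ext, n, SUBV_257, &rfer, &excl);
	printf("before dropping 258: 257 rfer=%uK excl=%uK\n", rfer, excl);

	/* Drop subvolume 258: its reference disappears from every extent,
	 * which is why every shared extent has to be re-accounted. */
	for (int i = 0; i < n; i++)
		ext[i].roots &= ~SUBV_258;

	account(ext, n, SUBV_257, &rfer, &excl);
	printf("after dropping 258:  257 rfer=%uK excl=%uK\n", rfer, excl);
	return 0;
}
```

Run, it prints rfer=44K, excl=4K before the drop and rfer=44K, excl=44K
after, matching the expected numbers in the quoted mail.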
> >>>It only passed shared nodes and items through to the qgroup accounting
> >>>code. The accounting code in turn had very simple logic for working it
> >>>out: if the extent is now exclusively owned, we adjust qgroup->excl for
> >>>the (now only remaining) owner.
> >>
> >>Oh, you're right here.
> >>I'm stupid on this, as exclusive extents don't matter in this case.
> >>
> >>So no need to iterate the whole tree, but only shared extents.
> >>That's already a lot of extents we can skip.
> >
> >Oh, I forgot one case involving a higher-level qgroup.
> >
> >              qg 1/0
> >             /      \
> >   subv 257(A)      subv 258(B)
> >      |     \       /----   |
> >      C      D              J
> >     / \    / \      /------ \
> >    E   F  G   H              L
> >               :
> >               :
> >               I
> >
> >The picture is hard to see BTW...
> >Let me explain a little.
> >qgroup 257 and qgroup 258 are in the same higher-level qgroup pool, 1/0.
> >And this time, subvol 257 and subvol 258 only share part of their trees.
> >
> >       A (subvol 257)
> >          /    \
> >         C      D
> >        / \    / \
> >       E   F  G   H ..... I
> >
> >       B (subvol 258)
> >          /    \
> >         C      J
> >        / \    / \
> >       E   F  G   L
> >
> >Yes, one COW caused the A->D->H route to become the new B->J->L route in
> >subvol 258.
> >
> >And let's take a look at the qgroup numbers:
> >Still using 4K tree blocks, and I is 16K.
> >qg 257: excl: 28K  rfer: 44K
> >  meta excl: A, D, H: 3 * 4K = 12K
> >  data excl: I = 16K
> >  meta rfer: all 7 tree blocks = 28K
> >  data rfer: I = 16K
> >
> >qg 258: excl: 12K  rfer: 28K
> >  meta excl: B, J, L: 3 * 4K = 12K
> >  data excl: 0
> >  meta rfer: all 7 tree blocks = 28K
> >  data rfer: 0
> >
> >qg 1/0: excl: 56K  rfer: 56K
> >  meta excl: A~H, J, L: 10 * 4K = 40K
> >  data excl: I = 16K
> >
> >Now, if we only account shared nodes, that means only accounting the
> >C, E, F, G tree blocks.
> >These 4 tree blocks will turn from shared to excl for subvol 257.
> >That's OK, as it won't cause anything wrong in qg 257.
> >
> >But that will make qg 1/0 wrong.
> >B, J, L are now owned by nobody, so the excl of qg 1/0 should be reduced
> >by 12K.
> >But as they are not marked dirty, since they are not shared, the result
> >will be:
> >
> >qg 257: excl: 28 + 16 = 44K  rfer: 44K
> >qg 1/0: excl: 56K  rfer: 56K
> >As the related 4 tree blocks are still excl for qg 1/0, their change
> >won't cause any excl/rfer change.
> >
> >So the result for qg 1/0 is wrong. It should be excl: 44K, rfer: 44K,
> >just the same as qg 257.
> >
> >So we still need to account all the nodes/leaves to ensure higher-level
> >qgroups get a correct reading.
> >
> >
> >Sorry for not pointing this out in the previous reply, as qgroup is quite
> >complicated and I forgot this case.
> >
> >So I'm afraid we need to change the fix to mark all extents in the
> >subvolume to resolve the bug.
> >
> >Thanks,
> >Qu
> >
> >>
> >>>
> >>>If you go back to qgroup_subtree_accounting(), you will see that it
> >>>*never* changes ->rfer on a qgroup. That is because the rfer counts on
> >>>any related subvolumes don't change during subvol delete. Your example
> >>>above in fact illustrates this.
> >>>
> >>>So to be clear,
> >>>
> >>>- Why do we have to visit all tree nodes with the qgroup code given
> >>>that we only cared about shared ones before?
> >>
> >>My fault, as I didn't jump out of the box and was still using the idea
> >>of rescanning the whole tree to do it.
> >>
> >>>
> >>>
> >>>>Forgot to say, that's at transaction commit time. If we really do
> >>>>that, the commit time consumption is not acceptable if the subvolume
> >>>>is huge.
> >>>
> >>>This too is confusing to me.
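The higher-level qgroup case quoted above can be replayed with the same kind
of toy model (again, invented names, not btrfs code). A higher-level qgroup
such as 1/0 covers a set of member subvolumes, so its excl counts the bytes
referenced only from inside that set. The sketch below shows the failure Qu
describes: if only the shared blocks C, E, F, G are re-accounted when 258 is
dropped, the blocks B, J, L never leave 1/0's counters and it stays at 56K
instead of dropping to 44K.

```c
/* Toy model of the qgroup 1/0 example above -- not btrfs code. */
#include <stdio.h>

#define R257 0x1u
#define R258 0x2u

struct extent { const char *name; unsigned int size_kb; unsigned int roots; };

/* "members" is the set of subvolume roots covered by the qgroup:
 * {257} for qgroup 257, {257, 258} for the higher-level qgroup 1/0. */
static void account(const struct extent *e, int n, unsigned int members,
		    unsigned int *rfer, unsigned int *excl)
{
	*rfer = *excl = 0;
	for (int i = 0; i < n; i++) {
		if (!(e[i].roots & members))
			continue;
		*rfer += e[i].size_kb;
		if (!(e[i].roots & ~members))
			*excl += e[i].size_kb;
	}
}

static void show(const char *when, const struct extent *e, int n)
{
	unsigned int rfer, excl;

	account(e, n, R257 | R258, &rfer, &excl);
	printf("qg 1/0 %s: rfer=%uK excl=%uK\n", when, rfer, excl);
}

int main(void)
{
	struct extent ext[] = {
		{ "A", 4, R257 }, { "D", 4, R257 }, { "H", 4, R257 },
		{ "B", 4, R258 }, { "J", 4, R258 }, { "L", 4, R258 },
		{ "C", 4, R257 | R258 }, { "E", 4, R257 | R258 },
		{ "F", 4, R257 | R258 }, { "G", 4, R257 | R258 },
		{ "I", 16, R257 },	/* data extent, only reachable from 257 */
	};
	int n = sizeof(ext) / sizeof(ext[0]);

	show("before drop", ext, n);			/* 56K / 56K */

	/* Buggy drop of 258: only the shared blocks are re-accounted, so the
	 * exclusive blocks B, J, L keep a stale reference in the bookkeeping
	 * and never leave 1/0's counters. */
	for (int i = 0; i < n; i++)
		if (ext[i].roots == (R257 | R258))
			ext[i].roots &= ~R258;
	show("shared-only accounting", ext, n);		/* still 56K / 56K */

	/* Correct drop: every extent that 258 referenced is re-accounted. */
	for (int i = 0; i < n; i++)
		ext[i].roots &= ~R258;
	show("full accounting", ext, n);		/* 44K / 44K */
	return 0;
}
```

However the stale bytes are modeled, the conclusion is the same as Qu's: the
non-shared blocks of the dropped subvolume have to be visited too, or any
qgroup above level 0 can end up with a wrong excl.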
> >>>Even if we assume that qgroups will get every node now instead of just
> >>>the shared ones:
> >>>
> >>>- Drop subvol already visits every node (though it doesn't pass them
> >>>all to the qgroup code). We were doing this before just fine to my
> >>>knowledge - what changed? The commit-time qgroup accounting?
> >>
> >>Nothing, as even the old implementation does qgroup accounting at
> >>run_delayed_refs time, which is also done at commit_transaction.
> >>
> >>So that's just my meaningless worry.
> >>
> >>>
> >>>- Also, btrfs_drop_snapshot() throttles transaction commits (via
> >>>btrfs_end_transaction_throttle()), so anything done at commit time
> >>>looks like it would be broken up into smaller chunks of work for us.
> >>>
> >>>Am I mistaken in how I understand these functions?
> >>
> >>No, you're right.
> >>Overall my concern about the performance impact of commit-time qgroup
> >>accounting is meaningless now.
> >>
> >>>
> >>>
> >>>Also I must ask... How did you discover this performance issue? Are we
> >>>talking about something theoretical here or was your implementation
> >>>slow on subvolume delete?
> >>Mainly theoretical.
> >>And theoretically, the new implementation should be faster than the old
> >>one.
> >>
> >>But I'm not quite sure the old implementation was fast enough for
> >>subvolume deletion.
> >>Anyway, your fix seems no slower than the old implementation.
> >>So I'm completely OK with it now.
> >>
> >>>
> >>>
> >>>>[[Possible fix plan]]
> >>>>As you can see, the biggest problem is that the number of child
> >>>>metadata/data extents can be quite large; it's not possible to mark
> >>>>them all and account them at transaction commit time.
> >>>
> >>>... right, so handling qgroups at commit time is performance sensitive
> >>>in that too many qgroup updates will cause the commit code to do a lot
> >>>of work.
> >>>I think that actually answers one of my questions above, though my
> >>>followup would be:
> >>>
> >>>Would I be correct if I said this is a problem for any workload that
> >>>creates a large number of qgroup updates in a short period of time?
> >>I'm afraid that's true.
> >>
> >>The typical one should be file clone.
> >>And that may be one of the worst cases.
> >>
> >>As a file clone will increase a data extent's backref count, it will
> >>cause qgroup accounting even inside the same subvolume.
> >>
> >>And it won't cause the transaction to be committed, so in one
> >>transaction one can do file clone a lot of times.
> >>If all clones happen on one extent, that's OK, as qgroup only needs to
> >>do the accounting once.
> >>(BTW, the old implementation will do a lot of accounting even when
> >>cloning the same extent.)
> >>
> >>But if someone is doing file clones on different extents, that will be
> >>a disaster.
> >>
> >>Unlike normal writes, file clones won't create many dirty pages, so a
> >>transaction commit won't be triggered by the dirty page threshold.
> >>
> >>So after a lot of file cloning on different extents, the next
> >>transaction commit will take a lot of time.
> >>
> >>But that's my theoretical assumption; it may be totally wrong, just
> >>like what I thought about subvolume deletion.
> >>
> >>>
> >>>
> >>>>But the good news is, we can keep the snapshot and account it over
> >>>>several commits.
> >>>>The new extent-oriented accounting is quite accurate as it happens at
> >>>>commit time.
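To make the clone workload Qu describes above a bit more concrete, here is a
minimal userspace sketch of that kind of load: many reflink clones of
different extents issued back to back, each of which adds another extent
whose qgroup numbers the next transaction commit has to re-account. The file
paths and sizes are made up; the ioctl is FICLONERANGE, the later VFS-generic
spelling of the BTRFS_IOC_CLONE_RANGE call that existed when this thread was
written.

```c
/* Sketch of a clone-heavy workload on a qgroup-enabled btrfs filesystem.
 * Paths and sizes are invented for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
	int src = open("/mnt/btrfs/src-file", O_RDONLY);
	int dst = open("/mnt/btrfs/dst-file", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* Clone 4096 distinct 128K ranges.  Each clone bumps the backref
	 * count of a different data extent, so each one is another qgroup
	 * update queued for the next commit -- and reflinks dirty almost no
	 * pages, so nothing forces an earlier commit. */
	for (unsigned long long i = 0; i < 4096; i++) {
		struct file_clone_range arg = {
			.src_fd      = src,
			.src_offset  = i * 128 * 1024,
			.src_length  = 128 * 1024,
			.dest_offset = i * 128 * 1024,
		};

		if (ioctl(dst, FICLONERANGE, &arg) < 0) {
			perror("FICLONERANGE");
			break;
		}
	}

	close(src);
	close(dst);
	return 0;
}
```

Per the discussion above, cloning the same extent 4096 times would be cheap
to account, but 4096 different extents means 4096 separate accounting records
queued for a single commit.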
> >>>
> >>>Btw, one thing I should say is that in general I really like your
> >>>rewrite, grats on that: the code is far easier to read through and I
> >>>like very much that we've taken some of the open-coded qgroup code out
> >>>of our extent inc/dec code.
> >>
> >>It's my pleasure. :)
> >>
> >>>
> >>>
> >>>>So one possible fix is, if we want to delete a subvolume, we put the
> >>>>subvolume into an orphan state, let qgroup accounting ignore that
> >>>>qgroup from then on, and mark some extents dirty in one transaction,
> >>>>then continue marking in the next transaction until we have marked
> >>>>all extents in that subvolume dirty.
> >>>>Then free the subvolume.
> >>>>
> >>>>That's the ideal method for both snapshot creation and deletion, but
> >>>>it takes the most time.
> >>>
> >>>It also sounds like a disk format change, which would really suck to
> >>>tell users to do just so they can get accurate qgroup numbers.
> >>
> >>With all your explanation, I think your current fix is good enough, at
> >>least no worse than the old implementation.
> >>
> >>So I'm OK with it, as it fixes a regression without a major performance
> >>drop.
> >>
> >>We can improve performance later if it's really needed.
> >>
> >>>
> >>>
> >>>>[Hot fix idea]
> >>>>So far, the best and easiest method I have come up with is: if we
> >>>>find qgroup is enabled and the snapshot to delete is above level 1
> >>>>(levels start from 0), then set the QGROUP_INCONSISTENT flag to
> >>>>inform the user to do a rescan.
> >>>
> >>>This is exactly the kind of band-aid solution we wanted to avoid the
> >>>first time qgroups and subvolume handling were fixed.
> >>>	--Mark
> >>With your fix, the quick and dirty one is not needed anyway.
> >>
> >>Good job.
> >>
> >>Thanks,
> >>Qu
> >>>
> >>>--
> >>>Mark Fasheh
> >>>
-- 
Mark Fasheh
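The incremental "orphan the subvolume, then account it bit by bit" plan Qu
sketches above has a simple overall shape. The sketch below is only a model
of that shape, with invented names and a made-up per-commit budget; it is not
a proposal for real btrfs code and glosses over the hard part, persisting the
orphan state and cursor across mounts, which is presumably why it reads like
a disk format change.

```c
/* Toy model of the incremental subvolume-drop idea -- invented names,
 * not btrfs code. */
#include <stdbool.h>
#include <stdio.h>

#define EXTENTS_PER_COMMIT 256	/* made-up per-transaction budget */

struct pending_drop {
	unsigned long long next_extent;	/* cursor into the dead subvolume */
	unsigned long long nr_extents;	/* extents it still references */
};

/* Run once per transaction commit; returns true when the subvolume has
 * been fully accounted and its root can finally be freed. */
static bool drop_one_step(struct pending_drop *drop)
{
	unsigned int budget = EXTENTS_PER_COMMIT;

	while (budget-- && drop->next_extent < drop->nr_extents) {
		/* here the real thing would mark extent `next_extent`
		 * dirty so this commit's qgroup accounting sees it */
		drop->next_extent++;
	}
	return drop->next_extent >= drop->nr_extents;
}

int main(void)
{
	struct pending_drop drop = { .next_extent = 0, .nr_extents = 100000 };
	unsigned int commits = 1;

	while (!drop_one_step(&drop))
		commits++;
	printf("subvolume fully accounted after %u commits\n", commits);
	return 0;
}
```

The work per commit stays bounded, at the cost of the deleted subvolume's
space and qgroup numbers settling only after many commits.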