Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Thu, Jul 28, 2016 at 02:18:29PM -0700, Junio C Hamano wrote: > Kirill Smelkov writes: > > > I'm waiting so long for main patch to be at least queued to pu, that I'm > > now a bit frustrated and ready to do something not related to main goal :) > > Perhaps the first step would be to stop putting multiple patches in > a single e-mail buried after a few pages of discussion. I will not > even find that there _are_ multiple patches in the message if I am > not involved directly in the discussion, and the discussion is still > ongoing, because it is likely that I'd skim just a few paragraphs at > the top before going on to other messages. > > I won't touch the message I am responding to, as your -- 8< -- cut > mark does not even seem to be a reliable marker between patches > (i.e. I see something like this that is clearly not a message > boundary: > > than `git pack-objects file.pack`. Extracting erp5.git pack from > lab.nexedi.com backup repository: > > 8< > $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack > > real0m22.309s > ... > ) Ok, makes sense and my fault. I'm resending each patch as separate message in reply to this mail. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
Kirill Smelkov writes: > I'm waiting so long for main patch to be at least queued to pu, that I'm > now a bit frustrated and ready to do something not related to main goal :) Perhaps the first step would be to stop putting multiple patches in a single e-mail buried after a few pages of discussion. I will not even find that there _are_ multiple patches in the message if I am not involved directly in the discussion, and the discussion is still ongoing, because it is likely that I'd skim just a few paragraphs at the top before going on to other messages. I won't touch the message I am responding to, as your -- 8< -- cut mark does not even seem to be a reliable marker between patches (i.e. I see something like this that is clearly not a message boundary: than `git pack-objects file.pack`. Extracting erp5.git pack from lab.nexedi.com backup repository: 8< $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack real0m22.309s ... ) -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
Junio, first of all thanks for feedback, On Wed, Jul 27, 2016 at 01:40:36PM -0700, Junio C Hamano wrote: > Kirill Smelkov writes: > > > > From: Kirill Smelkov > > Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active > > under > > --local or --honor-pack-keep > > > > Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there > > are two codepaths in pack-objects: with & without using bitmap > > reachability index. > > > > However add_object_entry_from_bitmap(), despite its non-bitmapped > > counterpart add_object_entry(), in no way does check for whether --local > > or --honor-pack-keep should be respected. In non-bitmapped codepath this > > is handled in want_object_in_pack(), but bitmapped codepath has simply > > no such checking at all. > > > > The bitmapped codepath however was allowing to pass --local and > > --honor-pack-keep and bitmap indices were still used under such > > conditions - potentially giving wrong output (including objects from > > non-local or .keep'ed pack). > > > > Instead of fixing bitmapped codepath to respect those options, since > > currently no one actually need or use them in combination with bitmaps, > > let's just force use_bitmap_index=0 when any of --local or > > --honor-pack-keep are used and add appropriate comment about > > not-checking for those in add_object_entry_from_bitmap() > > > > Suggested-by: Jeff King > > --- > > builtin/pack-objects.c | 15 +++ > > 1 file changed, 15 insertions(+) > > > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > > index 15866d7..d7cf782 100644 > > --- a/builtin/pack-objects.c > > +++ b/builtin/pack-objects.c > > @@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const > > unsigned char *sha1, > > if (have_duplicate_entry(sha1, 0, &index_pos)) > > return 0; > > > > + /* > > +* for simplicity we always want object to be in pack, as > > +* use_bitmap_index codepath assumes neither --local nor > > --honor-pack-keep > > +* is active. > > +*/ > > I am not sure this comment is useful to readers. > > Unless the readers are comparing add_object_entry() and this > function and wondering why this side lacks a check here, iow, when > they are merely following from a caller of this function through > this function down to its callee to understand what goes on, this > comment would not help them and only confuse them. > > If we were to say something to help those who are comparing these > two functions, I think we should be more explicit, i.e. > > The caller disables use-bitmap-index when --local or > --honor-pack-keep options are in effect because bitmap code is > not prepared to handle them. Because the control does not reach > here if these options are in effect, the check with > want_object_in_pack() to skip objects is not done. > > or something like that. You are probably right. > Or is the rest of the bitmap codepath prepared to handle these > options and it is just the matter of adding the missing check with > want_object_in_pack() here to make it work correctly? I'm waiting so long for main patch to be at least queued to pu, that I'm now a bit frustrated and ready to do something not related to main goal :) (they say every joke contains part of a joke). Here is something from sleepy me: 8< From: Kirill Smelkov Date: Wed, 27 Jul 2016 22:18:04 +0300 Subject: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to respect --local, --honor-pack-keep and --incremental Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there are two codepaths in pack-objects: with & without using bitmap reachability index. However add_object_entry_from_bitmap(), despite its non-bitmapped counterpart add_object_entry(), in no way does check for whether --local or --honor-pack-keep or --incremental should be respected. In non-bitmapped codepath this is handled in want_object_in_pack(), but bitmapped codepath has simply no such checking at all. The bitmapped codepath however was allowing to pass in all those options and with bitmap indices still being used under such conditions - potentially giving wrong output (e.g. including objects from non-local or .keep'ed pack). We can easily fix this by noting the following: when an object comes to add_object_entry_from_bitmap() it can come for two reasons: 1. entries coming from main pack covered by bitmap index, and 2. object coming from, possibly alternate, loose or other packs. For "2" we always have pack not yet found by bitmap traversal code, and thus we can simply reuse non-bitmapped want_object_in_pack() to find in which pack an object lives and also for taking omitting decision. For "1" we always have pack already found by bitmap traversal code and we only need to check that pack for same omission criteria used in want_object_in_pack() for found_pack. Suggested-by: Junio C Hamano Discussed-with: Jeff King --- builtin/pack
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
Kirill Smelkov writes: > > From: Kirill Smelkov > Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active > under > --local or --honor-pack-keep > > Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there > are two codepaths in pack-objects: with & without using bitmap > reachability index. > > However add_object_entry_from_bitmap(), despite its non-bitmapped > counterpart add_object_entry(), in no way does check for whether --local > or --honor-pack-keep should be respected. In non-bitmapped codepath this > is handled in want_object_in_pack(), but bitmapped codepath has simply > no such checking at all. > > The bitmapped codepath however was allowing to pass --local and > --honor-pack-keep and bitmap indices were still used under such > conditions - potentially giving wrong output (including objects from > non-local or .keep'ed pack). > > Instead of fixing bitmapped codepath to respect those options, since > currently no one actually need or use them in combination with bitmaps, > let's just force use_bitmap_index=0 when any of --local or > --honor-pack-keep are used and add appropriate comment about > not-checking for those in add_object_entry_from_bitmap() > > Suggested-by: Jeff King > --- > builtin/pack-objects.c | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c > index 15866d7..d7cf782 100644 > --- a/builtin/pack-objects.c > +++ b/builtin/pack-objects.c > @@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const unsigned > char *sha1, > if (have_duplicate_entry(sha1, 0, &index_pos)) > return 0; > > + /* > + * for simplicity we always want object to be in pack, as > + * use_bitmap_index codepath assumes neither --local nor > --honor-pack-keep > + * is active. > + */ I am not sure this comment is useful to readers. Unless the readers are comparing add_object_entry() and this function and wondering why this side lacks a check here, iow, when they are merely following from a caller of this function through this function down to its callee to understand what goes on, this comment would not help them and only confuse them. If we were to say something to help those who are comparing these two functions, I think we should be more explicit, i.e. The caller disables use-bitmap-index when --local or --honor-pack-keep options are in effect because bitmap code is not prepared to handle them. Because the control does not reach here if these options are in effect, the check with want_object_in_pack() to skip objects is not done. or something like that. Or is the rest of the bitmap codepath prepared to handle these options and it is just the matter of adding the missing check with want_object_in_pack() here to make it work correctly? > create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, > offset); > > display_progress(progress_state, nr_result); > @@ -2776,6 +2782,15 @@ int cmd_pack_objects(int argc, const char **argv, > const char *prefix) > if (!use_internal_rev_list || !pack_to_stdout || > is_repository_shallow()) > use_bitmap_index = 0; > > + /* > + * "lazy" reasons not to use bitmaps; it is easier to reason about when > + * neither --local nor --honor-pack-keep is in action, and so far no one > + * needed nor implemented such support yet. > + */ Justifying comment like this is a good idea, but the comment above does not make it very clear that this is a correctness fix, i.e. if we do not disable, the code will do a wrong thing. The other logic to disable use of bitmap we can see in the pre-context would also benefit from some description as to why; 6b8fda2d (pack-objects: use bitmaps when packing objects, 2013-12-21) didn't do a very good job in that---the reason is not clear in its log message, either. > + if (local || ignore_packed_keep) > + use_bitmap_index = 0; > + > + I see one extra blank line here ;-) > if (pack_to_stdout || !rev_list_all) > write_bitmap_index = 0; Thanks. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote: > On Wed, Jul 13, 2016 at 01:52:17PM +0300, Kirill Smelkov wrote: > > > > So I think if you were to repeatedly "git repack -adb" over time, you > > > would get worse and worse ordering as objects are added to the > > > repository. > > > > Jeff, first of all thanks for clarifying. > > > > So it is not-yet-packed-objects which make packing with bitmap less > > efficient. I was originally keeping in mind fresh repacked repository > > with just built bitmap index and for that case extracting pack with > > bitmap index seems to be just ok, but the more not-yet-packed objects we > > have the worse the result can be. > > Right. So I think your scheme is fine as long as you are doing your > regular "pack all into one" repacks with a real walk, and then > "branching" off of that with one-off bitmap-computed packs into files > (even if you later take a bunch of those files and pull them into a > single bitmapped, as long as that final "all into one" does the walk). > > Or I guess another way to think about it would be that if you're > computing bitmaps, you'd want to do the actual traversal. Yes, exactly, and thanks for stating it clearly. We are doing repacks and recomputing bitmaps doing the real walk. As you say this should be fine. > > Yes, it also make sense. I saw write_reused_pack() in upstream git just > > copy raw bytes from original to destination pack. You mentioned you have > > something better for pack reuse - in your patch queue, in two words, is > > it now reusing pack based on object, not raw bytes, or is it something > > else? > > > > In other words in which way it works better? (I'm just curious here as > > it is interesting to know) > > The problem with the existing pack-reuse code is that it doesn't kick in > often enough. I think it looks to see that the client wants some > percentage of the pack (e.g., 90%), and then just sends the whole > beginning. This works especially badly if you have a bunch of related > repositories packed together (e.g., all of the forks of torvalds/linux > on GitHub), because you'll never hit 90% of that big pack; it has too > much unrelated cruft, even if most of the stuff you want _is_ at the > beginning. And "percent of pack" is not really a useful metric anyway. > > So the better scheme is more like: > > 1. Generate the bitmap of objects to send using reachability bitmaps. > > 2. Do a quick scan of their content in the packfile to see which can > be reused verbatim. If they're base objects, we can send them > as-is. If they're deltas, we can send them if their base is going > to be sent. This fills in another bitmap of "reusable" objects. > > After a long string of unusable objects, you can give up and set > the rest of the bitmap to zeroes. > > 3. Walk the "reuse" bitmap and send out the objects more-or-less > verbatim. You do have make adjustments to delta-base-offsets for > any "holes" (so if an object's entry says "my base is 500 bytes > back", but you omitted some objects in between, you have to adjust > that offset). > > The upside is that you can send out those objects without even making a > "struct object_entry" for them, which drastically reduces the memory > requirements for serving a clone. Any objects which didn't get marked > for reuse just get handled in the usual way (so stuff that was not close > by in the pack, or stuff that was pushed since your last big repack). Thanks for clarifying. Yes, you are right, current upstream code checks to see whether >= 90% of pack is what destination wants and only reuse in such case. (I forgot about it, initially putting reuse at side in my head as "not applicable to git-backup" because of that >= 90% reason). So with the scheme you are drawing above it can be indeed more efficient, and applicable to both torvalds/linux+forks and git-backup case (extracting packs from big pack of all repos). I'm looking forward to your patches on this topic. Please cc me on those if you find it convenient. > The downside is that because those objects aren't in our normal packing > list, they're not available as delta bases for the new objects we _do_ > send. So it can make the resulting pack a little bit bigger. So once again, the badness effect is the more, the more we have such "new" objects not in original main pack - i.e. as loose objects or objects living in other smaller packs. The badness comes to zero in ideal case of freshly repacked repo with only one big pack. Also: after sending reused object, with more code effort, in principle we can hook reused object for being considered as delta-bases for new objects. I mean this should not be impossible in principle, or am I missing something? > > Yes, for packing it is only hash which is used. And I assume name-hash > > for bitmap is not enabled by default for compatibility with JGit code. > > > > It would make sense to me to eventually enable na
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote: > > @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const > > unsigned char *sha1, > > { > > uint32_t index_pos; > > > > + if (local && has_loose_object_nonlocal(sha1)) > > + return 0; > > + > > if (have_duplicate_entry(sha1, 0, &index_pos)) > > return 0; > > Hrm. Adding entries from the bitmap should ideally be very fast, but > here we're introducing extra lookups in the object database. I guess it > only kicks in when --local is given, though, which most bitmap-using > paths would not do. > > But is this check enough? The non-bitmap code path calls > want_object_in_pack, which checks not only loose objects, but also > non-local packs, and .keep. > > Those don't kick in for your use case. I wonder if we should simply have > something like: > > if (local || ignore_packed_keep) > use_bitmap_index = 0; > > and just skip bitmaps for those cases. That's easy to reason about, and > I don't think anybody would care (your use case does not, and the repack > use case is already not going to use bitmaps). BTW, I thought we had more optimizations in this area, but I realized that I had never sent them to the list. I just did, and you may want to take a peek at: http://thread.gmane.org/gmane.comp.version-control.git/300218 I doubt it will speed up your case much (unless you really do have tons of packs in your extraction). And I think it is still worth doing disabling I showed above, even with the optimizations, just because it's easier to reason about. So I _think_ those optimizations are orthogonal to what we're discussing here, but I wanted to point you at them just in case. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Wed, Jul 13, 2016 at 01:52:17PM +0300, Kirill Smelkov wrote: > > So I think if you were to repeatedly "git repack -adb" over time, you > > would get worse and worse ordering as objects are added to the > > repository. > > Jeff, first of all thanks for clarifying. > > So it is not-yet-packed-objects which make packing with bitmap less > efficient. I was originally keeping in mind fresh repacked repository > with just built bitmap index and for that case extracting pack with > bitmap index seems to be just ok, but the more not-yet-packed objects we > have the worse the result can be. Right. So I think your scheme is fine as long as you are doing your regular "pack all into one" repacks with a real walk, and then "branching" off of that with one-off bitmap-computed packs into files (even if you later take a bunch of those files and pull them into a single bitmapped, as long as that final "all into one" does the walk). Or I guess another way to think about it would be that if you're computing bitmaps, you'd want to do the actual traversal. > Yes, it also make sense. I saw write_reused_pack() in upstream git just > copy raw bytes from original to destination pack. You mentioned you have > something better for pack reuse - in your patch queue, in two words, is > it now reusing pack based on object, not raw bytes, or is it something > else? > > In other words in which way it works better? (I'm just curious here as > it is interesting to know) The problem with the existing pack-reuse code is that it doesn't kick in often enough. I think it looks to see that the client wants some percentage of the pack (e.g., 90%), and then just sends the whole beginning. This works especially badly if you have a bunch of related repositories packed together (e.g., all of the forks of torvalds/linux on GitHub), because you'll never hit 90% of that big pack; it has too much unrelated cruft, even if most of the stuff you want _is_ at the beginning. And "percent of pack" is not really a useful metric anyway. So the better scheme is more like: 1. Generate the bitmap of objects to send using reachability bitmaps. 2. Do a quick scan of their content in the packfile to see which can be reused verbatim. If they're base objects, we can send them as-is. If they're deltas, we can send them if their base is going to be sent. This fills in another bitmap of "reusable" objects. After a long string of unusable objects, you can give up and set the rest of the bitmap to zeroes. 3. Walk the "reuse" bitmap and send out the objects more-or-less verbatim. You do have make adjustments to delta-base-offsets for any "holes" (so if an object's entry says "my base is 500 bytes back", but you omitted some objects in between, you have to adjust that offset). The upside is that you can send out those objects without even making a "struct object_entry" for them, which drastically reduces the memory requirements for serving a clone. Any objects which didn't get marked for reuse just get handled in the usual way (so stuff that was not close by in the pack, or stuff that was pushed since your last big repack). The downside is that because those objects aren't in our normal packing list, they're not available as delta bases for the new objects we _do_ send. So it can make the resulting pack a little bit bigger. > Yes, for packing it is only hash which is used. And I assume name-hash > for bitmap is not enabled by default for compatibility with JGit code. > > It would make sense to me to eventually enable name-hash bitmap > extension by default, as packing result is much better with it. And > those who care about compatibility with JGit can just turn it off in > their git config. Correct, the defaults are for JGit compatibility. If you are not using JGit, you should have it on all the time. We went with the conservative default, but as more people using regular Git bitmaps, it would probably be good to make them less arcane and confusing to use. > > As I understand your use case, it is OK to do the less careful things. > > It's just that pack-objects until now has been split into two modes: > > packing to a file is careful, and packing to stdout is less so. And you > > want to pack to a file in the non-careful mode. > > Yes, it should be ok, as after repository extraction git-backup > verifies rev-list for all refs > > https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L855 > > And if an object is missing - e.g. a blob - rev-list complains: > > fatal: missing blob object '980a0d5f19a64b4b30a87d4206aade58726b60e3' > > though it does not catch blob corruptions. Right, that makes sense. Even the pack-to-disk code invoked by git-repack is not foolproof for blob corruptions. It is only checking a crc, not the full sha1. So it's better than nothing, but not as careful as a full index-pack. > Probably pack.useBitmaps is of no use in normal situation, but for > debugging
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Tue, Jul 19, 2016 at 05:29:07AM -0600, Jeff King wrote: > On Sun, Jul 17, 2016 at 08:06:49PM +0300, Kirill Smelkov wrote: > > > > Anyway, please find below updated patch according to your suggestion. > > > Hope it is ok now. > > > > Ping. Is the patch ok or something needs to be improved still? > > Sorry, I'm traveling and haven't carefully reviewed it yet. It's still > on my list, but it may be a few days. Jeff thanks for feedback. Have a good traveling and good to know the patch was not forgotten. I will be waiting for the time while you are on trip. Thanks again for feedback, Kirill -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Sun, Jul 17, 2016 at 08:06:49PM +0300, Kirill Smelkov wrote: > > Anyway, please find below updated patch according to your suggestion. > > Hope it is ok now. > > Ping. Is the patch ok or something needs to be improved still? Sorry, I'm traveling and haven't carefully reviewed it yet. It's still on my list, but it may be a few days. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Wed, Jul 13, 2016 at 01:52:16PM +0300, Kirill Smelkov wrote: > On Wed, Jul 13, 2016 at 04:26:53AM -0400, Jeff King wrote: > > On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote: > > > > > > - we will not compute the same write order (which is based on > > > > traversal order), leading to packs that have less efficient cache > > > > characteristics > > > > > > I agree the order can be not exactly the same. Still if original pack is > > > packed well (with good recency order), while using bitmap we will tend > > > to traverse it in close to original order. > > > > > > Maybe I'm not completely right on this, but to me it looks to be the > > > case because if objects in original pack are put there linearly sorted > > > by recency order, and we use bitmap index to set of all reachable > > > objects from a root, and then just _linearly_ gather all those objects > > > from original pack by 1s in bitmap and put them in the same order into > > > destination pack, the recency order won't be broken. > > > > > > Or am I maybe misunderstanding something? > > > > Yeah, I think you can go some of the way by reusing the order from the > > old pack. But keep in mind that the bitmap result may also contain > > objects that are not yet packed. Those will just come in a big lump at > > the end of the bitmap (these are the "extended entries" in the bitmap > > code). > > > > So I think if you were to repeatedly "git repack -adb" over time, you > > would get worse and worse ordering as objects are added to the > > repository. > > Jeff, first of all thanks for clarifying. > > So it is not-yet-packed-objects which make packing with bitmap less > efficient. I was originally keeping in mind fresh repacked repository > with just built bitmap index and for that case extracting pack with > bitmap index seems to be just ok, but the more not-yet-packed objects we > have the worse the result can be. > > > As an aside, two other things that pack order matters for: it makes the > > bitmaps themselves compress better (because it increases locality of > > reachability, so you get nice runs of "1" or "0" bits). > > Yes I agree and thanks for bringing this up - putting objects in recency > order in pack also makes bitmap index to have larger runs of same 1 or 0. > > > It also makes > > the pack-reuse code more efficient (since in an ideal case, you can just > > dump a big block of data from the front of the pack). Note that the > > pack-reuse code that's in upstream git isn't that great; I have a better > > system on my big pile of patches to send upstream (that never seems to > > get smaller; ). > > Yes, it also make sense. I saw write_reused_pack() in upstream git just > copy raw bytes from original to destination pack. You mentioned you have > something better for pack reuse - in your patch queue, in two words, is > it now reusing pack based on object, not raw bytes, or is it something > else? > > In other words in which way it works better? (I'm just curious here as > it is interesting to know) > > > > > > - we don't learn about the filename of trees and blobs, which is going > > > > to make the delta step much less efficient. This might be mitigated > > > > by turning on the bitmap name-hash cache; I don't recall how much > > > > detail pack-objects needs on the name (i.e., the full name versus > > > > just the hash). > > > > > > If I understand it right, it uses only uint32_t name hash while > > > searching. From > > > pack-objects.{h,c} : > > > > Yeah, I think you are right. Not having the real names is a problem for > > doing rev-list output, but I think pack-objects doesn't care (though do > > note that the name-hash cache is not enabled by default). > > Yes, for packing it is only hash which is used. And I assume name-hash > for bitmap is not enabled by default for compatibility with JGit code. > > It would make sense to me to eventually enable name-hash bitmap > extension by default, as packing result is much better with it. And > those who care about compatibility with JGit can just turn it off in > their git config. > > Just my thoughts. > > > > > There may be other subtle things, too. The general idea of tying the > > > > bitmap use to pack_to_stdout is that you _do_ want to use it for > > > > serving fetches and pushes, but for a full on-disk repack via gc, it's > > > > more important to generate a good pack. > > > > > > It is better we send good packs to clients too, right? And with > > > pack.writeBitmapHashCache=true and retaining recency order (please see > > > above, but again maybe I'm not completely right) to me we should be still > > > generating a good pack while using bitmap reachability index for object > > > graph traversal. > > > > We do want to send the client a good pack, but it's always a tradeoff. > > We could spend much more time searching for the perfect delta, but at > > some point we have to decide on how much CPU to spend serving them. > > Li
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Wed, Jul 13, 2016 at 04:26:53AM -0400, Jeff King wrote: > On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote: > > > > - we will not compute the same write order (which is based on > > > traversal order), leading to packs that have less efficient cache > > > characteristics > > > > I agree the order can be not exactly the same. Still if original pack is > > packed well (with good recency order), while using bitmap we will tend > > to traverse it in close to original order. > > > > Maybe I'm not completely right on this, but to me it looks to be the > > case because if objects in original pack are put there linearly sorted > > by recency order, and we use bitmap index to set of all reachable > > objects from a root, and then just _linearly_ gather all those objects > > from original pack by 1s in bitmap and put them in the same order into > > destination pack, the recency order won't be broken. > > > > Or am I maybe misunderstanding something? > > Yeah, I think you can go some of the way by reusing the order from the > old pack. But keep in mind that the bitmap result may also contain > objects that are not yet packed. Those will just come in a big lump at > the end of the bitmap (these are the "extended entries" in the bitmap > code). > > So I think if you were to repeatedly "git repack -adb" over time, you > would get worse and worse ordering as objects are added to the > repository. Jeff, first of all thanks for clarifying. So it is not-yet-packed-objects which make packing with bitmap less efficient. I was originally keeping in mind fresh repacked repository with just built bitmap index and for that case extracting pack with bitmap index seems to be just ok, but the more not-yet-packed objects we have the worse the result can be. > As an aside, two other things that pack order matters for: it makes the > bitmaps themselves compress better (because it increases locality of > reachability, so you get nice runs of "1" or "0" bits). Yes I agree and thanks for bringing this up - putting objects in recency order in pack also makes bitmap index to have larger runs of same 1 or 0. > It also makes > the pack-reuse code more efficient (since in an ideal case, you can just > dump a big block of data from the front of the pack). Note that the > pack-reuse code that's in upstream git isn't that great; I have a better > system on my big pile of patches to send upstream (that never seems to > get smaller; ). Yes, it also make sense. I saw write_reused_pack() in upstream git just copy raw bytes from original to destination pack. You mentioned you have something better for pack reuse - in your patch queue, in two words, is it now reusing pack based on object, not raw bytes, or is it something else? In other words in which way it works better? (I'm just curious here as it is interesting to know) > > > - we don't learn about the filename of trees and blobs, which is going > > > to make the delta step much less efficient. This might be mitigated > > > by turning on the bitmap name-hash cache; I don't recall how much > > > detail pack-objects needs on the name (i.e., the full name versus > > > just the hash). > > > > If I understand it right, it uses only uint32_t name hash while searching. > > From > > pack-objects.{h,c} : > > Yeah, I think you are right. Not having the real names is a problem for > doing rev-list output, but I think pack-objects doesn't care (though do > note that the name-hash cache is not enabled by default). Yes, for packing it is only hash which is used. And I assume name-hash for bitmap is not enabled by default for compatibility with JGit code. It would make sense to me to eventually enable name-hash bitmap extension by default, as packing result is much better with it. And those who care about compatibility with JGit can just turn it off in their git config. Just my thoughts. > > > There may be other subtle things, too. The general idea of tying the > > > bitmap use to pack_to_stdout is that you _do_ want to use it for > > > serving fetches and pushes, but for a full on-disk repack via gc, it's > > > more important to generate a good pack. > > > > It is better we send good packs to clients too, right? And with > > pack.writeBitmapHashCache=true and retaining recency order (please see > > above, but again maybe I'm not completely right) to me we should be still > > generating a good pack while using bitmap reachability index for object > > graph traversal. > > We do want to send the client a good pack, but it's always a tradeoff. > We could spend much more time searching for the perfect delta, but at > some point we have to decide on how much CPU to spend serving them. > Likewise, even if the bitmapped packs we send are in slightly worse > order, saving a minute of CPU time off of every clone of the kernel is a > big deal. Yes, this I understand and agree. Like I said above I was imagining freshly repacked repo with recently rebuilt
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Tue, Jul 12, 2016 at 10:08:08PM +0300, Kirill Smelkov wrote: > > Or are we fine with my arguments about recency order staying the same > > when using bitmap reachability index for object graph traversal, and this > > way the patch is fine to go in as it is? > > Since there is no reply I assume the safe way to go is to let default > for pack-to-file case to be "not using bitmap index". Please find updated > patch and interdiff below. I would still be grateful for feedback on > my above use-bitmap-for-pack-to-file arguments. Yeah, I think that is a reasonable approach. I see here you've added new config, though, and I don't think we want that. For your purposes, where you're driving pack-objects individually, I think a command-line option makes more sense. If we did want to have a flag for "use bitmaps when repacking via repack", I think it should be "repack.useBitmaps", and git-repack should pass the command-line option to pack-objects. pack-objects is porcelain and should not really be reading config at all. You'll note that pack.writeBitmaps was a mistake and got deprecated in favor of repack.writeBitmaps. I think pack.useBitmaps is a mistake, too, but nobody has really noticed or cared because there's no good reason to set it (the more interesting question is: are there bitmaps available? and if so, we try to use them). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote: > > - we will not compute the same write order (which is based on > > traversal order), leading to packs that have less efficient cache > > characteristics > > I agree the order can be not exactly the same. Still if original pack is > packed well (with good recency order), while using bitmap we will tend > to traverse it in close to original order. > > Maybe I'm not completely right on this, but to me it looks to be the > case because if objects in original pack are put there linearly sorted > by recency order, and we use bitmap index to set of all reachable > objects from a root, and then just _linearly_ gather all those objects > from original pack by 1s in bitmap and put them in the same order into > destination pack, the recency order won't be broken. > > Or am I maybe misunderstanding something? Yeah, I think you can go some of the way by reusing the order from the old pack. But keep in mind that the bitmap result may also contain objects that are not yet packed. Those will just come in a big lump at the end of the bitmap (these are the "extended entries" in the bitmap code). So I think if you were to repeatedly "git repack -adb" over time, you would get worse and worse ordering as objects are added to the repository. As an aside, two other things that pack order matters for: it makes the bitmaps themselves compress better (because it increases locality of reachability, so you get nice runs of "1" or "0" bits). It also makes the pack-reuse code more efficient (since in an ideal case, you can just dump a big block of data from the front of the pack). Note that the pack-reuse code that's in upstream git isn't that great; I have a better system on my big pile of patches to send upstream (that never seems to get smaller; ). > > - we don't learn about the filename of trees and blobs, which is going > > to make the delta step much less efficient. This might be mitigated > > by turning on the bitmap name-hash cache; I don't recall how much > > detail pack-objects needs on the name (i.e., the full name versus > > just the hash). > > If I understand it right, it uses only uint32_t name hash while searching. > From > pack-objects.{h,c} : Yeah, I think you are right. Not having the real names is a problem for doing rev-list output, but I think pack-objects doesn't care (though do note that the name-hash cache is not enabled by default). > > There may be other subtle things, too. The general idea of tying the > > bitmap use to pack_to_stdout is that you _do_ want to use it for > > serving fetches and pushes, but for a full on-disk repack via gc, it's > > more important to generate a good pack. > > It is better we send good packs to clients too, right? And with > pack.writeBitmapHashCache=true and retaining recency order (please see > above, but again maybe I'm not completely right) to me we should be still > generating a good pack while using bitmap reachability index for object > graph traversal. We do want to send the client a good pack, but it's always a tradeoff. We could spend much more time searching for the perfect delta, but at some point we have to decide on how much CPU to spend serving them. Likewise, even if the bitmapped packs we send are in slightly worse order, saving a minute of CPU time off of every clone of the kernel is a big deal. We also take robustness shortcuts when sending to clients. For example, when doing an on-disk repack we re-crc32 all of the delta data we are reusing, even if we don't actually inflate it (because we would want to stop immediately if we see even a single bit flipped on disk). But we don't check them when sending to a client, because we know they are going to actually `index-pack` it and get a stronger consistency check anyway, and don't want to waste server CPU. The bitmaps are sort of the same. If there is a bug or corruption in the bitmap, the worst case is that we send a broken pack to the client, who will complain that we did not give them all of the objects. It's a momentary problem that can be fixed. If you use them for an on-disk repack, then the next step is usually to delete all of the old packs. So a corruption there carries forward, and is irreversible. As I understand your use case, it is OK to do the less careful things. It's just that pack-objects until now has been split into two modes: packing to a file is careful, and packing to stdout is less so. And you want to pack to a file in the non-careful mode. > > but I wonder if you should be using "pack-objects --stdout" yourself. > > I already tried --stdout. The problem is on repository extraction we > need to both extract the pack and index it. While `pack-object file` > does both, for --stdout case we need to additionally index extracted > pack with `git index-pack`, and standalone `git index-pack` is very slow > - in my experience much slower than generating the pack itself: Ah, right, that
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote: > Peff first of all thanks for feedback, > > On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote: > > On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote: > > > > > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects) > > > if a repository has bitmap index, pack-objects can nicely speedup > > > "Counting objects" graph traversal phase. That however was done only for > > > case when resultant pack is sent to stdout, not written into a file. > > > > > > We can teach pack-objects to use bitmap index for initial object > > > counting phase when generating resultant pack file too: > > > > I'm not sure this is a good idea in general. When bitmaps are in use, we > > cannot fill out the details in the object-packing list as thoroughly. In > > particular: > > > > - we will not compute the same write order (which is based on > > traversal order), leading to packs that have less efficient cache > > characteristics > > I agree the order can be not exactly the same. Still if original pack is > packed well (with good recency order), while using bitmap we will tend > to traverse it in close to original order. > > Maybe I'm not completely right on this, but to me it looks to be the > case because if objects in original pack are put there linearly sorted > by recency order, and we use bitmap index to set of all reachable > objects from a root, and then just _linearly_ gather all those objects > from original pack by 1s in bitmap and put them in the same order into > destination pack, the recency order won't be broken. > > Or am I maybe misunderstanding something? > > Please also see below: > > > - we don't learn about the filename of trees and blobs, which is going > > to make the delta step much less efficient. This might be mitigated > > by turning on the bitmap name-hash cache; I don't recall how much > > detail pack-objects needs on the name (i.e., the full name versus > > just the hash). > > If I understand it right, it uses only uint32_t name hash while searching. > From > pack-objects.{h,c} : > > 8< > struct object_entry { > ... > uint32_t hash; /* name hint hash */ > > > /* > * We search for deltas in a list sorted by type, by filename hash, and then > * by size, so that we see progressively smaller and smaller files. > * That's because we prefer deltas to be from the bigger file > * to the smaller -- deletes are potentially cheaper, but perhaps > * more importantly, the bigger file is likely the more recent > * one. The deepest deltas are therefore the oldest objects which are > * less susceptible to be accessed often. > */ > static int type_size_sort(const void *_a, const void *_b) > { > const struct object_entry *a = *(struct object_entry **)_a; > const struct object_entry *b = *(struct object_entry **)_b; > > if (a->type > b->type) > return -1; > if (a->type < b->type) > return 1; > if (a->hash > b->hash) > return -1; > if (a->hash < b->hash) > return 1; > ... > 8< > > Documentation/technical/pack-heuristics.txt also confirms this: > > 8< > ... > The quote from the above linus should be rewritten a > bit (wait for it): > - first sort by type. Different objects never delta with > each other. > - then sort by filename/dirname. hash of the basename > occupies the top BITS_PER_INT-DIR_BITS bits, and bottom > DIR_BITS are for the hash of leading path elements. > > ... > > If I might add, the trick is to make files that _might_ be similar be > located close to each other in the hash buckets based on their file > names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and > "Makefile" all landed in the same bucket due to their common basename, > "Makefile". However, now they land in "close" buckets. > > The algorithm allows not just for the _same_ bucket, but for _close_ > buckets to be considered delta candidates. The rationale is > essentially that files, like Makefiles, often have very similar > content no matter what directory they live in. > 8< > > > So yes, exactly as you say with pack.writeBitmapHashCache=true (ae4f07fb) the > delta-search heuristics is almost as efficient as with just raw filenames. > > I can confirm this also via e.g. (with my patch applied) : > > 8< > $ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs > erp5pack-plain > Counting objects: 627171, done. > Compressing objects: 100% (176949/176949), done. > 50570987560d481742af4a8083028c2322a0534a > Writing objects: 100% (627171/627171), done. > Total 627171 (delta 439404), reused 594820 (delta 410210) > > real0m37.272s > user0m33.648s > sys
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
Peff first of all thanks for feedback, On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote: > On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote: > > > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects) > > if a repository has bitmap index, pack-objects can nicely speedup > > "Counting objects" graph traversal phase. That however was done only for > > case when resultant pack is sent to stdout, not written into a file. > > > > We can teach pack-objects to use bitmap index for initial object > > counting phase when generating resultant pack file too: > > I'm not sure this is a good idea in general. When bitmaps are in use, we > cannot fill out the details in the object-packing list as thoroughly. In > particular: > > - we will not compute the same write order (which is based on > traversal order), leading to packs that have less efficient cache > characteristics I agree the order can be not exactly the same. Still if original pack is packed well (with good recency order), while using bitmap we will tend to traverse it in close to original order. Maybe I'm not completely right on this, but to me it looks to be the case because if objects in original pack are put there linearly sorted by recency order, and we use bitmap index to set of all reachable objects from a root, and then just _linearly_ gather all those objects from original pack by 1s in bitmap and put them in the same order into destination pack, the recency order won't be broken. Or am I maybe misunderstanding something? Please also see below: > - we don't learn about the filename of trees and blobs, which is going > to make the delta step much less efficient. This might be mitigated > by turning on the bitmap name-hash cache; I don't recall how much > detail pack-objects needs on the name (i.e., the full name versus > just the hash). If I understand it right, it uses only uint32_t name hash while searching. From pack-objects.{h,c} : 8< struct object_entry { ... uint32_t hash; /* name hint hash */ /* * We search for deltas in a list sorted by type, by filename hash, and then * by size, so that we see progressively smaller and smaller files. * That's because we prefer deltas to be from the bigger file * to the smaller -- deletes are potentially cheaper, but perhaps * more importantly, the bigger file is likely the more recent * one. The deepest deltas are therefore the oldest objects which are * less susceptible to be accessed often. */ static int type_size_sort(const void *_a, const void *_b) { const struct object_entry *a = *(struct object_entry **)_a; const struct object_entry *b = *(struct object_entry **)_b; if (a->type > b->type) return -1; if (a->type < b->type) return 1; if (a->hash > b->hash) return -1; if (a->hash < b->hash) return 1; ... 8< Documentation/technical/pack-heuristics.txt also confirms this: 8< ... The quote from the above linus should be rewritten a bit (wait for it): - first sort by type. Different objects never delta with each other. - then sort by filename/dirname. hash of the basename occupies the top BITS_PER_INT-DIR_BITS bits, and bottom DIR_BITS are for the hash of leading path elements. ... If I might add, the trick is to make files that _might_ be similar be located close to each other in the hash buckets based on their file names. It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and "Makefile" all landed in the same bucket due to their common basename, "Makefile". However, now they land in "close" buckets. The algorithm allows not just for the _same_ bucket, but for _close_ buckets to be considered delta candidates. The rationale is essentially that files, like Makefiles, often have very similar content no matter what directory they live in. 8< So yes, exactly as you say with pack.writeBitmapHashCache=true (ae4f07fb) the delta-search heuristics is almost as efficient as with just raw filenames. I can confirm this also via e.g. (with my patch applied) : 8< $ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs erp5pack-plain Counting objects: 627171, done. Compressing objects: 100% (176949/176949), done. 50570987560d481742af4a8083028c2322a0534a Writing objects: 100% (627171/627171), done. Total 627171 (delta 439404), reused 594820 (delta 410210) real0m37.272s user0m33.648s sys 0m1.580s $ time echo 0186ac99 | git pack-objects --revs erp5pack-bitmap Counting objects: 627171, done. Compressing objects: 100% (176914/176914), done. 7c15a9b1eca1326e679297b217c5a48954625ca2 Writing objects: 100% (627171/627171), done. Total 627171 (delta 439484), reused 594855 (delta 410245) re
Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote: > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects) > if a repository has bitmap index, pack-objects can nicely speedup > "Counting objects" graph traversal phase. That however was done only for > case when resultant pack is sent to stdout, not written into a file. > > We can teach pack-objects to use bitmap index for initial object > counting phase when generating resultant pack file too: I'm not sure this is a good idea in general. When bitmaps are in use, we cannot fill out the details in the object-packing list as thoroughly. In particular: - we will not compute the same write order (which is based on traversal order), leading to packs that have less efficient cache characteristics - we don't learn about the filename of trees and blobs, which is going to make the delta step much less efficient. This might be mitigated by turning on the bitmap name-hash cache; I don't recall how much detail pack-objects needs on the name (i.e., the full name versus just the hash). There may be other subtle things, too. The general idea of tying the bitmap use to pack_to_stdout is that you _do_ want to use it for serving fetches and pushes, but for a full on-disk repack via gc, it's more important to generate a good pack. Your use case: > git-backup extracts many packs on repositories restoration. That was my > initial motivation for the patch. Seems to be somewhere in between. I'm not sure I understand how you're invoking pack-objects here, but I wonder if you should be using "pack-objects --stdout" yourself. But even if it is the right thing for your use case to be using bitmaps to generate an on-disk bitmap, I think we should be making sure it _doesn't_ trigger when doing a normal repack. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too
Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects) if a repository has bitmap index, pack-objects can nicely speedup "Counting objects" graph traversal phase. That however was done only for case when resultant pack is sent to stdout, not written into a file. We can teach pack-objects to use bitmap index for initial object counting phase when generating resultant pack file too: - if we know bitmap index generation is not enabled for resultant pack: Current code has singleton bitmap_git so cannot work simultaneously with two bitmap indices. - if we keep pack reuse enabled still only for "send-to-stdout" case: Because on pack reuse raw entries are directly written out to destination pack by write_reused_pack() bypassing needed for pack index generation bookkeeping done by regular codepath in write_one() and friends. (at least that's my understanding after briefly looking at the code) We also need to care and teach add_object_entry_from_bitmap() to respect --local via not adding nonlocal loose object to resultant pack (this is bitmap-codepath counterpart of daae0625 (pack-objects: extend --local to mean ignore non-local loose objects too) -- not to break 'loose objects in alternate ODB are not repacked' in t7700-repack.sh . Otherwise all git tests pass, and for pack-objects -> file we get nice speedup: erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup repository managed by git-backup[2] via time echo 0186ac99 | git pack-objects --revs erp5pack before: 37.2s after: 26.2s And for `git repack -adb` packed git.git time echo 5c589a73 | git pack-objects --revs gitpack before: 7.1s after:3.6s i.e. it can be 30% - 50% speedup for pack extraction. git-backup extracts many packs on repositories restoration. That was my initial motivation for the patch. [1] https://lab.nexedi.com/nexedi/erp5 [2] https://lab.nexedi.com/kirr/git-backup Cc: Vicent Marti Cc: Jeff King Signed-off-by: Kirill Smelkov --- builtin/pack-objects.c | 7 +-- t/t5310-pack-bitmaps.sh | 9 + 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c index a2f8cfd..be0ebe8 100644 --- a/builtin/pack-objects.c +++ b/builtin/pack-objects.c @@ -1052,6 +1052,9 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1, { uint32_t index_pos; + if (local && has_loose_object_nonlocal(sha1)) + return 0; + if (have_duplicate_entry(sha1, 0, &index_pos)) return 0; @@ -2488,7 +2491,7 @@ static int get_object_list_from_bitmap(struct rev_info *revs) if (prepare_bitmap_walk(revs) < 0) return -1; - if (pack_options_allow_reuse() && + if (pack_options_allow_reuse() && pack_to_stdout && !reuse_partial_packfile_from_bitmap( &reuse_packfile, &reuse_packfile_objects, @@ -2773,7 +2776,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix) if (!rev_list_all || !rev_list_reflog || !rev_list_index) unpack_unreachable_expiration = 0; - if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow()) + if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow()) use_bitmap_index = 0; if (pack_to_stdout || !rev_list_all) diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh index 3893afd..533fc31 100755 --- a/t/t5310-pack-bitmaps.sh +++ b/t/t5310-pack-bitmaps.sh @@ -118,6 +118,15 @@ test_expect_success 'incremental repack can disable bitmaps' ' git repack -d --no-write-bitmap-index ' +test_expect_success 'pack-objects to file can use bitmap' ' + # make sure we still have 1 bitmap index from previous tests + ls .git/objects/pack/ | grep bitmap >output && + test_line_count = 1 output && + # pack-objects uses bitmap index by default, when it is available + packsha1=$(git pack-objects --all mypack output && -- 2.9.0.431.gb11dac7.dirty -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html