Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-29 Thread Kirill Smelkov
On Thu, Jul 28, 2016 at 02:18:29PM -0700, Junio C Hamano wrote:
> Kirill Smelkov  writes:
> 
> > I'm waiting so long for main patch to be at least queued to pu, that I'm
> > now a bit frustrated and ready to do something not related to main goal :)
> 
> Perhaps the first step would be to stop putting multiple patches in
> a single e-mail buried after a few pages of discussion.  I will not
> even find that there _are_ multiple patches in the message if I am
> not involved directly in the discussion, and the discussion is still
> ongoing, because it is likely that I'd skim just a few paragraphs at
> the top before going on to other messages.
> 
> I won't touch the message I am responding to, as your -- 8< -- cut
> mark does not even seem to be a reliable marker between patches
> (i.e.  I see something like this that is clearly not a message
> boundary:
> 
> than `git pack-objects file.pack`. Extracting erp5.git pack from
> lab.nexedi.com backup repository:
> 
>  8< 
> $ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack
> 
> real0m22.309s
> ...
> )

Ok, makes sense and my fault. I'm resending each patch as separate
message in reply to this mail.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-28 Thread Junio C Hamano
Kirill Smelkov  writes:

> I'm waiting so long for main patch to be at least queued to pu, that I'm
> now a bit frustrated and ready to do something not related to main goal :)

Perhaps the first step would be to stop putting multiple patches in
a single e-mail buried after a few pages of discussion.  I will not
even find that there _are_ multiple patches in the message if I am
not involved directly in the discussion, and the discussion is still
ongoing, because it is likely that I'd skim just a few paragraphs at
the top before going on to other messages.

I won't touch the message I am responding to, as your -- 8< -- cut
mark does not even seem to be a reliable marker between patches
(i.e.  I see something like this that is clearly not a message
boundary:

than `git pack-objects file.pack`. Extracting erp5.git pack from
lab.nexedi.com backup repository:

 8< 
$ time echo 0186ac99 | git pack-objects --stdout --revs >erp5pack-stdout.pack

real0m22.309s
...
)

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-28 Thread Kirill Smelkov
Junio, first of all thanks for feedback,

On Wed, Jul 27, 2016 at 01:40:36PM -0700, Junio C Hamano wrote:
> Kirill Smelkov  writes:
> 
> > > From: Kirill Smelkov 
> > Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active 
> > under
> >  --local or --honor-pack-keep
> >
> > Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> > are two codepaths in pack-objects: with & without using bitmap
> > reachability index.
> >
> > However add_object_entry_from_bitmap(), despite its non-bitmapped
> > counterpart add_object_entry(), in no way does check for whether --local
> > or --honor-pack-keep should be respected. In non-bitmapped codepath this
> > is handled in want_object_in_pack(), but bitmapped codepath has simply
> > no such checking at all.
> >
> > The bitmapped codepath however was allowing to pass --local and
> > --honor-pack-keep and bitmap indices were still used under such
> > conditions - potentially giving wrong output (including objects from
> > non-local or .keep'ed pack).
> >
> > Instead of fixing bitmapped codepath to respect those options, since
> > currently no one actually need or use them in combination with bitmaps,
> > let's just force use_bitmap_index=0 when any of --local or
> > --honor-pack-keep are used and add appropriate comment about
> > not-checking for those in add_object_entry_from_bitmap()
> >
> > Suggested-by: Jeff King 
> > ---
> >  builtin/pack-objects.c | 15 +++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> > index 15866d7..d7cf782 100644
> > --- a/builtin/pack-objects.c
> > +++ b/builtin/pack-objects.c
> > @@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const 
> > unsigned char *sha1,
> > if (have_duplicate_entry(sha1, 0, &index_pos))
> > return 0;
> >  
> > +   /*
> > +* for simplicity we always want object to be in pack, as
> > +* use_bitmap_index codepath assumes neither --local nor 
> > --honor-pack-keep
> > +* is active.
> > +*/
> 
> I am not sure this comment is useful to readers.
> 
> Unless the readers are comparing add_object_entry() and this
> function and wondering why this side lacks a check here, iow, when
> they are merely following from a caller of this function through
> this function down to its callee to understand what goes on, this
> comment would not help them and only confuse them.
> 
> If we were to say something to help those who are comparing these
> two functions, I think we should be more explicit, i.e.
> 
> The caller disables use-bitmap-index when --local or
> --honor-pack-keep options are in effect because bitmap code is
> not prepared to handle them.  Because the control does not reach
> here if these options are in effect, the check with
> want_object_in_pack() to skip objects is not done.
> 
> or something like that.

You are probably right.


> Or is the rest of the bitmap codepath prepared to handle these
> options and it is just the matter of adding the missing check with
> want_object_in_pack() here to make it work correctly?

I'm waiting so long for main patch to be at least queued to pu, that I'm
now a bit frustrated and ready to do something not related to main goal :)
(they say every joke contains part of a joke). Here is something from
sleepy me:

 8< 
From: Kirill Smelkov 
Date: Wed, 27 Jul 2016 22:18:04 +0300
Subject: [PATCH 1/2] pack-objects: Teach --use-bitmap-index codepath to
 respect --local, --honor-pack-keep and --incremental

Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
are two codepaths in pack-objects: with & without using bitmap
reachability index.

However add_object_entry_from_bitmap(), despite its non-bitmapped
counterpart add_object_entry(), in no way does check for whether --local
or --honor-pack-keep or --incremental should be respected. In
non-bitmapped codepath this is handled in want_object_in_pack(), but
bitmapped codepath has simply no such checking at all.

The bitmapped codepath however was allowing to pass in all those options
and with bitmap indices still being used under such conditions -
potentially giving wrong output (e.g. including objects from non-local or
.keep'ed pack).

We can easily fix this by noting the following: when an object comes to
add_object_entry_from_bitmap() it can come for two reasons:

1. entries coming from main pack covered by bitmap index, and
2. object coming from, possibly alternate, loose or other packs.

For "2" we always have pack not yet found by bitmap traversal code, and
thus we can simply reuse non-bitmapped want_object_in_pack() to find in
which pack an object lives and also for taking omitting decision.

For "1" we always have pack already found by bitmap traversal code and we
only need to check that pack for same omission criteria used in
want_object_in_pack() for found_pack.

Suggested-by: Junio C Hamano 
Discussed-with: Jeff King 
---
 builtin/pack

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-27 Thread Junio C Hamano
Kirill Smelkov  writes:

> > From: Kirill Smelkov 
> Subject: [PATCH 1/2] pack-objects: Make sure use_bitmap_index is not active 
> under
>  --local or --honor-pack-keep
>
> Since 6b8fda2d (pack-objects: use bitmaps when packing objects) there
> are two codepaths in pack-objects: with & without using bitmap
> reachability index.
>
> However add_object_entry_from_bitmap(), despite its non-bitmapped
> counterpart add_object_entry(), in no way does check for whether --local
> or --honor-pack-keep should be respected. In non-bitmapped codepath this
> is handled in want_object_in_pack(), but bitmapped codepath has simply
> no such checking at all.
>
> The bitmapped codepath however was allowing to pass --local and
> --honor-pack-keep and bitmap indices were still used under such
> conditions - potentially giving wrong output (including objects from
> non-local or .keep'ed pack).
>
> Instead of fixing bitmapped codepath to respect those options, since
> currently no one actually need or use them in combination with bitmaps,
> let's just force use_bitmap_index=0 when any of --local or
> --honor-pack-keep are used and add appropriate comment about
> not-checking for those in add_object_entry_from_bitmap()
>
> Suggested-by: Jeff King 
> ---
>  builtin/pack-objects.c | 15 +++
>  1 file changed, 15 insertions(+)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 15866d7..d7cf782 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1055,6 +1055,12 @@ static int add_object_entry_from_bitmap(const unsigned 
> char *sha1,
>   if (have_duplicate_entry(sha1, 0, &index_pos))
>   return 0;
>  
> + /*
> +  * for simplicity we always want object to be in pack, as
> +  * use_bitmap_index codepath assumes neither --local nor 
> --honor-pack-keep
> +  * is active.
> +  */

I am not sure this comment is useful to readers.

Unless the readers are comparing add_object_entry() and this
function and wondering why this side lacks a check here, iow, when
they are merely following from a caller of this function through
this function down to its callee to understand what goes on, this
comment would not help them and only confuse them.

If we were to say something to help those who are comparing these
two functions, I think we should be more explicit, i.e.

The caller disables use-bitmap-index when --local or
--honor-pack-keep options are in effect because bitmap code is
not prepared to handle them.  Because the control does not reach
here if these options are in effect, the check with
want_object_in_pack() to skip objects is not done.

or something like that.

Or is the rest of the bitmap codepath prepared to handle these
options and it is just the matter of adding the missing check with
want_object_in_pack() here to make it work correctly?

>   create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, 
> offset);
>  
>   display_progress(progress_state, nr_result);
> @@ -2776,6 +2782,15 @@ int cmd_pack_objects(int argc, const char **argv, 
> const char *prefix)
>   if (!use_internal_rev_list || !pack_to_stdout || 
> is_repository_shallow())
>   use_bitmap_index = 0;
>  
> + /*
> +  * "lazy" reasons not to use bitmaps; it is easier to reason about when
> +  * neither --local nor --honor-pack-keep is in action, and so far no one
> +  * needed nor implemented such support yet.
> +  */

Justifying comment like this is a good idea, but the comment above
does not make it very clear that this is a correctness fix, i.e. if
we do not disable, the code will do a wrong thing.

The other logic to disable use of bitmap we can see in the
pre-context would also benefit from some description as to why;
6b8fda2d (pack-objects: use bitmaps when packing objects,
2013-12-21) didn't do a very good job in that---the reason is not
clear in its log message, either.

> + if (local || ignore_packed_keep)
> + use_bitmap_index = 0;
> +
> +

I see one extra blank line here ;-)

>   if (pack_to_stdout || !rev_list_all)
>   write_bitmap_index = 0;

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-27 Thread Kirill Smelkov
On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote:
> On Wed, Jul 13, 2016 at 01:52:17PM +0300, Kirill Smelkov wrote:
> 
> > > So I think if you were to repeatedly "git repack -adb" over time, you
> > > would get worse and worse ordering as objects are added to the
> > > repository.
> > 
> > Jeff, first of all thanks for clarifying.
> > 
> > So it is not-yet-packed-objects which make packing with bitmap less
> > efficient. I was originally keeping in mind fresh repacked repository
> > with just built bitmap index and for that case extracting pack with
> > bitmap index seems to be just ok, but the more not-yet-packed objects we
> > have the worse the result can be.
> 
> Right. So I think your scheme is fine as long as you are doing your
> regular "pack all into one" repacks with a real walk, and then
> "branching" off of that with one-off bitmap-computed packs into files
> (even if you later take a bunch of those files and pull them into a
> single bitmapped, as long as that final "all into one" does the walk).
> 
> Or I guess another way to think about it would be that if you're
> computing bitmaps, you'd want to do the actual traversal.

Yes, exactly, and thanks for stating it clearly. We are doing repacks
and recomputing bitmaps doing the real walk. As you say this should be
fine.


> > Yes, it also make sense. I saw write_reused_pack() in upstream git just
> > copy raw bytes from original to destination pack. You mentioned you have
> > something better for pack reuse - in your patch queue, in two words, is
> > it now reusing pack based on object, not raw bytes, or is it something
> > else?
> > 
> > In other words in which way it works better? (I'm just curious here as
> > it is interesting to know)
> 
> The problem with the existing pack-reuse code is that it doesn't kick in
> often enough. I think it looks to see that the client wants some
> percentage of the pack (e.g., 90%), and then just sends the whole
> beginning. This works especially badly if you have a bunch of related
> repositories packed together (e.g., all of the forks of torvalds/linux
> on GitHub), because you'll never hit 90% of that big pack; it has too
> much unrelated cruft, even if most of the stuff you want _is_ at the
> beginning. And "percent of pack" is not really a useful metric anyway.
> 
> So the better scheme is more like:
> 
>   1. Generate the bitmap of objects to send using reachability bitmaps.
> 
>   2. Do a quick scan of their content in the packfile to see which can
>  be reused verbatim. If they're base objects, we can send them
>  as-is. If they're deltas, we can send them if their base is going
>  to be sent. This fills in another bitmap of "reusable" objects.
> 
>  After a long string of unusable objects, you can give up and set
>  the rest of the bitmap to zeroes.
> 
>   3. Walk the "reuse" bitmap and send out the objects more-or-less
>  verbatim. You do have make adjustments to delta-base-offsets for
>  any "holes" (so if an object's entry says "my base is 500 bytes
>  back", but you omitted some objects in between, you have to adjust
>  that offset).
> 
> The upside is that you can send out those objects without even making a
> "struct object_entry" for them, which drastically reduces the memory
> requirements for serving a clone. Any objects which didn't get marked
> for reuse just get handled in the usual way (so stuff that was not close
> by in the pack, or stuff that was pushed since your last big repack).

Thanks for clarifying. Yes, you are right, current upstream code checks
to see whether >= 90% of pack is what destination wants and only reuse
in such case. (I forgot about it, initially putting reuse at side in my
head as "not applicable to git-backup" because of that >= 90% reason).

So with the scheme you are drawing above it can be indeed more
efficient, and applicable to both torvalds/linux+forks and git-backup
case (extracting packs from big pack of all repos).

I'm looking forward to your patches on this topic. Please cc me on those
if you find it convenient.

> The downside is that because those objects aren't in our normal packing
> list, they're not available as delta bases for the new objects we _do_
> send. So it can make the resulting pack a little bit bigger.

So once again, the badness effect is the more, the more we have such
"new" objects not in original main pack - i.e. as loose objects or
objects living in other smaller packs. The badness comes to zero in
ideal case of freshly repacked repo with only one big pack.

Also: after sending reused object, with more code effort, in principle
we can hook reused object for being considered as delta-bases for new
objects. I mean this should not be impossible in principle, or am I
missing something?


> > Yes, for packing it is only hash which is used. And I assume name-hash
> > for bitmap is not enabled by default for compatibility with JGit code.
> >
> > It would make sense to me to eventually enable na

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-25 Thread Jeff King
On Mon, Jul 25, 2016 at 02:40:25PM -0400, Jeff King wrote:

> > @@ -1052,6 +1053,9 @@ static int add_object_entry_from_bitmap(const 
> > unsigned char *sha1,
> >  {
> > uint32_t index_pos;
> >  
> > +   if (local && has_loose_object_nonlocal(sha1))
> > +   return 0;
> > +
> > if (have_duplicate_entry(sha1, 0, &index_pos))
> > return 0;
> 
> Hrm. Adding entries from the bitmap should ideally be very fast, but
> here we're introducing extra lookups in the object database. I guess it
> only kicks in when --local is given, though, which most bitmap-using
> paths would not do.
> 
> But is this check enough? The non-bitmap code path calls
> want_object_in_pack, which checks not only loose objects, but also
> non-local packs, and .keep.
> 
> Those don't kick in for your use case. I wonder if we should simply have
> something like:
> 
>   if (local || ignore_packed_keep)
>   use_bitmap_index = 0;
> 
> and just skip bitmaps for those cases. That's easy to reason about, and
> I don't think anybody would care (your use case does not, and the repack
> use case is already not going to use bitmaps).

BTW, I thought we had more optimizations in this area, but I realized
that I had never sent them to the list. I just did, and you may want to
take a peek at:

  http://thread.gmane.org/gmane.comp.version-control.git/300218

I doubt it will speed up your case much (unless you really do have tons
of packs in your extraction). And I think it is still worth doing
disabling I showed above, even with the optimizations, just because it's
easier to reason about.

So I _think_ those optimizations are orthogonal to what we're discussing
here, but I wanted to point you at them just in case.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-25 Thread Jeff King
On Wed, Jul 13, 2016 at 01:52:17PM +0300, Kirill Smelkov wrote:

> > So I think if you were to repeatedly "git repack -adb" over time, you
> > would get worse and worse ordering as objects are added to the
> > repository.
> 
> Jeff, first of all thanks for clarifying.
> 
> So it is not-yet-packed-objects which make packing with bitmap less
> efficient. I was originally keeping in mind fresh repacked repository
> with just built bitmap index and for that case extracting pack with
> bitmap index seems to be just ok, but the more not-yet-packed objects we
> have the worse the result can be.

Right. So I think your scheme is fine as long as you are doing your
regular "pack all into one" repacks with a real walk, and then
"branching" off of that with one-off bitmap-computed packs into files
(even if you later take a bunch of those files and pull them into a
single bitmapped, as long as that final "all into one" does the walk).

Or I guess another way to think about it would be that if you're
computing bitmaps, you'd want to do the actual traversal.

> Yes, it also make sense. I saw write_reused_pack() in upstream git just
> copy raw bytes from original to destination pack. You mentioned you have
> something better for pack reuse - in your patch queue, in two words, is
> it now reusing pack based on object, not raw bytes, or is it something
> else?
> 
> In other words in which way it works better? (I'm just curious here as
> it is interesting to know)

The problem with the existing pack-reuse code is that it doesn't kick in
often enough. I think it looks to see that the client wants some
percentage of the pack (e.g., 90%), and then just sends the whole
beginning. This works especially badly if you have a bunch of related
repositories packed together (e.g., all of the forks of torvalds/linux
on GitHub), because you'll never hit 90% of that big pack; it has too
much unrelated cruft, even if most of the stuff you want _is_ at the
beginning. And "percent of pack" is not really a useful metric anyway.

So the better scheme is more like:

  1. Generate the bitmap of objects to send using reachability bitmaps.

  2. Do a quick scan of their content in the packfile to see which can
 be reused verbatim. If they're base objects, we can send them
 as-is. If they're deltas, we can send them if their base is going
 to be sent. This fills in another bitmap of "reusable" objects.

 After a long string of unusable objects, you can give up and set
 the rest of the bitmap to zeroes.

  3. Walk the "reuse" bitmap and send out the objects more-or-less
 verbatim. You do have make adjustments to delta-base-offsets for
 any "holes" (so if an object's entry says "my base is 500 bytes
 back", but you omitted some objects in between, you have to adjust
 that offset).

The upside is that you can send out those objects without even making a
"struct object_entry" for them, which drastically reduces the memory
requirements for serving a clone. Any objects which didn't get marked
for reuse just get handled in the usual way (so stuff that was not close
by in the pack, or stuff that was pushed since your last big repack).

The downside is that because those objects aren't in our normal packing
list, they're not available as delta bases for the new objects we _do_
send. So it can make the resulting pack a little bit bigger.

> Yes, for packing it is only hash which is used. And I assume name-hash
> for bitmap is not enabled by default for compatibility with JGit code.
>
> It would make sense to me to eventually enable name-hash bitmap
> extension by default, as packing result is much better with it. And
> those who care about compatibility with JGit can just turn it off in
> their git config.

Correct, the defaults are for JGit compatibility. If you are not using
JGit, you should have it on all the time.  We went with the conservative
default, but as more people using regular Git bitmaps, it would probably
be good to make them less arcane and confusing to use.

> > As I understand your use case, it is OK to do the less careful things.
> > It's just that pack-objects until now has been split into two modes:
> > packing to a file is careful, and packing to stdout is less so. And you
> > want to pack to a file in the non-careful mode.
> 
> Yes, it should be ok, as after repository extraction git-backup
> verifies rev-list for all refs
> 
> https://lab.nexedi.com/kirr/git-backup/blob/7fcb8c67/git-backup.go#L855
> 
> And if an object is missing - e.g. a blob - rev-list complains:
> 
> fatal: missing blob object '980a0d5f19a64b4b30a87d4206aade58726b60e3'
> 
> though it does not catch blob corruptions.

Right, that makes sense. Even the pack-to-disk code invoked by
git-repack is not foolproof for blob corruptions. It is only checking a
crc, not the full sha1. So it's better than nothing, but not as careful
as a full index-pack.

> Probably pack.useBitmaps is of no use in normal situation, but for
> debugging 

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-19 Thread Kirill Smelkov
On Tue, Jul 19, 2016 at 05:29:07AM -0600, Jeff King wrote:
> On Sun, Jul 17, 2016 at 08:06:49PM +0300, Kirill Smelkov wrote:
> 
> > > Anyway, please find below updated patch according to your suggestion.
> > > Hope it is ok now.
> > 
> > Ping. Is the patch ok or something needs to be improved still?
> 
> Sorry, I'm traveling and haven't carefully reviewed it yet. It's still
> on my list, but it may be a few days.

Jeff thanks for feedback. Have a good traveling and good to know the patch
was not forgotten. I will be waiting for the time while you are on trip.

Thanks again for feedback,
Kirill
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-19 Thread Jeff King
On Sun, Jul 17, 2016 at 08:06:49PM +0300, Kirill Smelkov wrote:

> > Anyway, please find below updated patch according to your suggestion.
> > Hope it is ok now.
> 
> Ping. Is the patch ok or something needs to be improved still?

Sorry, I'm traveling and haven't carefully reviewed it yet. It's still
on my list, but it may be a few days.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-17 Thread Kirill Smelkov
On Wed, Jul 13, 2016 at 01:52:16PM +0300, Kirill Smelkov wrote:
> On Wed, Jul 13, 2016 at 04:26:53AM -0400, Jeff King wrote:
> > On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> > 
> > > >   - we will not compute the same write order (which is based on
> > > > traversal order), leading to packs that have less efficient cache
> > > > characteristics
> > > 
> > > I agree the order can be not exactly the same. Still if original pack is
> > > packed well (with good recency order), while using bitmap we will tend
> > > to traverse it in close to original order.
> > > 
> > > Maybe I'm not completely right on this, but to me it looks to be the
> > > case because if objects in original pack are put there linearly sorted
> > > by recency order, and we use bitmap index to set of all reachable
> > > objects from a root, and then just _linearly_ gather all those objects
> > > from original pack by 1s in bitmap and put them in the same order into
> > > destination pack, the recency order won't be broken.
> > > 
> > > Or am I maybe misunderstanding something?
> > 
> > Yeah, I think you can go some of the way by reusing the order from the
> > old pack. But keep in mind that the bitmap result may also contain
> > objects that are not yet packed. Those will just come in a big lump at
> > the end of the bitmap (these are the "extended entries" in the bitmap
> > code).
> > 
> > So I think if you were to repeatedly "git repack -adb" over time, you
> > would get worse and worse ordering as objects are added to the
> > repository.
> 
> Jeff, first of all thanks for clarifying.
> 
> So it is not-yet-packed-objects which make packing with bitmap less
> efficient. I was originally keeping in mind fresh repacked repository
> with just built bitmap index and for that case extracting pack with
> bitmap index seems to be just ok, but the more not-yet-packed objects we
> have the worse the result can be.
> 
> > As an aside, two other things that pack order matters for: it makes the
> > bitmaps themselves compress better (because it increases locality of
> > reachability, so you get nice runs of "1" or "0" bits).
> 
> Yes I agree and thanks for bringing this up - putting objects in recency
> order in pack also makes bitmap index to have larger runs of same 1 or 0.
> 
> > It also makes
> > the pack-reuse code more efficient (since in an ideal case, you can just
> > dump a big block of data from the front of the pack). Note that the
> > pack-reuse code that's in upstream git isn't that great; I have a better
> > system on my big pile of patches to send upstream (that never seems to
> > get smaller; ).
> 
> Yes, it also make sense. I saw write_reused_pack() in upstream git just
> copy raw bytes from original to destination pack. You mentioned you have
> something better for pack reuse - in your patch queue, in two words, is
> it now reusing pack based on object, not raw bytes, or is it something
> else?
> 
> In other words in which way it works better? (I'm just curious here as
> it is interesting to know)
> 
> 
> > > >   - we don't learn about the filename of trees and blobs, which is going
> > > > to make the delta step much less efficient. This might be mitigated
> > > > by turning on the bitmap name-hash cache; I don't recall how much
> > > > detail pack-objects needs on the name (i.e., the full name versus
> > > > just the hash).
> > > 
> > > If I understand it right, it uses only uint32_t name hash while 
> > > searching. From
> > > pack-objects.{h,c} :
> > 
> > Yeah, I think you are right. Not having the real names is a problem for
> > doing rev-list output, but I think pack-objects doesn't care (though do
> > note that the name-hash cache is not enabled by default).
> 
> Yes, for packing it is only hash which is used. And I assume name-hash
> for bitmap is not enabled by default for compatibility with JGit code.
> 
> It would make sense to me to eventually enable name-hash bitmap
> extension by default, as packing result is much better with it. And
> those who care about compatibility with JGit can just turn it off in
> their git config.
> 
> Just my thoughts.
> 
> > > > There may be other subtle things, too. The general idea of tying the
> > > > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > > > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > > > more important to generate a good pack.
> > > 
> > > It is better we send good packs to clients too, right? And with
> > > pack.writeBitmapHashCache=true and retaining recency order (please see
> > > above, but again maybe I'm not completely right) to me we should be still
> > > generating a good pack while using bitmap reachability index for object
> > > graph traversal.
> > 
> > We do want to send the client a good pack, but it's always a tradeoff.
> > We could spend much more time searching for the perfect delta, but at
> > some point we have to decide on how much CPU to spend serving them.
> > Li

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-13 Thread Kirill Smelkov
On Wed, Jul 13, 2016 at 04:26:53AM -0400, Jeff King wrote:
> On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> 
> > >   - we will not compute the same write order (which is based on
> > > traversal order), leading to packs that have less efficient cache
> > > characteristics
> > 
> > I agree the order can be not exactly the same. Still if original pack is
> > packed well (with good recency order), while using bitmap we will tend
> > to traverse it in close to original order.
> > 
> > Maybe I'm not completely right on this, but to me it looks to be the
> > case because if objects in original pack are put there linearly sorted
> > by recency order, and we use bitmap index to set of all reachable
> > objects from a root, and then just _linearly_ gather all those objects
> > from original pack by 1s in bitmap and put them in the same order into
> > destination pack, the recency order won't be broken.
> > 
> > Or am I maybe misunderstanding something?
> 
> Yeah, I think you can go some of the way by reusing the order from the
> old pack. But keep in mind that the bitmap result may also contain
> objects that are not yet packed. Those will just come in a big lump at
> the end of the bitmap (these are the "extended entries" in the bitmap
> code).
> 
> So I think if you were to repeatedly "git repack -adb" over time, you
> would get worse and worse ordering as objects are added to the
> repository.

Jeff, first of all thanks for clarifying.

So it is not-yet-packed-objects which make packing with bitmap less
efficient. I was originally keeping in mind fresh repacked repository
with just built bitmap index and for that case extracting pack with
bitmap index seems to be just ok, but the more not-yet-packed objects we
have the worse the result can be.

> As an aside, two other things that pack order matters for: it makes the
> bitmaps themselves compress better (because it increases locality of
> reachability, so you get nice runs of "1" or "0" bits).

Yes I agree and thanks for bringing this up - putting objects in recency
order in pack also makes bitmap index to have larger runs of same 1 or 0.

> It also makes
> the pack-reuse code more efficient (since in an ideal case, you can just
> dump a big block of data from the front of the pack). Note that the
> pack-reuse code that's in upstream git isn't that great; I have a better
> system on my big pile of patches to send upstream (that never seems to
> get smaller; ).

Yes, it also make sense. I saw write_reused_pack() in upstream git just
copy raw bytes from original to destination pack. You mentioned you have
something better for pack reuse - in your patch queue, in two words, is
it now reusing pack based on object, not raw bytes, or is it something
else?

In other words in which way it works better? (I'm just curious here as
it is interesting to know)


> > >   - we don't learn about the filename of trees and blobs, which is going
> > > to make the delta step much less efficient. This might be mitigated
> > > by turning on the bitmap name-hash cache; I don't recall how much
> > > detail pack-objects needs on the name (i.e., the full name versus
> > > just the hash).
> > 
> > If I understand it right, it uses only uint32_t name hash while searching. 
> > From
> > pack-objects.{h,c} :
> 
> Yeah, I think you are right. Not having the real names is a problem for
> doing rev-list output, but I think pack-objects doesn't care (though do
> note that the name-hash cache is not enabled by default).

Yes, for packing it is only hash which is used. And I assume name-hash
for bitmap is not enabled by default for compatibility with JGit code.

It would make sense to me to eventually enable name-hash bitmap
extension by default, as packing result is much better with it. And
those who care about compatibility with JGit can just turn it off in
their git config.

Just my thoughts.

> > > There may be other subtle things, too. The general idea of tying the
> > > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > > more important to generate a good pack.
> > 
> > It is better we send good packs to clients too, right? And with
> > pack.writeBitmapHashCache=true and retaining recency order (please see
> > above, but again maybe I'm not completely right) to me we should be still
> > generating a good pack while using bitmap reachability index for object
> > graph traversal.
> 
> We do want to send the client a good pack, but it's always a tradeoff.
> We could spend much more time searching for the perfect delta, but at
> some point we have to decide on how much CPU to spend serving them.
> Likewise, even if the bitmapped packs we send are in slightly worse
> order, saving a minute of CPU time off of every clone of the kernel is a
> big deal.

Yes, this I understand and agree. Like I said above I was imagining
freshly repacked repo with recently rebuilt 

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-13 Thread Jeff King
On Tue, Jul 12, 2016 at 10:08:08PM +0300, Kirill Smelkov wrote:

> > Or are we fine with my arguments about recency order staying the same
> > when using bitmap reachability index for object graph traversal, and this
> > way the patch is fine to go in as it is?
> 
> Since there is no reply I assume the safe way to go is to let default
> for pack-to-file case to be "not using bitmap index". Please find updated
> patch and interdiff below. I would still be grateful for feedback on
> my above use-bitmap-for-pack-to-file arguments.

Yeah, I think that is a reasonable approach. I see here you've added new
config, though, and I don't think we want that.

For your purposes, where you're driving pack-objects individually, I
think a command-line option makes more sense.

If we did want to have a flag for "use bitmaps when repacking via
repack", I think it should be "repack.useBitmaps", and git-repack should
pass the command-line option to pack-objects. pack-objects is porcelain
and should not really be reading config at all. You'll note that
pack.writeBitmaps was a mistake and got deprecated in favor of
repack.writeBitmaps. I think pack.useBitmaps is a mistake, too, but
nobody has really noticed or cared because there's no good reason to set
it (the more interesting question is: are there bitmaps available? and
if so, we try to use them).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-13 Thread Jeff King
On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:

> >   - we will not compute the same write order (which is based on
> > traversal order), leading to packs that have less efficient cache
> > characteristics
> 
> I agree the order can be not exactly the same. Still if original pack is
> packed well (with good recency order), while using bitmap we will tend
> to traverse it in close to original order.
> 
> Maybe I'm not completely right on this, but to me it looks to be the
> case because if objects in original pack are put there linearly sorted
> by recency order, and we use bitmap index to set of all reachable
> objects from a root, and then just _linearly_ gather all those objects
> from original pack by 1s in bitmap and put them in the same order into
> destination pack, the recency order won't be broken.
> 
> Or am I maybe misunderstanding something?

Yeah, I think you can go some of the way by reusing the order from the
old pack. But keep in mind that the bitmap result may also contain
objects that are not yet packed. Those will just come in a big lump at
the end of the bitmap (these are the "extended entries" in the bitmap
code).

So I think if you were to repeatedly "git repack -adb" over time, you
would get worse and worse ordering as objects are added to the
repository.

As an aside, two other things that pack order matters for: it makes the
bitmaps themselves compress better (because it increases locality of
reachability, so you get nice runs of "1" or "0" bits). It also makes
the pack-reuse code more efficient (since in an ideal case, you can just
dump a big block of data from the front of the pack). Note that the
pack-reuse code that's in upstream git isn't that great; I have a better
system on my big pile of patches to send upstream (that never seems to
get smaller; ).

> >   - we don't learn about the filename of trees and blobs, which is going
> > to make the delta step much less efficient. This might be mitigated
> > by turning on the bitmap name-hash cache; I don't recall how much
> > detail pack-objects needs on the name (i.e., the full name versus
> > just the hash).
> 
> If I understand it right, it uses only uint32_t name hash while searching. 
> From
> pack-objects.{h,c} :

Yeah, I think you are right. Not having the real names is a problem for
doing rev-list output, but I think pack-objects doesn't care (though do
note that the name-hash cache is not enabled by default).

> > There may be other subtle things, too. The general idea of tying the
> > bitmap use to pack_to_stdout is that you _do_ want to use it for
> > serving fetches and pushes, but for a full on-disk repack via gc, it's
> > more important to generate a good pack.
> 
> It is better we send good packs to clients too, right? And with
> pack.writeBitmapHashCache=true and retaining recency order (please see
> above, but again maybe I'm not completely right) to me we should be still
> generating a good pack while using bitmap reachability index for object
> graph traversal.

We do want to send the client a good pack, but it's always a tradeoff.
We could spend much more time searching for the perfect delta, but at
some point we have to decide on how much CPU to spend serving them.
Likewise, even if the bitmapped packs we send are in slightly worse
order, saving a minute of CPU time off of every clone of the kernel is a
big deal.

We also take robustness shortcuts when sending to clients. For example,
when doing an on-disk repack we re-crc32 all of the delta data we are
reusing, even if we don't actually inflate it (because we would want to
stop immediately if we see even a single bit flipped on disk). But we
don't check them when sending to a client, because we know they are
going to actually `index-pack` it and get a stronger consistency check
anyway, and don't want to waste server CPU.

The bitmaps are sort of the same. If there is a bug or corruption in the
bitmap, the worst case is that we send a broken pack to the client, who
will complain that we did not give them all of the objects. It's a
momentary problem that can be fixed. If you use them for an on-disk
repack, then the next step is usually to delete all of the old packs. So
a corruption there carries forward, and is irreversible.

As I understand your use case, it is OK to do the less careful things.
It's just that pack-objects until now has been split into two modes:
packing to a file is careful, and packing to stdout is less so. And you
want to pack to a file in the non-careful mode.

> > but I wonder if you should be using "pack-objects --stdout" yourself.
> 
> I already tried --stdout. The problem is on repository extraction we
> need to both extract the pack and index it. While `pack-object file`
> does both, for --stdout case we need to additionally index extracted
> pack with `git index-pack`, and standalone `git index-pack` is very slow
> - in my experience much slower than generating the pack itself:

Ah, right, that 

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-12 Thread Kirill Smelkov
On Fri, Jul 08, 2016 at 01:38:55PM +0300, Kirill Smelkov wrote:
> Peff first of all thanks for feedback,
> 
> On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote:
> > On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:
> > 
> > > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> > > if a repository has bitmap index, pack-objects can nicely speedup
> > > "Counting objects" graph traversal phase. That however was done only for
> > > case when resultant pack is sent to stdout, not written into a file.
> > > 
> > > We can teach pack-objects to use bitmap index for initial object
> > > counting phase when generating resultant pack file too:
> > 
> > I'm not sure this is a good idea in general. When bitmaps are in use, we
> > cannot fill out the details in the object-packing list as thoroughly. In
> > particular:
> > 
> >   - we will not compute the same write order (which is based on
> > traversal order), leading to packs that have less efficient cache
> > characteristics
> 
> I agree the order can be not exactly the same. Still if original pack is
> packed well (with good recency order), while using bitmap we will tend
> to traverse it in close to original order.
> 
> Maybe I'm not completely right on this, but to me it looks to be the
> case because if objects in original pack are put there linearly sorted
> by recency order, and we use bitmap index to set of all reachable
> objects from a root, and then just _linearly_ gather all those objects
> from original pack by 1s in bitmap and put them in the same order into
> destination pack, the recency order won't be broken.
> 
> Or am I maybe misunderstanding something?
> 
> Please also see below:
> 
> >   - we don't learn about the filename of trees and blobs, which is going
> > to make the delta step much less efficient. This might be mitigated
> > by turning on the bitmap name-hash cache; I don't recall how much
> > detail pack-objects needs on the name (i.e., the full name versus
> > just the hash).
> 
> If I understand it right, it uses only uint32_t name hash while searching. 
> From
> pack-objects.{h,c} :
> 
>  8< 
> struct object_entry {
>   ...
>   uint32_t hash;  /* name hint hash */
> 
> 
> /*
>  * We search for deltas in a list sorted by type, by filename hash, and then
>  * by size, so that we see progressively smaller and smaller files.
>  * That's because we prefer deltas to be from the bigger file
>  * to the smaller -- deletes are potentially cheaper, but perhaps
>  * more importantly, the bigger file is likely the more recent
>  * one.  The deepest deltas are therefore the oldest objects which are
>  * less susceptible to be accessed often.
>  */
> static int type_size_sort(const void *_a, const void *_b)
> {
> const struct object_entry *a = *(struct object_entry **)_a;
> const struct object_entry *b = *(struct object_entry **)_b;
> 
> if (a->type > b->type)
> return -1;
> if (a->type < b->type) 
> return 1;
> if (a->hash > b->hash)
> return -1;
> if (a->hash < b->hash)
> return 1;
>   ...
>  8< 
> 
> Documentation/technical/pack-heuristics.txt also confirms this:
> 
>  8< 
> ...
>  The quote from the above linus should be rewritten a
> bit (wait for it):
> - first sort by type.  Different objects never delta with
>   each other.
> - then sort by filename/dirname.  hash of the basename
>   occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
>   DIR_BITS are for the hash of leading path elements.
> 
> ...
> 
> If I might add, the trick is to make files that _might_ be similar be
> located close to each other in the hash buckets based on their file
> names.  It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
> "Makefile" all landed in the same bucket due to their common basename,
> "Makefile". However, now they land in "close" buckets.
> 
> The algorithm allows not just for the _same_ bucket, but for _close_
> buckets to be considered delta candidates.  The rationale is
> essentially that files, like Makefiles, often have very similar
> content no matter what directory they live in.
>  8< 
> 
> 
> So yes, exactly as you say with pack.writeBitmapHashCache=true (ae4f07fb) the
> delta-search heuristics is almost as efficient as with just raw filenames.
> 
> I can confirm this also via e.g. (with my patch applied) :
> 
>  8< 
> $ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs 
> erp5pack-plain
> Counting objects: 627171, done.
> Compressing objects: 100% (176949/176949), done.
> 50570987560d481742af4a8083028c2322a0534a
> Writing objects: 100% (627171/627171), done.
> Total 627171 (delta 439404), reused 594820 (delta 410210)
> 
> real0m37.272s
> user0m33.648s
> sys 

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-08 Thread Kirill Smelkov
Peff first of all thanks for feedback,

On Thu, Jul 07, 2016 at 04:52:23PM -0400, Jeff King wrote:
> On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:
> 
> > Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> > if a repository has bitmap index, pack-objects can nicely speedup
> > "Counting objects" graph traversal phase. That however was done only for
> > case when resultant pack is sent to stdout, not written into a file.
> > 
> > We can teach pack-objects to use bitmap index for initial object
> > counting phase when generating resultant pack file too:
> 
> I'm not sure this is a good idea in general. When bitmaps are in use, we
> cannot fill out the details in the object-packing list as thoroughly. In
> particular:
> 
>   - we will not compute the same write order (which is based on
> traversal order), leading to packs that have less efficient cache
> characteristics

I agree the order can be not exactly the same. Still if original pack is
packed well (with good recency order), while using bitmap we will tend
to traverse it in close to original order.

Maybe I'm not completely right on this, but to me it looks to be the
case because if objects in original pack are put there linearly sorted
by recency order, and we use bitmap index to set of all reachable
objects from a root, and then just _linearly_ gather all those objects
from original pack by 1s in bitmap and put them in the same order into
destination pack, the recency order won't be broken.

Or am I maybe misunderstanding something?

Please also see below:

>   - we don't learn about the filename of trees and blobs, which is going
> to make the delta step much less efficient. This might be mitigated
> by turning on the bitmap name-hash cache; I don't recall how much
> detail pack-objects needs on the name (i.e., the full name versus
> just the hash).

If I understand it right, it uses only uint32_t name hash while searching. From
pack-objects.{h,c} :

 8< 
struct object_entry {
...
uint32_t hash;  /* name hint hash */


/*
 * We search for deltas in a list sorted by type, by filename hash, and then
 * by size, so that we see progressively smaller and smaller files.
 * That's because we prefer deltas to be from the bigger file
 * to the smaller -- deletes are potentially cheaper, but perhaps
 * more importantly, the bigger file is likely the more recent
 * one.  The deepest deltas are therefore the oldest objects which are
 * less susceptible to be accessed often.
 */
static int type_size_sort(const void *_a, const void *_b)
{
const struct object_entry *a = *(struct object_entry **)_a;
const struct object_entry *b = *(struct object_entry **)_b;

if (a->type > b->type)
return -1;
if (a->type < b->type) 
return 1;
if (a->hash > b->hash)
return -1;
if (a->hash < b->hash)
return 1;
...
 8< 

Documentation/technical/pack-heuristics.txt also confirms this:

 8< 
...
 The quote from the above linus should be rewritten a
bit (wait for it):
- first sort by type.  Different objects never delta with
  each other.
- then sort by filename/dirname.  hash of the basename
  occupies the top BITS_PER_INT-DIR_BITS bits, and bottom
  DIR_BITS are for the hash of leading path elements.

...

If I might add, the trick is to make files that _might_ be similar be
located close to each other in the hash buckets based on their file
names.  It used to be that "foo/Makefile", "bar/baz/quux/Makefile" and
"Makefile" all landed in the same bucket due to their common basename,
"Makefile". However, now they land in "close" buckets.

The algorithm allows not just for the _same_ bucket, but for _close_
buckets to be considered delta candidates.  The rationale is
essentially that files, like Makefiles, often have very similar
content no matter what directory they live in.
 8< 


So yes, exactly as you say with pack.writeBitmapHashCache=true (ae4f07fb) the
delta-search heuristics is almost as efficient as with just raw filenames.

I can confirm this also via e.g. (with my patch applied) :

 8< 
$ time echo 0186ac99 | git pack-objects --no-use-bitmap-index --revs 
erp5pack-plain
Counting objects: 627171, done.
Compressing objects: 100% (176949/176949), done.
50570987560d481742af4a8083028c2322a0534a
Writing objects: 100% (627171/627171), done.
Total 627171 (delta 439404), reused 594820 (delta 410210)

real0m37.272s
user0m33.648s
sys 0m1.580s

$ time echo 0186ac99 | git pack-objects --revs erp5pack-bitmap
Counting objects: 627171, done.
Compressing objects: 100% (176914/176914), done.
7c15a9b1eca1326e679297b217c5a48954625ca2
Writing objects: 100% (627171/627171), done.
Total 627171 (delta 439484), reused 594855 (delta 410245)

re

Re: [PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-07 Thread Jeff King
On Thu, Jul 07, 2016 at 10:09:17PM +0300, Kirill Smelkov wrote:

> Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
> if a repository has bitmap index, pack-objects can nicely speedup
> "Counting objects" graph traversal phase. That however was done only for
> case when resultant pack is sent to stdout, not written into a file.
> 
> We can teach pack-objects to use bitmap index for initial object
> counting phase when generating resultant pack file too:

I'm not sure this is a good idea in general. When bitmaps are in use, we
cannot fill out the details in the object-packing list as thoroughly. In
particular:

  - we will not compute the same write order (which is based on
traversal order), leading to packs that have less efficient cache
characteristics

  - we don't learn about the filename of trees and blobs, which is going
to make the delta step much less efficient. This might be mitigated
by turning on the bitmap name-hash cache; I don't recall how much
detail pack-objects needs on the name (i.e., the full name versus
just the hash).

There may be other subtle things, too. The general idea of tying the
bitmap use to pack_to_stdout is that you _do_ want to use it for
serving fetches and pushes, but for a full on-disk repack via gc, it's
more important to generate a good pack.

Your use case:

> git-backup extracts many packs on repositories restoration. That was my
> initial motivation for the patch.

Seems to be somewhere in between. I'm not sure I understand how you're
invoking pack-objects here, but I wonder if you should be using
"pack-objects --stdout" yourself.

But even if it is the right thing for your use case to be using bitmaps
to generate an on-disk bitmap, I think we should be making sure it
_doesn't_ trigger when doing a normal repack.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] pack-objects: Use reachability bitmap index when generating non-stdout pack too

2016-07-07 Thread Kirill Smelkov
Starting from 6b8fda2d (pack-objects: use bitmaps when packing objects)
if a repository has bitmap index, pack-objects can nicely speedup
"Counting objects" graph traversal phase. That however was done only for
case when resultant pack is sent to stdout, not written into a file.

We can teach pack-objects to use bitmap index for initial object
counting phase when generating resultant pack file too:

- if we know bitmap index generation is not enabled for resultant pack:

  Current code has singleton bitmap_git so cannot work simultaneously
  with two bitmap indices.

- if we keep pack reuse enabled still only for "send-to-stdout" case:

  Because on pack reuse raw entries are directly written out to destination
  pack by write_reused_pack() bypassing needed for pack index generation
  bookkeeping done by regular codepath in write_one() and friends.

  (at least that's my understanding after briefly looking at the code)

We also need to care and teach add_object_entry_from_bitmap() to respect
--local via not adding nonlocal loose object to resultant pack (this
is bitmap-codepath counterpart of daae0625 (pack-objects: extend --local
to mean ignore non-local loose objects too) -- not to break 'loose
objects in alternate ODB are not repacked' in t7700-repack.sh .

Otherwise all git tests pass, and for pack-objects -> file we get nice
speedup:

erp5.git[1] (~230MB) extracted from ~ 5GB lab.nexedi.com backup
repository managed by git-backup[2] via

time echo 0186ac99 | git pack-objects --revs erp5pack

before:  37.2s
after:   26.2s

And for `git repack -adb` packed git.git

time echo 5c589a73 | git pack-objects --revs gitpack

before:   7.1s
after:3.6s

i.e. it can be 30% - 50% speedup for pack extraction.

git-backup extracts many packs on repositories restoration. That was my
initial motivation for the patch.

[1] https://lab.nexedi.com/nexedi/erp5
[2] https://lab.nexedi.com/kirr/git-backup

Cc: Vicent Marti 
Cc: Jeff King 
Signed-off-by: Kirill Smelkov 
---
 builtin/pack-objects.c  | 7 +--
 t/t5310-pack-bitmaps.sh | 9 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a2f8cfd..be0ebe8 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1052,6 +1052,9 @@ static int add_object_entry_from_bitmap(const unsigned 
char *sha1,
 {
uint32_t index_pos;
 
+   if (local && has_loose_object_nonlocal(sha1))
+   return 0;
+
if (have_duplicate_entry(sha1, 0, &index_pos))
return 0;
 
@@ -2488,7 +2491,7 @@ static int get_object_list_from_bitmap(struct rev_info 
*revs)
if (prepare_bitmap_walk(revs) < 0)
return -1;
 
-   if (pack_options_allow_reuse() &&
+   if (pack_options_allow_reuse() && pack_to_stdout &&
!reuse_partial_packfile_from_bitmap(
&reuse_packfile,
&reuse_packfile_objects,
@@ -2773,7 +2776,7 @@ int cmd_pack_objects(int argc, const char **argv, const 
char *prefix)
if (!rev_list_all || !rev_list_reflog || !rev_list_index)
unpack_unreachable_expiration = 0;
 
-   if (!use_internal_rev_list || !pack_to_stdout || 
is_repository_shallow())
+   if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) 
|| is_repository_shallow())
use_bitmap_index = 0;
 
if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5310-pack-bitmaps.sh b/t/t5310-pack-bitmaps.sh
index 3893afd..533fc31 100755
--- a/t/t5310-pack-bitmaps.sh
+++ b/t/t5310-pack-bitmaps.sh
@@ -118,6 +118,15 @@ test_expect_success 'incremental repack can disable 
bitmaps' '
git repack -d --no-write-bitmap-index
 '
 
+test_expect_success 'pack-objects to file can use bitmap' '
+   # make sure we still have 1 bitmap index from previous tests
+   ls .git/objects/pack/ | grep bitmap >output &&
+   test_line_count = 1 output &&
+   # pack-objects uses bitmap index by default, when it is available
+   packsha1=$(git pack-objects --all mypack output &&
-- 
2.9.0.431.gb11dac7.dirty
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html