On Fri, Mar 16, 2018 at 08:48:49PM +0100, SZEDER Gábor wrote:

> I came up with a different explanation back then: we are only interested
> in commit objects when creating the commit graph, and only a small-ish
> fraction of all objects are commit objects, so the "enumerate objects in
> packfiles" approach has to look at a lot more objects:
> 
>   # in my git fork
>   $ git rev-list --all --objects |cut -d' ' -f1 |\
>     git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
>   $ grep -c ^commit type-size
>   53754
>   $ wc -l type-size
>   244723 type-size
> 
> I.e. only about 22% (53754/244723) of all objects are commit objects.
> 
> Furthermore, in order to look at an object it has to be zlib inflated
> first, and since commit objects tend to be much smaller than trees and
> especially blobs, there are far fewer bytes to inflate:
> 
>   $ grep ^commit type-size |cut -d' ' -f2 |avg
>   34395730 / 53754 = 639
>   $ cat type-size |cut -d' ' -f2 |avg
>   3866685744 / 244723 = 15800
> 
> So a simple revision walk inflates less than 1% of the bytes that the
> "enumerate objects in packfiles" approach has to inflate.

I don't think this is quite accurate. It's true that we have to
_consider_ every object, but Git is smart enough not to inflate each one
to find its type. For loose objects we just inflate the header. For
packed objects, we either pick the type directly out of the packfile
header (for a non-delta) or can walk the delta chain (without actually
looking at the data bytes!) until we hit the base.
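To make that concrete, here's a throwaway-repo sketch (assumes git is
installed; repo path and commit are made up for illustration) showing that
`cat-file --batch-check` can report an object's type without us ever asking
for its content:

```shell
# Create a scratch repo with a single empty commit.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m demo

# Ask only for the type; no object content is requested.
type=$(git -C "$repo" rev-parse HEAD |
       git -C "$repo" cat-file --batch-check='%(objecttype)')
echo "$type"   # prints "commit"
rm -rf "$repo"
```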

So starting from scratch:

  git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

is in the same ballpark for most repos as:

  git rev-list --all |
  git cat-file --batch

though in my timings the traversal is a little bit faster (and I'd
expect that to remain the case when doing it all in a single process,
since the traversal only follows commit links, whereas processing the
object list has to do the type lookup for each object before deciding
whether to inflate it).
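One way to convince yourself the two approaches select the same commits is a
scratch-repo check (a sketch, assuming git is installed; in a fresh repo every
object is reachable, though in general `--batch-all-objects` can also see
commits that no ref reaches):

```shell
# Scratch repo with a few empty commits.
repo=$(mktemp -d)
git -C "$repo" init -q
for i in 1 2 3; do
  git -C "$repo" -c user.name=t -c user.email=t@example.com \
      commit -q --allow-empty -m "c$i"
done

# Enumerate all objects and keep only the commits...
enumerated=$(git -C "$repo" cat-file --batch-all-objects \
                 --batch-check='%(objecttype) %(objectname)' |
             grep ^commit | cut -d' ' -f2 | sort)

# ...versus walking the commit graph from the refs.
walked=$(git -C "$repo" rev-list --all | sort)

[ "$enumerated" = "$walked" ] && result=match || result=differ
echo "$result"
rm -rf "$repo"
```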

I'm not sure, though, if that edge would remain for incremental updates.
For instance, after we take in some new objects via "fetch", the
traversal strategy would want to do something like:

  git rev-list $new_tips --not --all |
  git cat-file --batch

whose performance will depend on the refs _currently_ in the repository,
as we load them as UNINTERESTING tips for the walk. Whereas doing:

  git show-index <.git/objects/pack/the-one-new-packfile.idx |
  cut -d' ' -f2 |
  git cat-file --batch-check='%(objecttype) %(objectname)' |
  grep ^commit |
  cut -d' ' -f2 |
  git cat-file --batch

always scales exactly with the size of the new objects (obviously that's
kind of baroque and this would all be done internally, but I'm trying to
demonstrate the algorithmic complexity). I'm not sure what the plan
would be if we explode loose objects, though. ;)
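For what it's worth, the idx-driven pipeline above can be exercised end-to-end
on a scratch repo (a sketch, assuming git is installed; here we repack to get
"the one packfile" rather than fetching):

```shell
# Scratch repo with one commit, packed into a single packfile.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m one
git -C "$repo" repack -adq
idx=$(ls "$repo"/.git/objects/pack/*.idx)

# List the pack's objects from the idx and count the commits;
# this touches only what is in that one pack.
commits=$(git show-index <"$idx" |
          cut -d' ' -f2 |
          git -C "$repo" cat-file --batch-check='%(objecttype)' |
          grep -c ^commit)
echo "$commits"   # the pack holds exactly one commit
rm -rf "$repo"
```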

-Peff
