On Fri, Mar 16, 2018 at 11:33:55AM -0700, Junio C Hamano wrote:

> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
> 
> A well-built packfile tends to has newer objects in base form and
> has delta that goes in backward direction (older objects are
> represented as delta against newer ones).  This helps warlking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
> 
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptgraphically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I coincidentally was doing some experiments in this area a few weeks
ago, and found a few things:

  1. The ordering makes a _huge_ difference for accessing trees and
     blobs.

  2. Pack order (not pack-idx order) is actually the best order, since
     it tends to follow the delta patterns (it's close to traversal
     order, but packs delta families more tightly).

  3. None of this really matters for commits, since we almost never
     store them as deltas anyway.

Here are a few experiments people can do themselves to demonstrate (my
numbers here are all from linux.git, which is sort of a wort-case
for bad ordering because its size stresses the default delta cache):

  [every object in sha1 order: slow]
  $ time git cat-file --batch-all-objects --batch >/dev/null
  real  8m44.041s
  user  8m31.359s
  sys   0m12.262s

  [every object from a traversal: faster, but --objects traversals are
   actually CPU heavy due to all of the hash lookups for each tree. Note
   not just wall-clock time but the CPU since it's split across two
   processes]
  $ time git rev-list --objects --all |
         cut -d' ' -f2 |
         git cat-file --batch >/dev/null
  real  1m2.667s
  user  0m58.537s
  sys   0m32.392s

  [every object in pack order: fastest. This is due to skipping the
   traversal overhead, and should use our delta cache quite efficiently.
   I'm assuming a single pack and no loose objects here, but the
   performance should generalize since accessing the "big" pack
   dominates]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
         sort -n |
         cut -d' ' -f2 |
         git cat-file --batch >/dev/null
  real  0m51.718s
  user  0m50.963s
  sys   0m7.068s

  [just commits, sha1 order: not horrible]
  $ time git cat-file --batch-all-objects --batch-check='%(objecttype) 
%(objectname)' |
         grep ^commit |
         cut -d' ' -f2 |
         git cat-file --batch >/dev/null
  real  0m8.115s
  user  0m14.033s
  sys   0m1.170s

  [just commits, pack order: slightly worse due to the extra piping, but
   obviously that could be done more quickly internally]
  $ time git show-index <$(ls .git/objects/pack/*.idx) |
         sort -n |
         cut -d' ' -f2 |
         git cat-file --batch-check='%(objecttype) %(objectname)' |
         grep ^commit |
         cut -d' ' -f2 |
         git cat-file --batch >/dev/null
  real  0m21.670s
  user  0m24.867s
  sys   0m9.600s

  [and the reason is that hardly any commits get deltas]
  $ git cat-file --batch-all-objects --batch-check='%(objecttype) %(deltabase)' 
|
    grep ^commit >commits
  $ wc -l commits
  692596
  $ grep -v '0000000000000000000000000000000000000000' commits | wc -l
  18856

For the purposes of this patch series, I don't think the order matters
much, since we're only dealing with commits. For doing --batch-check, I
think the sha1 ordering given by "cat-file --batch-all-objects" is
convenient, and doesn't have a big impact on performance. But it's
_awful_ for --batch. I think we may want to add a sorting option to just
return the objects in the original packfile order.

-Peff

Reply via email to