This patch series includes a rewrite of the previous
multi-pack-index RFC [1] using the feedback from the
commit-graph feature.

I based this series on 'next' as it requires the
recent object-store patches.

The multi-pack-index (MIDX) is explained fully in
the design document 'Documentation/technical/midx.txt'.
The short description is that the MIDX stores the
information from all of the IDX files in a pack
directory. The crucial design decision is that the
IDX files still exist, so we can fall back to the IDX
files if there is any issue with the MIDX (or core.midx
is set to false, or a user downgrades Git, etc.)

The MIDX feature has been part of our GVFS releases
for a few months (since the RFC). It has behaved well,
indexing over 31 million commits and trees across up
to 250 packfiles. These MIDX files are nearly 1GB in
size and take ~20 seconds to rewrite when adding new
IDX information. This ~20s mark is something I'd like
to improve, and I mention how to make the file
incremental (similar to split-index) in the design
document. I also want to make the commit-graph file
incremental, so I'd like to do that at the same time
after both the MIDX and commit-graph are stable.


Lookup Speedups
---------------

When looking for an object, Git uses an most-recently-
used (MRU) cache of packfiles. This does pretty well to
minimize the number of misses when searching through
packfiles for an object, especially if there is one
"big" packfile that contains most of the objets (so it
will rarely miss and is usually one of the first two
packfiles in the list). The MIDX does provide a way
to remove these misses, improving lookup time. However,
this lookup time greatly depends on the arrangement of
the packfiles.

For instance, if you take the Linux repository and repack
using `git repack -adfF --max-pack-size=128m` then all
commits will be in one packfile, all trees will be in
a small set of packfiles and organized well so 'git
rev-list --objects HEAD^{tree}' only inspects one or two
packfiles.

GVFS has the notion of a "prefetch packfile". These are
packfiles that are precomputed by cache servers to
contain the commits and trees introduced to the remote
each day. GVFS downloads these packfiles and places them
in an alternate. Since these are organized by "first
time introduced" and the working directory is so large,
the MRU misses are significant when performing a checkout
and updating the .git/index file.

To test the performance in this situation, I created a
script that organizes the Linux repository in a similar
fashion. I split the commit history into 50 parts by
creating branches on every 10,000 commits of the first-
parent history. Then, `git rev-list --objects A ^B`
provides the list of objects reachable from A but not B,
so I could send that to `git pack-objects` to create
these "time-based" packfiles. With these 50 packfiles
(deleting the old one from my fresh clone, and deleting
all tags as they were no longer on-disk) I could then
test 'git rev-list --objects HEAD^{tree}' and see:

        Before: 0.17s
        After:  0.13s
        % Diff: -23.5%

By adding logic to count hits and misses to bsearch_pack,
I was able to see that the command above calls that
method 266,930 times with a hit rate of 33%. The MIDX
has the same number of calls with a 100% hit rate.



Abbreviation Speedups
---------------------

To fully disambiguate an abbreviation, we must iterate
through all packfiles to ensure no collision exists in
any packfile. This requires O(P log N) time. With the
MIDX, this is only O(log N) time. Our standard test [2]
is 'git log --oneline --parents --raw' because it writes
many abbreviations while also doing a lot of other work
(walking commits and trees to compute the raw diff).

For a copy of the Linux repository with 50 packfiles
split by time, we observed the following:

        Before: 100.5 s
        After:   58.2 s
        % Diff: -59.7%


Request for Review Attention
----------------------------

I tried my best to take the feedback from the commit-graph
feature and apply it to this feature. I also worked to
follow the object-store refactoring as I could. I also have
some local commits that create a 'verify' subcommand and
integrate with 'fsck' similar to the commit-graph, but I'll
leave those for a later series (and review is still underway
for that part of the commit-graph).

One place where I could use some guidance is related to the
current state of 'the_hash_algo' patches. The file format
allows a different "hash version" which then indicates the
length of the hash. What's the best way to ensure this
feature doesn't cause extra pain in the hash-agnostic series?
This will inform how I go back and make the commit-graph
feature better in this area, too.


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180107181459.222909-1-dsto...@microsoft.com/T/#u
    Previous MIDX RFC.

[2] https://public-inbox.org/git/20171012120220.226427-1-dsto...@microsoft.com/
    A patch series on abbreviation speedups


Derrick Stolee (23):
  midx: add design document
  midx: add midx format details to pack-format.txt
  midx: add midx builtin
  midx: add 'write' subcommand and basic wiring
  midx: write header information to lockfile
  midx: struct midxed_git and 'read' subcommand
  midx: expand test data
  midx: read packfiles from pack directory
  midx: write pack names in chunk
  midx: write a lookup into the pack names chunk
  midx: sort and deduplicate objects from packfiles
  midx: write object ids in a chunk
  midx: write object id fanout chunk
  midx: write object offsets
  midx: create core.midx config setting
  midx: prepare midxed_git struct
  midx: read objects from multi-pack-index
  midx: use midx in abbreviation calculations
  midx: use existing midx when writing new one
  midx: use midx in approximate_object_count
  midx: prevent duplicate packfile loads
  midx: use midx to find ref-deltas
  midx: clear midx on repack

 .gitignore                              |   1 +
 Documentation/config.txt                |   4 +
 Documentation/git-midx.txt              |  60 ++
 Documentation/technical/midx.txt        | 109 +++
 Documentation/technical/pack-format.txt |  82 +++
 Makefile                                |   2 +
 builtin.h                               |   1 +
 builtin/midx.c                          |  88 +++
 builtin/repack.c                        |   8 +
 cache.h                                 |   1 +
 command-list.txt                        |   1 +
 config.c                                |   5 +
 environment.c                           |   1 +
 git.c                                   |   1 +
 midx.c                                  | 923 ++++++++++++++++++++++++
 midx.h                                  |  23 +
 object-store.h                          |  35 +
 packfile.c                              |  47 +-
 packfile.h                              |   1 +
 sha1-name.c                             |  70 ++
 t/t5319-midx.sh                         | 192 +++++
 21 files changed, 1652 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 Documentation/technical/midx.txt
 create mode 100644 builtin/midx.c
 create mode 100644 midx.c
 create mode 100644 midx.h
 create mode 100755 t/t5319-midx.sh

-- 
2.18.0.rc1

Reply via email to