Re: [PATCH v4 05/10] commit-graph: always load commit-graph information

2018-05-01 Thread Derrick Stolee

On 4/29/2018 6:14 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Most code paths load commits using lookup_commit() and then
parse_commit().

And this automatically loads commit graph if needed, thanks to changes
in parse_commit_gently(), which parse_commit() uses.


 In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

I guess the problem is that we cannot just add parse_commit_in_graph()
like we did in parse_commit_gently(), for some reason?  Like for example
that parse_commit_gently() uses parse_commit_buffer() - which could have
been handled by moving parse_commit_in_graph() down the call chain from
parse_commit_gently() to parse_commit_buffer()... if not the fact that
check_commit() also uses parse_commit_buffer(), but it does not want to
load commit graph.  Am I right?


If a caller uses parse_commit_buffer() directly, then we will guarantee 
that all values in the struct commit that would be loaded from the 
buffer are loaded from the buffer. This means we do NOT load the root 
tree id or commit date from the commit-graph file. We do still need to 
load the data that is not available in the buffer, such as graph_pos and 
generation.





With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Is it generation number, or generation number and position in commit
graph?


We don't need to ensure the graph_pos (the commit will never be 
re-parsed, so we will not try to find it in the commit-graph file 
again), but we DO need to ensure the generation (or our commit walks 
will be incorrect). We get the graph_pos as a side-effect.





Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

I think this commit would be easier to review if it was split into pure
refactoring part (extracting fill_commit_graph_info() and
find_commit_in_graph()).  On the other hand the refactoring was needed
to reduce code duplication between existing parse_commit_in_graph() and
new load_commit_graph_info() functions.

I guess that the difference between parse_commit_in_graph() and
load_commit_graph_info() is that the former cares only about having just
enough information that is needed for parse_commit_gently() - and does
not load graph data if commit is parsed, while the latter is about
loading commit-graph data like generation numbers.


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 45 ++---
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 46 insertions(+), 20 deletions(-)

I wonder if it would be possible to add tests for this feature, for
example that commit-graph is read when it should (including those branch
lookups), and is not read when the feature should be disabled.

But the only way to test it I can think of is a stupid one: create
invalid commit graph, and check that git fails as expected (trying to
read said malformed file), and does not fail if commit graph feature is
disabled.

Let me reorder files (BTW, is there a way for Git to put *.h files
before *.c files in diff?) for easier review:


diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);
  
+/*

+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
  struct tree *get_commit_tree_in_graph(const struct commit *c);
  
  struct commit_graph {

diff --git a/commit-graph.c b/commit-graph.c
index 047fa9fca5..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return &commit_list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}

The comment in the header file commit-graph.h talks about filling
graph_pos and generation members of the given commit, but I don't see
filling graph_pos member here.


We 

Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.

Here's a non-arbitrary reason: the object types are ordered
topologically (ignoring self-references):

tag -> commit, tree, blob
commit -> tree
tree -> blob

Thanks. I'll add a patch with that comment to v2.
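
To make the ordering concrete, here is a small standalone sketch of the
modulus trick used by sort_ambiguous() in the patch. The enum values
mirror the numbering the patch comment describes (commit is 1, tag is 4);
the sketch_ names, type_rank() and compare_types() are made up for this
illustration and are not part of Git.

```c
#include <assert.h>

/* Same numeric values as described in the patch comment; names are illustrative. */
enum sketch_object_type {
	SKETCH_OBJ_COMMIT = 1,
	SKETCH_OBJ_TREE = 2,
	SKETCH_OBJ_BLOB = 3,
	SKETCH_OBJ_TAG = 4
};

/* Display rank: tag (0), commit (1), tree (2), blob (3). */
static int type_rank(enum sketch_object_type type)
{
	assert(type >= SKETCH_OBJ_COMMIT && type <= SKETCH_OBJ_TAG);
	return (int)type % 4;
}

/* Negative if a is shown before b, positive if after, zero for equal types. */
static int compare_types(enum sketch_object_type a, enum sketch_object_type b)
{
	return type_rank(a) - type_rank(b);
}
```

The resulting rank order (tag, commit, tree, blob) is the same as the
topological order listed above: each type can only point at types with
a higher rank, or at its own type.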


@@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct 
object_id *oid,
ds.fn = NULL;
advise(_("The candidates are:"));
-   for_each_abbrev(ds.hex_pfx, show_ambiguous_object, &ds);
+   for_each_abbrev(ds.hex_pfx, collect_ambiguous, &collect);
+   QSORT(collect.oid, collect.nr, sort_ambiguous);

I was wondering how the old code sorted by SHA even when the ambiguous
objects were loaded from different sources (multiple pack-files, loose
objects). Turns out that for_each_abbrev() does its own sort after
collecting the SHAs and then calls the given function pointer only
once per distinct object. This avoids multiple instances of the same
object, which may appear multiple times across pack-files.

I only ask because now we are doing two sorts. I wonder if it would be
more elegant to provide your sorting algorithm to for_each_abbrev()
and let it call show_ambiguous_object as before.

Another question is if we should use this sort generally for all calls
to for_each_abbrev(). The only other case I see is in
builtin/rev-parse.c.

When preparing v2 I realized how confusing this was, so I'd added this
to the commit message of my WIP re-roll which should explain this:

 A note on the implementation: I started out with something much
 simpler which just replaced oid_array_sort() in sha1-array.c with a
 custom sort function before calling oid_array_for_each_unique(). But
 then dumbly noticed that it doesn't work because the output function
 was tangled up with the code added in fad6b9e590 ("for_each_abbrev:
 drop duplicate objects", 2016-09-26) to ensure we don't display
 duplicate objects.
 
 That's why we're doing two passes here, first we need to sort the list

 and de-duplicate the objects, then sort them in our custom order, and
 finally output them without re-sorting them. I suppose we could also
 make oid_array_for_each_unique() maintain a hashmap of emitted
 objects, but that would increase its memory profile and wouldn't be
 worth the complexity for this one-off use-case,
 oid_array_for_each_unique() is used in many other places.


How would sorting in our custom order before de-duplicating fail the 
de-duplication? We will still pair identical OIDs as consecutive 
elements and oid_array_for_each_unique only cares about consecutive 
elements having distinct OIDs, not lex-ordered OIDs.


Perhaps the noise is because we rely on oid_array_sort() to mark the 
array as sorted inside oid_array_for_each_unique(), but that could be 
remedied by calling our QSORT() inside for_each_abbrev() and marking the 
array as sorted before calling oid_array_for_each_unique().
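
A minimal sketch of that argument, outside of Git: sort with a custom
(type, then id) order and drop entries that equal their predecessor.
The struct and the sketch_ function names are hypothetical stand-ins,
not the oid_array API, and the sketch assumes that two entries with the
same id always have the same type.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an object: a type rank plus a short hex id. */
struct sketch_entry {
	int type_rank;	/* 0 = tag, 1 = commit, 2 = tree, 3 = blob */
	char hex[8];
};

/* Order by type rank first, then by id; identical entries compare as 0. */
static int sketch_cmp(const void *a, const void *b)
{
	const struct sketch_entry *x = a, *y = b;

	if (x->type_rank != y->type_rank)
		return x->type_rank < y->type_rank ? -1 : 1;
	return strcmp(x->hex, y->hex);
}

/*
 * Sort with the custom order, then keep only entries that differ from
 * the previously kept one. Returns the number of unique entries, which
 * end up at the front of the array.
 */
static size_t sketch_sort_unique(struct sketch_entry *e, size_t nr)
{
	size_t i, out = 0;

	assert(e != NULL || nr == 0);
	qsort(e, nr, sizeof(*e), sketch_cmp);
	for (i = 0; i < nr; i++) {
		if (out && !sketch_cmp(&e[out - 1], &e[i]))
			continue;	/* duplicate of its neighbour */
		e[out++] = e[i];
	}
	return out;
}
```

Because duplicates compare equal under any consistent order, they always
end up next to each other, so the single pass is enough.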


(Again, my comments are not meant to block this series.)

Thanks,
-Stolee


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

Change the output emitted when an ambiguous object is encountered so
that we show tags first, then commits, followed by trees, and finally
blobs. Within each type we show objects in hashcmp() order. Before this
change the objects were only ordered by hashcmp().

The reason for doing this is that the output looks better as a result,
e.g. the v2.17.0 tag before this change on "git show e8f2" would
display:

 hint: The candidates are:
 hint:   e8f2093055 tree
 hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
 hint:   e8f21d02f7 blob
 hint:   e8f21d577c blob
 hint:   e8f25a3a50 tree
 hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
 hint:   e8f2650052 tag v2.17.0
 hint:   e8f2867228 blob
 hint:   e8f28d537c tree
 hint:   e8f2a35526 blob
 hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
 hint:   e8f2cf6ec0 tree

Now we'll instead show:

 hint:   e8f2650052 tag v2.17.0
 hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
 hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
 hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
 hint:   e8f2093055 tree
 hint:   e8f25a3a50 tree
 hint:   e8f28d537c tree
 hint:   e8f2cf6ec0 tree
 hint:   e8f21d02f7 blob
 hint:   e8f21d577c blob
 hint:   e8f2867228 blob
 hint:   e8f2a35526 blob

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.


Here's a non-arbitrary reason: the object types are ordered 
topologically (ignoring self-references):


tag -> commit, tree, blob
commit -> tree
tree -> blob


Signed-off-by: Ævar Arnfjörð Bjarmason 
---
  sha1-array.c | 15 +++
  sha1-array.h |  3 +++
  sha1-name.c  | 37 -
  3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf847..48bd9e9230 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array)
array->sorted = 0;
  }
  
+

+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data)
+{
+   int i;
+
+   for (i = 0; i < array->nr; i++) {
+   int ret = fn(array->oid + i, data);
+   if (ret)
+   return ret;
+   }
+   return 0;
+}
+
  int oid_array_for_each_unique(struct oid_array *array,
for_each_oid_fn fn,
void *data)
diff --git a/sha1-array.h b/sha1-array.h
index 1e1d24b009..232bf95017 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array);
  
  typedef int (*for_each_oid_fn)(const struct object_id *oid,

   void *data);
+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data);
  int oid_array_for_each_unique(struct oid_array *array,
  for_each_oid_fn fn,
  void *data);
diff --git a/sha1-name.c b/sha1-name.c
index 9d7bbd3e96..46d8b1afa6 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, 
void *data)
return 0;
  }
  
+static int sort_ambiguous(const void *a, const void *b)

+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree, blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/


I appreciate this comment. Clever things should be marked as such.


+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
  {
@@ -409,6 +437,8 @@ 

Re: [PATCH 0/9] get_short_oid UI improvements

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

I started out just wanting to do 04/09 so I'd get prettier output, but
then noticed that ^{tag}, ^{commit}, ^{blob} and ^{tree} didn't behave
as expected with the disambiguation output, and that core.disambiguate
had never been documented.

Ævar Arnfjörð Bjarmason (9):
   sha1-name.c: remove stray newline
   sha1-array.h: align function arguments
   sha1-name.c: move around the collect_ambiguous() function
   get_short_oid: sort ambiguous objects by type, then SHA-1
   get_short_oid: learn to disambiguate by ^{tag}
   get_short_oid: learn to disambiguate by ^{blob}
   get_short_oid / peel_onion: ^{tree} should mean tree, not treeish
   get_short_oid / peel_onion: ^{commit} should mean commit, not commitish
   config doc: document core.disambiguate

  Documentation/config.txt| 14 ++
  cache.h |  5 ++-
  sha1-array.c| 15 +++
  sha1-array.h|  7 ++-
  sha1-name.c | 69 -
  t/t1512-rev-parse-disambiguation.sh | 32 ++---
  6 files changed, 120 insertions(+), 22 deletions(-)



This is a good series. Please take a look at my suggestion in Patch 4/9, 
but feel free to keep this series as written.


Reviewed-by: Derrick Stolee <dsto...@microsoft.com>


Re: [PATCH v4 10/10] commit-graph.txt: update design document

2018-05-01 Thread Derrick Stolee

On 4/30/2018 7:32 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Looks good.


---
  Documentation/technical/commit-graph.txt | 30 +++-
  1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
  generation number and walk until reaching commits with known generation
  number.
  
+We use the macro GENERATION_NUMBER_INFINITY = 0xFFFFFFFF to mark commits not

+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits,

The Linux kernel commit graph has a maximum of 513 commits sharing the
same generation number, but on average only 5.43 commits share a
generation number, with standard deviation 10.70; the median is
even lower: it is 2, with 5.35 median absolute deviation (MAD).

So on average it would be a few extra commits.  Right.


   but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.

As I wrote before, handling those corner cases is more complicated, but
not that complicated.  We could simply use the stronger condition if both
generation numbers are ordinary generation numbers, and the weaker
condition when at least one generation number has one of those special
values.
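
A rough sketch of that mixed check, as a standalone function. It
assumes the three macros have the 32-bit values 0xFFFFFFFF, 0x3FFFFFFF
and 0, and that A and B are different commits; the two helper names are
made up for this illustration and do not exist in the series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values of the special generation numbers. */
#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
#define GENERATION_NUMBER_MAX 0x3FFFFFFF
#define GENERATION_NUMBER_ZERO 0

/* An ordinary generation number is strictly between _ZERO and _MAX. */
static int is_ordinary_generation(uint32_t gen)
{
	return gen > GENERATION_NUMBER_ZERO && gen < GENERATION_NUMBER_MAX;
}

/*
 * Returns 1 when the generation numbers alone show that commit A
 * cannot reach a different commit B, and 0 when a walk is still needed.
 */
static int generation_rules_out_reach(uint32_t gen_a, uint32_t gen_b)
{
	if (is_ordinary_generation(gen_a) && is_ordinary_generation(gen_b))
		return gen_a <= gen_b;	/* stronger condition */
	return gen_a < gen_b;		/* weaker condition */
}
```

With this, two distinct commits that share an ordinary generation
number are ruled out immediately, while two commits capped at _MAX
still need a walk.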


+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFFFFFF for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.

Ordinary generation numbers, where the stronger condition holds, are those
with GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX.


+
  Design Details
  --
  
@@ -98,17 +121,12 @@ Future Work

  - The 'commit-graph' subcommand does not have a "verify" mode that is
necessary for integration with fsck.
  
-- The file format includes room for precomputed generation numbers. These

-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-

Good.


  - After computing and storing generation numbers, we must make graph
walks aware of generation numbers to gain the performance benefits they
enable. This will mostly be accomplished by swapping a commit-date-ordered
priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
  
-- paint_down_to_common()

  - 'log --topo-order'

Other possible candidates:

- remove_redundant() - see comment in previous patch
- still_interesting() - where Git uses date slop to stop walking
  too far


remove_redundant() will be included in v5, thanks.

Instead of "still_interesting()" I'll add "git tag --merged" as the 
candidate to consider, as discussed in [1].


[1] https://public-inbox.org/git/87fu3g67ry@lant.ki.iif.hu/t/#u
    "branch --contains / tag --merged inconsistency"



  
  - Currently, parse_commit_gently() requires filling in the root tree

One important issue left is handling features that change view of
project history, and their interaction with commit-graph feature.

What would happen, if we turn on commit-graph feature, generate commit
graph file, and then:

   * use graft file or remove graft entries to cut history, or remove cut
 or join two [independent] histories.
   * use git-replace mechanism to do the same
   * in shallow clone, deepen or shorten the clone

What would happen if without re-generating commit-graph file (assuming
that Git wouldn't do it f

Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee



On 5/1/2018 9:39 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


From: Ævar Arnfjörð Bjarmason <ava...@gmail.com>

Here is what I mean by sorting during for_each_abbrev(). This seems to work for
me, so I don't know what the issue is with this one-pass approach.
[...]
+static int sort_ambiguous(const void *a, const void *b)
+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree, blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/
+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
  {
@@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, 
void *cb_data)
find_short_object_filename(&ds);
find_short_packed_object(&ds);

+   QSORT(collect.oid, collect.nr, sort_ambiguous);
+   collect.sorted = 1;
+

Yes this works. You're right. I wasn't trying to intentionally omit
stuff in my recent 878t93zh60@evledraar.gmail.com, I'd just written
this code some days ago and forgotten why I did what I was doing (and
this is hard to test for), but it's all coming back to me now.

The actual requirement for oid_array_for_each_unique() working properly
is that you've got to feed it in hash order,


To work properly, duplicate entries must be consecutive. Since duplicate 
entries have the same type, our sort satisfies this condition.



but my new sort_ambiguous()
still does that (barring any SHA-1 collisions, at which point we have
bigger problems), so two passes aren't needed. So yes, this approach
works and is one-pass.

But that's just an implementation detail of the current sort method,
when I wrote this I was initially playing with other sort orders,
e.g. sorting SHAs regardless of type by the mtime of the file I found
them in. With this approach I'd start printing duplicates if I changed
the internals of sort_ambiguous() like that.


That makes sense.


But I think it's extremely implausible that we'll start sorting things
like that, so I'll just take this method of doing it and add some
comment saying we must hashcmp() the entries in our own sort function
for the de-duplication to work, I don't see us ever changing that.


Sounds good.

Thanks,
-Stolee


Re: [PATCH v5 09/11] commit: use generation number in remove_redundant()

2018-05-01 Thread Derrick Stolee



On 5/1/2018 8:47 AM, Derrick Stolee wrote:

The static remove_redundant() method is used to filter a list
of commits by removing those that are reachable from another
commit in the list. This is used to remove all possible merge-
bases except a maximal, mutually independent set.

To determine which commits are independent, we use a number of
paint_down_to_common() walks and use the PARENT1, PARENT2 flags
to determine reachability. Since we only care about reachability
and not the full set of merge-bases between 'one' and 'twos', we
can use the 'min_generation' parameter to short-circuit the walk.

When no commit-graph exists, there is no change in behavior.

For a copy of the Linux repository, we measured the following
performance improvements:

git merge-base v3.3 v4.5

Before: 234 ms
  After: 208 ms
  Rel %: -11%

git merge-base v4.3 v4.5

Before: 102 ms
  After:  83 ms
  Rel %: -19%

The experiments above were chosen to demonstrate that we are
improving the filtering of the merge-base set. In the first
example, more time is spent walking the history to find the
set of merge bases before the remove_redundant() call. The
starting commits are closer together in the second example,
therefore more time is spent in remove_redundant(). The relative
change in performance differs as expected.

Reported-by: Jakub Narebski <jna...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 9875feec01..5064db4e61 100644
--- a/commit.c
+++ b/commit.c
@@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
parse_commit(array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;


This initialization should be

    uint32_t min_generation = array[i]->generation;

since the assignment (using j) below skips the ith commit.

  
  		if (redundant[i])

continue;
@@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
continue;
filled_index[filled] = j;
work[filled++] = array[j];
+
+   if (array[j]->generation < min_generation)
+   min_generation = array[j]->generation;
}
-   common = paint_down_to_common(array[i], filled, work, 0);
+   common = paint_down_to_common(array[i], filled, work,
+ min_generation);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
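
To illustrate the initialization fix suggested above, here is a
standalone sketch that computes the cutoff over plain generation values
instead of struct commit. The function name is hypothetical, and unlike
the real loop it does not skip entries already marked redundant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Smallest generation among generation[0..cnt-1], computed the way the
 * corrected loop would: start from entry i, then fold in every other
 * entry j.
 */
static uint32_t sketch_min_generation(const uint32_t *generation,
				      size_t cnt, size_t i)
{
	uint32_t min_generation;
	size_t j;

	assert(i < cnt);
	min_generation = generation[i];
	for (j = 0; j < cnt; j++) {
		if (j == i)
			continue;
		if (generation[j] < min_generation)
			min_generation = generation[j];
	}
	return min_generation;
}
```

As I read the patch, starting from GENERATION_NUMBER_INFINITY instead
would leave the cutoff above array[i]'s own generation whenever
array[i] is the lowest commit, so the walk could stop before reaching
it.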




Re: branch --contains / tag --merged inconsistency

2018-04-30 Thread Derrick Stolee

On 4/27/2018 12:03 PM, SZEDER Gábor wrote:

Szia Feri,


I'm moving the IRC discussion here, because this might be a bug report
in the end.  So, kindly try these steps (103 MB free space required):

$ git clone https://github.com/ClusterLabs/pacemaker.git && cd pacemaker
[...]
$ git branch --contains Pacemaker-0.6.1
* master
$ git tag --merged master | fgrep Pacemaker-0.6
Pacemaker-0.6.0
Pacemaker-0.6.2
Pacemaker-0.6.3
Pacemaker-0.6.4
Pacemaker-0.6.5
Pacemaker-0.6.6

Notice that Pacemaker-0.6.1 is missing from the output.  Kind people on
IRC didn't find a quick explanation, and we all had to go eventually.
Is this expected behavior?  Reproduced with git 2.11.0 and 2.17.0.

The commit pointed to by the tag Pacemaker-0.6.1 and its parent have a
serious clock skew, i.e. they are a few months older than their parents:

$ git log --format='%h %ad %cd%d%n%s' --date=short 
Pacemaker-0.6.1^..47a8ef4c
47a8ef4ce 2008-02-15 2008-02-15
 Low: TE: Logging - display the op's magic field for unexpected and foreign 
events
b9cfcd6b4 2007-12-10 2007-12-10 (tag: Pacemaker-0.6.2)
 haresources2cib.py: set default-action-timeout to the default (20s)
52e7793e0 2007-12-10 2007-12-10
 haresources2cib.py: update ra parameters lists
dea277271 2008-02-14 2008-02-14
 Medium: Build: Turn on snmp support in rpm packages (patch from MATSUDA, 
Daiki)
f418742fe 2008-02-14 2008-02-14
 Low: Build: Update the .spec file with the one used by build service
ccfa716a5 2008-02-14 2008-02-14
 Medium: SNMP: Allow the snmp subagent to be built (patch from MATSUDA, 
Daiki)
50f0ade2d 2008-02-14 2008-02-14
 Low: Build: Update last release number
90f11667f 2008-02-14 2008-02-14
 Medium: Tools: Make sure the autoconf variables in haresources2cib are 
expanded
9d2383c46 2008-02-11 2008-02-11 (tag: Pacemaker-0.6.1)
 High: cib: Ensure the archived file hits the disk before returning

(branch|tag|describe|...) (--contains|--merged) use the commit timestamp
information as a heuristic to avoid traversing parts of the history,
which makes these operations, especially on big histories, an order of
magnitude or two faster.  Yeah, commit timestamps can't always be
trusted, but skewed commits are rare, and skewed commits with this much
skew are even rarer.

I'm not sure how (or if it's at all possible) to turn off this
timestamp-based optimisation.


This is actually a bit more complicated. The "--merged" check in 'git 
tag' uses a different mechanism to detect which tags are reachable. It 
uses a revision walk starting at the "merge commit" (master in your 
case) and all tags with the "limited" option (to ensure the walk happens 
during prepare_revision_walk()) but marks the merge commit as 
UNINTERESTING. The limit_list() method stops when all commits are marked 
UNINTERESTING - minus some "slop" related to the commits that start the 
walk.


One important note: the set of tags is important here. If you add a new 
tag to the root commit (git tag MyTag a2d71961f) then the walk succeeds 
by ensuring it walks until MyTag. This gets around the clock skew issue. 
There may be other more-recent tags with a clock-skew issue, but since 
Pacemaker-0.6.0 is the oldest tag, that requires the walk to continue 
until at least that date.


The commit-walk machinery in revision.c is rather complicated, and is 
used for a lot of different reasons, such as "git log" and this 
application in "git tag". It is on my list to refactor this code to use 
the commit-graph and generation numbers, but as we can see by this 
example, it is not easy to tease out what is happening in the code.


In a world where generation numbers are expected to be available, we 
could rewrite do_merge_filter() in ref-filter.c to call into 
paint_down_to_common() in commit.c using the new "min_generation" 
marker. By assigning the tags to be in the "twos" list and the merge 
commit in the "one" commit, we can check if the tags have the PARENT1 
flag after the walk in paint_down_to_common(). Due to the static nature 
of paint_down_to_common(), we will likely want to abstract this into an 
external method in commit.c, say can_reach_many(struct commit *from, 
struct commit_list *to).



FWIW, much work is being done on a cached commit graph including commit
generation numbers, which will solve this issue both correctly and more
efficiently.  Perhaps it will already be included in the next release.


The work in ds/generation-numbers is focused on the "git tag --contains" 
method, which does return correctly here (it is the reverse of the 
--merged condition):


Which tags can reach Pacemaker-0.6.1?

$ git tag --contains Pacemaker-0.6.1
(returns a big list)

This is the actual reverse lookup (which branches contain this tag?)

$ git branch --contains Pacemaker-0.6.1 | grep master
* master

These commands work despite clock skew. The commit-graph feature makes 
them faster.


Thanks,
-Stolee



Re: [PATCH v4 02/10] commit: add generation number to struct commmit

2018-04-30 Thread Derrick Stolee

On 4/28/2018 6:35 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
   more than the maximum generation number among the parents of A.
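
The recursive definition can be sketched on a toy commit structure as
follows. This is only an illustration: struct sketch_commit and
sketch_generation() are made-up names, a stored value of 0 is used here
to mean "not computed yet", and the sketch neither caps the value nor
avoids deep recursion on long histories.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PARENTS 2

/* Hypothetical miniature commit; generation 0 means "not computed yet". */
struct sketch_commit {
	struct sketch_commit *parents[SKETCH_MAX_PARENTS];
	size_t nr_parents;
	uint32_t generation;
};

/* One more than the maximum generation among the parents; 1 for a root. */
static uint32_t sketch_generation(struct sketch_commit *c)
{
	uint32_t max_parent = 0;
	size_t i;

	assert(c->nr_parents <= SKETCH_MAX_PARENTS);
	if (c->generation)
		return c->generation;	/* already computed */
	for (i = 0; i < c->nr_parents; i++) {
		uint32_t g = sketch_generation(c->parents[i]);
		if (g > max_parent)
			max_parent = g;
	}
	c->generation = max_parent + 1;
	return c->generation;
}
```

For a merge whose parents have generation numbers 3 and 2, the function
yields 4.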

Very minor nitpick: it would be more readable wrapped differently:

   * If a commit A has parents, then the generation number of A is
 one more than the maximum generation number among parents of A.

Very minor nitpick: possibly "parents", not "the parents", but I am
not native English speaker.


Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0xFFFFFFFF
GENERATION_NUMBER_MAX 0x3FFFFFFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Good explanation; I wonder if we want to have it in some shortened form
also in comments, and not only in the commit message.


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  alloc.c| 1 +
  commit-graph.c | 2 ++
  commit.h   | 4 
  3 files changed, 7 insertions(+)

I have reordered patches to make it easier to review.


diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
  #include "pretty.h"
  
  #define COMMIT_NOT_FROM_GRAPH 0xFFFFFFFF
+#define GENERATION_NUMBER_INFINITY 0xFFFFFFFF
+#define GENERATION_NUMBER_MAX 0x3FFFFFFF
+#define GENERATION_NUMBER_ZERO 0

I wonder if it wouldn't be good to have some short in-line comments
explaining those constants, or a block comment above them.

  
  struct commit_list {

struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
  };
  
  extern int save_commit_buffer;

All right, simple addition of the new field.  Nothing to go wrong here.

Sidenote: With 0x7FFFFFFF being (if I am not wrong) the maximum graph_pos
and the maximum number of nodes in the commit graph, we won't hit the
0x3FFFFFFF generation number limit for all except very, very linear
histories.


Both of these limits are far away from being realistic. But we could 
extend the maximum graph_pos independently from the maximum generation 
number now that we have the "capped" logic.





diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
  }

All right, start with initializing it with "not from commit-graph" value
after allocation.

  
diff --git a/commit-graph.c b/commit-graph.c

index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
  
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

+

I guess we should not worry about these "magical constants" sprinkled
here, like "+ 8" above.

Let's examine how it goes, taking a look at commit-graph-format.txt
in Documentation/technical/commit-graph-format.txt

  * The first H (g->hash_len) bytes are for the OID of the root tree.
  * The next 8 bytes are for the positions of the first two parents [...]

So 'commit_data + g->hash_len + 8' is our offset from the start of
commit data.  All right.

   * The next 8 bytes store the generation number of the commit and
 the commit time in seconds since EPOCH.  The generation number
 uses the higher 30 bits of the first 4 bytes. [...]

The higher 30 bits of the 4 bytes, which is 32 bits, means that we need
to shift 32-bit value 2 bits right, so that we get lower 30 bits of
32-bit value.  All right.

   All 4-byte numbers are in network order.

Shouldn't it be ntohl() to convert from network order to host order, and
not get_be32()?  I guess they are the same (network order is big-endian
order), and get_be32() is what rest of git uses...


ntohl() takes a 32-bit value, while get_be32() takes a pointer. This 
makes pulling network-bytes out of streams much cleaner with get_be32(), 
so I try to use that whenever possible.
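
To make the pointer-versus-value point concrete, here is a minimal sketch of reading the generation number out of one row of commit data. read_be32() is a stand-in written for this example (it only mimics what git's get_be32() does), and generation_from_row() is a hypothetical name. The layout (root tree OID, 8 bytes of parent positions, then the generation number in the upper 30 bits of the next 4 bytes) is taken from the format text quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for get_be32(): read a big-endian 32-bit value from a byte pointer. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) |
	       ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) |
	       (uint32_t)p[3];
}

/*
 * Hypothetical helper: skip the root tree OID (hash_len bytes) and
 * the two parent positions (8 bytes), then drop the low 2 bits,
 * which belong to the commit date.
 */
static uint32_t generation_from_row(const unsigned char *commit_data,
				    size_t hash_len)
{
	return read_be32(commit_data + hash_len + 8) >> 2;
}
```

Because the helper takes a pointer, no aligned temporary is needed, which is what an ntohl()-based version would require.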




Re: [PATCH] coccinelle: avoid wrong transformation suggestions from commit.cocci

2018-04-30 Thread Derrick Stolee

On 4/30/2018 5:31 AM, SZEDER Gábor wrote:

The semantic patch 'contrib/coccinelle/commit.cocci' added in
2e27bd7731 (treewide: replace maybe_tree with accessor methods,
2018-04-06) is supposed to "ensure that all references to the
'maybe_tree' member of struct commit are either mutations or accesses
through get_commit_tree()".  So get_commit_tree() clearly must be able
to directly access the 'maybe_tree' member, and 'commit.cocci' has a
bit of a roundabout workaround to ensure that get_commit_tree()'s
direct access in its return statement is not transformed: after all
references to 'maybe_tree' have been transformed to a call to
get_commit_tree(), including the reference in get_commit_tree()
itself, the last rule transforms back a 'return get_commit_tree()'
statement, back then found only in get_commit_tree() itself, to a
direct access.

Unfortunately, already the very next commit shows that this workaround
is insufficient: 7b8a21dba1 (commit-graph: lazy-load trees for
commits, 2018-04-06) extends get_commit_tree() with a condition
directly accessing the 'maybe_tree' member, and Coccinelle with
'commit.cocci' promptly detects it and suggests a transformation to
avoid it.  This transformation is clearly wrong, because calling
get_commit_tree() to access 'maybe_tree' _in_ get_commit_tree() would
obviously lead to recursion.  Furthermore, the same commit added
another, more specialized getter function get_commit_tree_in_graph(),
whose legitimate direct access to 'maybe_tree' triggers a similar
wrong transformation suggestion.


Thanks for catching this, Szeder. Sorry for the noise.


Exclude both of these getter functions from the general rule in
'commit.cocci' that matches their direct accesses to 'maybe_tree'.
Also exclude load_tree_for_commit(), which, as static helper function
of get_commit_tree_in_graph(), has legitimate direct access to
'maybe_tree' as well.


This is an interesting feature of Coccinelle. Happy to learn it.


The last rule transforming back 'return get_commit_tree()' statements
to direct accesses thus became unnecessary, remove it.

Signed-off-by: SZEDER Gábor <szeder@gmail.com>


I applied this locally on 'next' and ran the check. It succeeded with no 
changes.


Thanks!

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>



---
  contrib/coccinelle/commit.cocci | 10 ++++------
  1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/contrib/coccinelle/commit.cocci b/contrib/coccinelle/commit.cocci
index ac38525941..a7e9215ffc 100644
--- a/contrib/coccinelle/commit.cocci
+++ b/contrib/coccinelle/commit.cocci
@@ -10,11 +10,15 @@ expression c;
  - c->maybe_tree->object.oid.hash
  + get_commit_tree_oid(c)->hash
  
+// These excluded functions must access c->maybe_tree directly.

  @@
+identifier f !~ 
"^(get_commit_tree|get_commit_tree_in_graph|load_tree_for_commit)$";
  expression c;
  @@
+  f(...) {...
  - c->maybe_tree
  + get_commit_tree(c)
+  ...}
  
  @@

  expression c;
@@ -22,9 +26,3 @@ expression s;
  @@
  - get_commit_tree(c) = s
  + c->maybe_tree = s
-
-@@
-expression c;
-@@
-- return get_commit_tree(c);
-+ return c->maybe_tree;




[RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-07 Thread Derrick Stolee
Commentary: This file format uses the large offsets from the pack-index
version 2 format, but drops the CRC32 hashes from that format.

Also: I included the HASH footer at the end only because it is already in
the pack and pack-index formats, but not because it is particularly useful
here. If possible, I'd like to remove it and speed up MIDX writes.

-- >8 --

Add a technical documentation file describing the design
for the multi-pack index (MIDX). Includes current limitations
and future work.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/multi-pack-index.txt | 149 +++
 1 file changed, 149 insertions(+)
 create mode 100644 Documentation/technical/multi-pack-index.txt

diff --git a/Documentation/technical/multi-pack-index.txt 
b/Documentation/technical/multi-pack-index.txt
new file mode 100644
index 0000000000..d31b03dec5
--- /dev/null
+++ b/Documentation/technical/multi-pack-index.txt
@@ -0,0 +1,149 @@
+Multi-Pack-Index (MIDX) Design Notes
+====================================
+
+The Git object directory contains a 'pack' directory containing
+packfiles (with suffix ".pack") and pack-indexes (with suffix
+".idx"). The pack-indexes provide a way to lookup objects and
+navigate to their offset within the pack, but these must come
+in pairs with the packfiles. This pairing depends on the file
+names, as the pack-index differs only in suffix with its pack-
+file. While the pack-indexes provide fast lookup per packfile,
+this performance degrades as the number of packfiles increases,
+because abbreviations need to inspect every packfile and we are
+more likely to have a miss on our most-recently-used packfile.
+For some large repositories, repacking into a single packfile
+is not feasible due to storage space or excessive repack times.
+
+The multi-pack-index (MIDX for short, with suffix ".midx")
+stores a list of objects and their offsets into multiple pack-
+files. It contains:
+
+- A list of packfile names.
+- A sorted list of object IDs.
+- A list of metadata for the ith object ID including:
+  - A value j referring to the jth packfile.
+  - An offset within the jth packfile for the object.
+- If large offsets are required, we use another list of large
+  offsets similar to version 2 pack-indexes.
+
+Thus, we can provide O(log N) lookup time for any number
+of packfiles.
+
+A new config setting 'core.midx' must be enabled before writing
+or reading MIDX files.
+
+The MIDX files are updated by the 'midx' builtin with the
+following common parameter combinations:
+
+- 'git midx' gives the hash of the current MIDX head.
+- 'git midx --write --update-head --delete-expired' writes a new
+  MIDX file, points the MIDX head to that file, and deletes the
+  existing MIDX file if out-of-date.
+- 'git midx --read' lists some basic information about the current
+  MIDX head. Used for basic tests.
+- 'git midx --clear' deletes the current MIDX head.
+
+Design Details
+--------------
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.
+
+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+- If an object appears in multiple packfiles, then only one copy
+  is stored in the MIDX. This has a possible performance issue:
+  If an object appears as the delta-base of multiple objects from
+  multiple packs, then cross-pack delta calculations may slow down.
+  This is currently only theoretical and has not been demonstrated
+  to be a measurable issue.
+
+Current Limitations
+-------------------
+
+- MIDX files are managed only by the midx builtin and are not
+  automatically updated on clone or fetch.
+
+- There is no '--verify' option for the midx builtin to verify
+  the contents of the MIDX file against the pack contents.
+
+- Constructing a MIDX file currently requires the single-pack
+  index for every pack being added to the MIDX.
+
+- The fsck builtin does not check MIDX files, but should.
+
+- The repack builtin is not aware of the MIDX files, and may
+  invalidate the MIDX files by deleting existing packfiles. The
+  MIDX may also be e

[RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-07 Thread Derrick Stolee
This RFC includes a new way to index the objects in multiple packs
using one file, called the multi-pack index (MIDX).

The commits are split into parts as follows:

[01] - A full design document.

[02] - The full file format for MIDX files.

[03] - Creation of core.midx config setting.

[04-12] - Creation of "midx" builtin that writes, reads, and deletes
  MIDX files.

[13-18] - Consume MIDX files for abbreviations and object loads.

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

* Design a proper command-line interface for constructing and checking
  MIDX files. The current "midx" builtin is probably inadequate.

* Determine what additional changes are needed before the feature can
  be merged. Specifically, I'm interested in the interactions with
  repack and fsck. The current patch also does not update the MIDX on
  a fetch (which adds a packfile), though doing so would be valuable. Whenever
  possible, I tried to leave out features that could be added in a
  later patch.

* Consider splitting this patch into multiple patches, such as:

    i. The MIDX design document.
   ii. The command-line interface for building and reading MIDX files.
  iii. Integrations with abbreviations and object lookups.

Please do not send any style nits to this patch, as I expect the code to
change dramatically before we consider merging.

I created three copies of the Linux repo with 1, 24, and 127 packfiles
each using 'git repack -adfF --max-pack-size=[64m|16m]'. These copies
gave significant performance improvements on the following command:

git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+----------
        1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |   +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache works very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

git midx --write --update-head --delete-expired --pack-dir=

As that release deploys we will gather more specific numbers on the
performance improvements and report them in this thread.

Derrick Stolee (18):
  docs: Multi-Pack Index (MIDX) Design Notes
  midx: specify midx file format
  midx: create core.midx config setting
  midx: write multi-pack indexes for an object list
  midx: create midx builtin with --write mode
  midx: add t5318-midx.sh test script
  midx: teach midx --write to update midx-head
  midx: teach git-midx to read midx file details
  midx: find details of nth object in midx
  midx: use existing midx when writing
  midx: teach git-midx to clear midx files
  midx: teach git-midx to delete expired files
  t5318-midx.h: confirm git actions are stable
  midx: load midx files when loading packs
  midx: use midx for approximate object count
  midx: nth_midxed_object_oid() and bsearch_midx()
  sha1_name: use midx for abbreviations
  packfile: use midx for object loads

 .gitignore   |   1 +
 Documentation/config.txt |   3 +
 Documentation/git-midx.txt   | 106 
 Documentation/technical/multi-pack-index.txt | 149 +
 Documentation/technical/pack-format.txt  |  85 +++
 Makefile |   2 +
 builtin.h|   1 +
 builtin/midx.c   | 352 +++
 cache.h  |   1 +
 command-list.txt |   1 +
 config.c |   5 +
 environment.c|   2 +
 git.c|   1 +
 midx.c   | 850 +++
 midx.h   | 136 +
 packfile.c   |  79 ++-
 packfile.h  

[RFC PATCH 02/18] midx: specify midx file format

2018-01-07 Thread Derrick Stolee
A multi-pack-index (MIDX) file indexes the objects in multiple
packfiles in a single pack directory. After a simple fixed-size
header, the version 1 file format uses chunks to specify
different regions of the data that correspond to different types
of data, including:

- List of OIDs in lex-order
- A fanout table into the OID list
- List of packfile names (relative to pack directory)
- List of object metadata
- Large offsets (if needed)

By adding extra optional chunks, we can easily extend this format
without invalidating written v1 files.
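
To illustrate why optional chunks are cheap to add, here is a minimal sketch of a reader walking the chunk lookup table. It follows the layout given in the format text of this patch (12-byte rows: a 4-byte chunk id, then an 8-byte offset, with id 0 as the terminating label). The names find_chunk(), read_be32() and read_be64() are hypothetical, and storing the 8-byte offset in network order is my assumption, since the text only states it for 4-byte numbers.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_LOOKUP_WIDTH 12 /* 4-byte id + 8-byte offset */

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Assumption: 8-byte offsets are big-endian like the 4-byte values. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Scan up to nr_rows rows of the lookup table for chunk_id.
 * Returns 1 and stores the chunk's file offset in *offset when found,
 * 0 otherwise. Rows with ids the reader does not know are skipped,
 * which is what lets a later version add chunks.
 */
static int find_chunk(const unsigned char *table, uint32_t nr_rows,
		      uint32_t chunk_id, uint64_t *offset)
{
	uint32_t i;

	for (i = 0; i < nr_rows; i++) {
		const unsigned char *row = table + (size_t)i * CHUNK_LOOKUP_WIDTH;
		uint32_t id = read_be32(row);

		if (!id)
			break; /* terminating label */
		if (id == chunk_id) {
			*offset = read_be64(row + 4);
			return 1;
		}
	}
	return 0;
}
```

A reader would call this once per required chunk and fail only when a required id is missing.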

One value in the header corresponds to a number of "base MIDX files"
and will always be zero until the value is used in a later patch.

We considered using a hashtable format instead of an ordered list
of objects to reduce the O(log N) lookups to O(1) lookups, but two
main issues arose that led us to abandon the idea:

- Extra space required to ensure collision counts were low.
- We need to identify the two lexicographically closest OIDs for
  fast abbreviations. Binary search allows this.

The current solution presents multiple packfiles as if they were
packed into a single packfile with one pack-index.
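
To show what the ordered list buys us over a hashtable, here is a minimal sketch of a lookup that uses the fanout table to narrow the range and then binary-searches the sorted OIDs. The name bsearch_oids() is hypothetical, the fanout array is assumed to be already converted to host order, and the OID length is fixed at 20 for simplicity. On a miss, the returned position is where the OID would be inserted, so the entries at pos - 1 and pos are its two lexicographically closest neighbours, which is what abbreviation needs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OID_LEN 20 /* assumption: SHA-1 sized object ids */

/*
 * fanout[i] holds the number of OIDs whose first byte is at most i.
 * Returns 1 and sets *pos to the index of target when present;
 * returns 0 and sets *pos to the insertion point otherwise.
 */
static int bsearch_oids(const unsigned char *oids, const uint32_t *fanout,
			const unsigned char *target, uint32_t *pos)
{
	uint32_t lo = target[0] ? fanout[target[0] - 1] : 0;
	uint32_t hi = fanout[target[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oids + (size_t)mid * OID_LEN, target, OID_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	*pos = lo;
	return 0;
}
```

A hashtable could answer the exact-match case in O(1), but it has no notion of "the entry just before" or "just after", so the miss case above would need a second structure.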

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/pack-format.txt | 85 +
 1 file changed, 85 insertions(+)

diff --git a/Documentation/technical/pack-format.txt 
b/Documentation/technical/pack-format.txt
index 8e5bf60be3..ab459ef142 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -160,3 +160,88 @@ Pack file entry: <+
 corresponding packfile.
 
 20-byte SHA-1-checksum of all of the above.
+
+== midx-*.midx files have the following format:
+
+The multi-pack-index (MIDX) files refer to multiple pack-files.
+
+In order to allow extensions that add extra data to the MIDX format, we
+organize the body into "chunks" and provide a lookup table at the beginning
+of the body. The header includes certain length values, such as the number
+of packs, the number of base MIDX files, hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'M', 'I', 'D', 'X'}
+
+   4-byte version number:
+   Git currently only supports version 1.
+
+   1-byte Object Id Version (1 = SHA-1)
+
+   1-byte Object Id Length (H)
+
+   1-byte number (I) of base multi-pack-index files:
+   This value is currently always zero.
+
+   1-byte number (C) of "chunks"
+
+   4-byte number (P) of pack files
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes providing the chunk offsets:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.
+   (Chunks are provided in file-order, so you can infer the length
+   using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of objects (N). The number of objects with first byte
+   value i is (F[i] - F[i-1]) for i > 0.
+
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all objects in the MIDX are stored in lexicographic
+   order in this chunk.
+
+   Object Offsets (ID: {'O', 'O', 'F', 'F'}) (N * 8 bytes)
+   Stores two 4-byte values for every object.
+   1: The pack-int-id for the pack storing this object.
+   2: The offset within the pack.
+   If all offsets are less than 2^31, then the large offset chunk
+   will not exist and offsets are stored as in IDX v1.
+   If there is at least one offset value larger than 2^32-1, then
+   the large offset chunk must exist. If the large offset chunk
+   exists and the 31st bit is on, then removing that bit reveals
+   the row in the large offsets containing the 8-byte offset of
+   this object.
+
+   [Optional] Object Large Offsets (ID: {'L', 'O', 'F', 'F'})
+   8-byte offsets into large packfiles.
+
+   Packfile Name Lookup (ID: {'P', 'L', 'O', 'O'}) (P * 4 bytes)
+   P * 4 bytes storing the offset in the packfile name chunk for
+   the null-terminated string containing the filename for the
+   ith packfile. The filename is relative to the MIDX file's parent
+   directory.
+
+   Packfile Names (ID: {'P', 'N', 'A', 'M'})
+   Stores the packfile names as concatenated, null-terminated strings.
+   Packfiles must be list

[RFC PATCH 11/18] midx: teach git-midx to clear midx files

2018-01-07 Thread Derrick Stolee
As a way to troubleshoot unforeseen problems with MIDX files, provide
a way to delete "midx-head" and the MIDX it references.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-midx.txt | 12 +++++++++++-
 builtin/midx.c             | 31 ++++++++++++++++++++++++++++++-
 t/t5318-midx.sh            |  9 +++++++++
 3 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 3eeed1d969..c184d3a593 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 --------
 [verse]
-'git midx' [--write|--read]  [--pack-dir ]
+'git midx' [--write|--read|--clear]  [--pack-dir ]
 
 DESCRIPTION
 -----------
@@ -22,6 +22,10 @@ OPTIONS
Use given directory for the location of packfiles, pack-indexes,
and MIDX files.
 
+--clear::
+   If specified, delete the midx file specified by midx-head, and
+   midx-head. (Cannot be combined with --write or --read.)
+
 --read::
If specified, read a midx file specified by the midx-head file
and output basic details about the midx file. (Cannot be combined
@@ -79,6 +83,12 @@ $ git midx --read
 $ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
 
 
+* Delete the current midx-head and the file it references.
++
+---
+$ git midx --clear
+---
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/midx.c b/builtin/midx.c
index aff2085771..b30ef36ff8 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -11,11 +11,13 @@
 static char const * const builtin_midx_usage[] = {
N_("git midx [--pack-dir ]"),
N_("git midx --write [--update-head] [--pack-dir ]"),
+   N_("git midx --clear [--pack-dir ]"),
NULL
 };
 
 static struct opts_midx {
const char *pack_dir;
+   int clear;
int read;
const char *midx_id;
int write;
@@ -24,6 +26,29 @@ static struct opts_midx {
struct object_id old_midx_oid;
 } opts;
 
+static int midx_clear(void)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   char *old_path;
+
+   if (!opts.has_existing)
+   return 0;
+
+   strbuf_addstr(&head_path, opts.pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "midx-head");
+   if (remove_path(head_path.buf))
+   die("failed to remove path %s", head_path.buf);
+   strbuf_release(&head_path);
+
+   old_path = get_midx_filename_oid(opts.pack_dir, &opts.old_midx_oid);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+   free(old_path);
+
+   return 0;
+}
+
 static int midx_read(void)
 {
struct object_id midx_oid;
@@ -263,6 +288,8 @@ int cmd_midx(int argc, const char **argv, const char 
*prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory containing set of packfile and 
pack-index pairs.") },
+   OPT_BOOL('c', "clear", &opts.clear,
+   N_("clear midx file and midx-head")),
OPT_BOOL('r', "read", &opts.read,
N_("read midx file")),
{ OPTION_STRING, 'M', "midx-id", &opts.midx_id,
@@ -287,7 +314,7 @@ int cmd_midx(int argc, const char **argv, const char 
*prefix)
 builtin_midx_options,
 builtin_midx_usage, 0);
 
-   if (opts.write + opts.read > 1)
+   if (opts.write + opts.read + opts.clear > 1)
usage_with_options(builtin_midx_usage, builtin_midx_options);
 
if (!opts.pack_dir) {
@@ -299,6 +326,8 @@ int cmd_midx(int argc, const char **argv, const char 
*prefix)
 
opts.has_existing = !!get_midx_head_oid(opts.pack_dir, 
&opts.old_midx_oid);
 
+   if (opts.clear)
+   return midx_clear();
if (opts.read)
return midx_read();
if (opts.write)
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 2e52389442..9337355ab3 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -143,4 +143,13 @@ test_expect_success 'write-midx in bare repo' \
  git midx --read >output &&
  cmp output expect'
 
+test_expect_success 'midx --clear' \
+'git midx --clear &&
+ test_path_is_missing "${baredir}/midx-${midx4}.midx" &&
+ test_path_is_missing "${baredir}/midx-head" &&
+ cd ../full &&
+ git midx --clear &&
+ test_path_is_missing "${packdir}/midx-${midx4}.midx" &&
+ test_path_is_missing "${packdir}/midx-head"'
+
 test_done
-- 
2.15.0



[RFC PATCH 15/18] midx: use midx for approximate object count

2018-01-07 Thread Derrick Stolee
The MIDX files contain a complete object count, so we can report the number
of objects in the MIDX. The count remains approximate as there may be overlap
between the packfiles not referenced by the MIDX.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 packfile.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/packfile.c b/packfile.c
index 1c0822878b..866a1f30dd 100644
--- a/packfile.c
+++ b/packfile.c
@@ -803,7 +803,8 @@ static void prepare_packed_git_one(char *objdir, int local)
if (ends_with(de->d_name, ".idx") ||
ends_with(de->d_name, ".pack") ||
ends_with(de->d_name, ".bitmap") ||
-   ends_with(de->d_name, ".keep"))
+   ends_with(de->d_name, ".keep") ||
+   ends_with(de->d_name, ".midx"))
string_list_append(&garbage, path.buf);
else
report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -828,9 +829,12 @@ unsigned long approximate_object_count(void)
static unsigned long count;
if (!approximate_object_count_valid) {
struct packed_git *p;
+   struct midxed_git *m;
 
-   prepare_packed_git();
+   prepare_packed_git_internal(1);
count = 0;
+   for (m = midxed_git; m; m = m->next)
+   count += m->num_objects;
for (p = packed_git; p; p = p->next) {
if (open_pack_index(p))
continue;
-- 
2.15.0



[RFC PATCH 13/18] t5318-midx.h: confirm git actions are stable

2018-01-07 Thread Derrick Stolee
Perform some basic read-only operations that load objects and find
abbreviations. As this functionality begins to reference MIDX files,
ensure the output matches when using MIDX files and when not using them.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/t5318-midx.sh | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 42d103c879..00be852ed3 100755
--- a/t/t5318-midx.sh
+++ b/t/t5318-midx.sh
@@ -37,6 +37,19 @@ _midx_read_expect() {
EOF
 }
 
+_midx_git_two_modes() {
+   git -c core.midx=true $1 >output
+   git -c core.midx=false $1 >expect
+}
+
+_midx_git_behavior() {
+   test_expect_success 'check normal git operations' \
+   '_midx_git_two_modes "log --patch master" &&
+cmp output expect &&
+_midx_git_two_modes "rev-list --all --objects" &&
+cmp output expect'
+}
+
 test_expect_success 'write-midx from index version 1' \
 'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 
${packdir}/test-1) &&
  midx1=$(git midx --write) &&
@@ -48,6 +61,8 @@ test_expect_success 'write-midx from index version 1' \
  git midx --read --midx-id=${midx1} >output &&
  cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx from index version 2' \
 'rm "${packdir}/test-1-${pack1}.pack" &&
  pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 
${packdir}/test-2) &&
@@ -61,6 +76,8 @@ test_expect_success 'write-midx from index version 2' \
  git midx --read> output &&
  cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'Create more objects' \
 'for i in $(test_seq 100)
  do
@@ -71,6 +88,8 @@ test_expect_success 'Create more objects' \
  git commit -m "test data 2" &&
  git branch commit2 HEAD'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx with two packs' \
 'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects 
--index-version=2 ${packdir}/test-3) &&
  midx3=$(git midx --write --update-head) &&
@@ -84,6 +103,8 @@ test_expect_success 'write-midx with two packs' \
  git midx --read >output &&
  cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'Add more packs' \
 'for i in $(test_seq 10)
  do
@@ -107,6 +128,8 @@ test_expect_success 'Add more packs' \
  git pack-objects --index-version=2 ${packdir}/test-pack output &&
  cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'write-midx with no new packs' \
 'midx5=$(git midx --write --update-head --delete-expired) &&
  test_path_is_file ${packdir}/midx-${midx5}.midx &&
@@ -127,6 +152,8 @@ test_expect_success 'write-midx with no new packs' \
  test_path_is_file ${packdir}/midx-head &&
  test $(cat ${packdir}/midx-head) = "$midx4"'
 
+_midx_git_behavior
+
 test_expect_success 'create bare repo' \
 'cd .. &&
  git clone --bare full bare &&
@@ -146,6 +173,8 @@ test_expect_success 'write-midx in bare repo' \
  git midx --read >output &&
  cmp output expect'
 
+_midx_git_behavior
+
 test_expect_success 'midx --clear' \
 'git midx --clear &&
  test_path_is_missing "${baredir}/midx-${midx4}.midx" &&
@@ -155,4 +184,6 @@ test_expect_success 'midx --clear' \
  test_path_is_missing "${packdir}/midx-${midx4}.midx" &&
  test_path_is_missing "${packdir}/midx-head"'
 
+_midx_git_behavior
+
 test_done
-- 
2.15.0



[RFC PATCH 03/18] midx: create core.midx config setting

2018-01-07 Thread Derrick Stolee
As the multi-pack-index feature is being developed, we use a config
setting 'core.midx' to enable all use of MIDX files.

Since MIDX files are designed as a way to augment the existing data
stores in Git, turning this setting off will revert to previous
behavior without needing to downgrade. This can also be a repo-
specific setting if the MIDX is misbehaving in only one repo.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h                  | 1 +
 config.c                 | 5 +++++
 environment.c            | 2 ++
 4 files changed, 11 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 64c1dbba94..dc7cb4b900 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -896,6 +896,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.midx::
+   Enable "multi-pack-index" feature. Set to true to read and write MIDX 
files.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index a2ec8c0b55..f4943d3136 100644
--- a/cache.h
+++ b/cache.h
@@ -820,6 +820,7 @@ extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
 extern const char *core_fsmonitor;
+extern int core_midx;
 
 /*
  * Include broken refs in all ref iterations, which will
diff --git a/config.c b/config.c
index e617c2018d..17f560ddc4 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.midx")) {
+   core_midx = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..57a3943849 100644
--- a/environment.c
+++ b/environment.c
@@ -78,6 +78,8 @@ int protect_hfs = PROTECT_HFS_DEFAULT;
 int protect_ntfs = PROTECT_NTFS_DEFAULT;
 const char *core_fsmonitor;
 
+int core_midx;
+
 /*
  * The character that begins a commented line in user-editable file
  * that is subject to stripspace.
-- 
2.15.0



[RFC PATCH 04/18] midx: write multi-pack indexes for an object list

2018-01-07 Thread Derrick Stolee
The write_midx_file() method takes a list of packfiles and indexed
objects with offset information and writes according to the format
in Documentation/technical/pack-format.txt. The chunks are separated
into methods.
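
As an example of what one of these per-chunk methods has to compute, here is a minimal sketch of filling the 256-entry OID fanout table from an already sorted OID list. The name compute_fanout() is hypothetical, the OID length is fixed at 20, and the values are left in host order; the real writer would convert each entry to network order before writing it.

```c
#include <stddef.h>
#include <stdint.h>

#define OID_LEN 20 /* assumption: SHA-1 sized object ids */

/*
 * oids points at nr object ids stored back to back in sorted order.
 * After the call, fanout[i] is the number of ids whose first byte is
 * at most i, so fanout[255] equals nr.
 */
static void compute_fanout(const unsigned char *oids, uint32_t nr,
			   uint32_t fanout[256])
{
	uint32_t next = 0;
	int i;

	for (i = 0; i < 256; i++) {
		/* Advance past every id whose first byte is <= i. */
		while (next < nr && oids[(size_t)next * OID_LEN] <= i)
			next++;
		fanout[i] = next;
	}
}
```

Because the list is sorted, the single cursor never moves backwards and the whole table is built in one pass over the objects.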

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile |   1 +
 midx.c   | 412 +++
 midx.h   |  42 +++
 3 files changed, 455 insertions(+)
 create mode 100644 midx.c
 create mode 100644 midx.h

diff --git a/Makefile b/Makefile
index 2a81ae22e9..d0d810951f 100644
--- a/Makefile
+++ b/Makefile
@@ -827,6 +827,7 @@ LIB_OBJS += merge.o
 LIB_OBJS += merge-blobs.o
 LIB_OBJS += merge-recursive.o
 LIB_OBJS += mergesort.o
+LIB_OBJS += midx.o
 LIB_OBJS += mru.o
 LIB_OBJS += name-hash.o
 LIB_OBJS += notes.o
diff --git a/midx.c b/midx.c
new file mode 100644
index 0000000000..5c320726ed
--- /dev/null
+++ b/midx.c
@@ -0,0 +1,412 @@
+#include "cache.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "midx.h"
+
+#define MIDX_SIGNATURE 0x4d494458 /* "MIDX" */
+#define MIDX_CHUNKID_PACKLOOKUP 0x504c4f4f /* "PLOO" */
+#define MIDX_CHUNKID_PACKNAMES 0x504e414d /* "PNAM" */
+#define MIDX_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define MIDX_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define MIDX_CHUNKID_OBJECTOFFSETS 0x4f4f4646 /* "OOFF" */
+#define MIDX_CHUNKID_LARGEOFFSETS 0x4c4f4646 /* "LOFF" */
+
+#define MIDX_VERSION_1 1
+#define MIDX_VERSION MIDX_VERSION_1
+
+#define MIDX_OID_VERSION_SHA1 1
+#define MIDX_OID_LEN_SHA1 20
+#define MIDX_OID_VERSION MIDX_OID_VERSION_SHA1
+#define MIDX_OID_LEN MIDX_OID_LEN_SHA1
+
+#define MIDX_LARGE_OFFSET_NEEDED 0x80000000
+
+char* get_midx_filename_oid(const char *pack_dir,
+   struct object_id *oid)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/midx-");
+   strbuf_addstr(&head_path, oid_to_hex(oid));
+   strbuf_addstr(&head_path, ".midx");
+
+   return strbuf_detach(&head_path, NULL);
+}
+
+struct pack_midx_details_internal {
+   uint32_t pack_int_id;
+   uint32_t internal_offset;
+};
+
+static int midx_sha1_compare(const void *_a, const void *_b)
+{
+   struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
+   struct pack_midx_entry *b = *(struct pack_midx_entry **)_b;
+   return oidcmp(&a->oid, &b->oid);
+}
+
+static void write_midx_chunk_packlookup(
+   struct sha1file *f,
+   const char **pack_names, uint32_t nr_packs)
+{
+   uint32_t i, cur_len = 0;
+
+   for (i = 0; i < nr_packs; i++) {
+   uint32_t swap_len = htonl(cur_len);
+   sha1write(f, &swap_len, 4);
+   cur_len += strlen(pack_names[i]) + 1;
+   }
+}
+
+static void write_midx_chunk_packnames(
+   struct sha1file *f,
+   const char **pack_names, uint32_t nr_packs)
+{
+   uint32_t i;
+   for (i = 0; i < nr_packs; i++)
+   sha1write(f, pack_names[i], strlen(pack_names[i]) + 1);
+}
+
+static void write_midx_chunk_oidfanout(
+   struct sha1file *f,
+   struct pack_midx_entry **objects, uint32_t nr_objects)
+{
+   struct pack_midx_entry **list = objects;
+   struct pack_midx_entry **last = objects + nr_objects;
+   uint32_t count_distinct = 0;
+   uint32_t i;
+
+   /*
+   * Write the first-level table (the list is sorted,
+   * but we use a 256-entry lookup to be able to avoid
+   * having to do eight extra binary search iterations).
+   */
+   for (i = 0; i < 256; i++) {
+   struct pack_midx_entry **next = list;
+   struct pack_midx_entry *prev = 0;
+   uint32_t swap_distinct;
+
+   while (next < last) {
+   struct pack_midx_entry *obj = *next;
+   if (obj->oid.hash[0] != i)
+   break;
+
+   if (!prev || oidcmp(&(prev->oid), &(obj->oid)))
+   count_distinct++;
+
+   prev = obj;
+   next++;
+   }
+
+   swap_distinct = htonl(count_distinct);
+   sha1write(f, &swap_distinct, 4);
+   list = next;
+   }
+}
+
+static void write_midx_chunk_oidlookup(
+   struct sha1file *f, unsigned char hash_len,
+   struct pack_midx_entry **objects, uint32_t nr_objects)
+{
+   struct pack_midx_entry **list = objects;
+   struct object_id *last_oid = 0;
+   uint32_t i;
+
+   for (i = 0; i < nr_objects; i++) {
+   struct pack_midx_entry *obj = *list++;
+
+   if (last_oid && !oidcmp(last_oid, &obj->oid))
+   continue;
+
+   last_oid = &obj->oid;
+   sha1write(f, obj->

[RFC PATCH 05/18] midx: create midx builtin with --write mode

2018-01-07 Thread Derrick Stolee
Commentary: As we extend the midx builtin, I expand the single SYNOPSIS
row of "git-midx.txt" rather than adding rows. If this builtin doesn't
change too much, I will rewrite the SYNOPSIS to be multi-lined, as in
"git-branch.txt".

-- >8 --

Create, document, and implement the first ability of the midx builtin.

The --write subcommand creates a multi-pack-index for all indexed
packfiles within a given pack directory. If none is provided, the
objects/pack directory is implied. The arguments allow specifying the
pack directory so we can add MIDX files to alternates.

The packfiles are expected to be paired with pack-indexes and are
otherwise ignored. This simplifies the implementation and also keeps
compatibility with older versions of Git (or changing core.midx to
false).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
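Note (not part of the commit message): a minimal sketch of the pairing rule
described above, where a packfile is only considered if a matching
pack-index name can be derived for it. `idx_name_for_pack` is a hypothetical
helper for illustration; the builtin itself uses strbuf helpers to rewrite
the suffix.

```c
#include <stddef.h>
#include <string.h>

/*
 * If name ends in ".pack", write the matching ".idx" name into buf.
 * Returns 0 on success, -1 if the suffix is missing or buf is too small.
 */
static int idx_name_for_pack(const char *name, char *buf, size_t len)
{
	size_t n = strlen(name);

	if (n < 5 || strcmp(name + n - 5, ".pack"))
		return -1;
	/* The result is one byte shorter than name; with NUL it needs n bytes. */
	if (n > len)
		return -1;
	memcpy(buf, name, n - 5);
	memcpy(buf + n - 5, ".idx", 5); /* copies the trailing NUL too */
	return 0;
}
```

A caller would then check that the derived ".idx" file exists and skip the
pack otherwise.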
 .gitignore |   1 +
 Documentation/git-midx.txt |  54 +
 Makefile   |   1 +
 builtin.h  |   1 +
 builtin/midx.c | 195 +
 command-list.txt   |   1 +
 git.c  |   1 +
 7 files changed, 254 insertions(+)
 create mode 100644 Documentation/git-midx.txt
 create mode 100644 builtin/midx.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..545e195f2a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -95,6 +95,7 @@
 /git-merge-subtree
 /git-mergetool
 /git-mergetool--lib
+/git-midx
 /git-mktag
 /git-mktree
 /git-name-rev
diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
new file mode 100644
index 00..17464222c1
--- /dev/null
+++ b/Documentation/git-midx.txt
@@ -0,0 +1,54 @@
+git-midx(1)
+
+
+NAME
+
+git-midx - Write and verify multi-pack-indexes (MIDX files).
+
+
+SYNOPSIS
+
+[verse]
+'git midx' --write [--pack-dir ]
+
+DESCRIPTION
+---
+Write a MIDX file.
+
+OPTIONS
+---
+
+--pack-dir ::
+   Use given directory for the location of packfiles, pack-indexes,
+   and MIDX files.
+
+--write::
+   If specified, write a new midx file to the pack directory using
+   the packfiles present. Outputs the hash of the result midx file.
+
+EXAMPLES
+
+
+* Write a MIDX file for the packfiles in your local .git folder.
++
+
+$ git midx --write
+
+
+* Write a MIDX file for the packfiles in a different folder
++
+-
+$ git midx --write --pack-dir ../../alt/pack/
+-
+
+CONFIGURATION
+-
+
+core.midx::
+   The midx command will fail if core.midx is false.
+   Also, the written MIDX files will be ignored by other commands
+   unless core.midx is true.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index d0d810951f..5c458705c1 100644
--- a/Makefile
+++ b/Makefile
@@ -980,6 +980,7 @@ BUILTIN_OBJS += builtin/merge-index.o
 BUILTIN_OBJS += builtin/merge-ours.o
 BUILTIN_OBJS += builtin/merge-recursive.o
 BUILTIN_OBJS += builtin/merge-tree.o
+BUILTIN_OBJS += builtin/midx.o
 BUILTIN_OBJS += builtin/mktag.o
 BUILTIN_OBJS += builtin/mktree.o
 BUILTIN_OBJS += builtin/mv.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..880383e341 100644
--- a/builtin.h
+++ b/builtin.h
@@ -188,6 +188,7 @@ extern int cmd_merge_ours(int argc, const char **argv, 
const char *prefix);
 extern int cmd_merge_file(int argc, const char **argv, const char *prefix);
 extern int cmd_merge_recursive(int argc, const char **argv, const char 
*prefix);
 extern int cmd_merge_tree(int argc, const char **argv, const char *prefix);
+extern int cmd_midx(int argc, const char **argv, const char *prefix);
 extern int cmd_mktag(int argc, const char **argv, const char *prefix);
 extern int cmd_mktree(int argc, const char **argv, const char *prefix);
 extern int cmd_mv(int argc, const char **argv, const char *prefix);
diff --git a/builtin/midx.c b/builtin/midx.c
new file mode 100644
index 00..4aae14cf8e
--- /dev/null
+++ b/builtin/midx.c
@@ -0,0 +1,195 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "packfile.h"
+#include "parse-options.h"
+#include "midx.h"
+
+static char const * const builtin_midx_usage[] = {
+   N_("git midx --write [--pack-dir ]"),
+   NULL
+};
+
+static struct opts_midx {
+   const char *pack_dir;
+   int write;
+} opts;
+
+static int build_midx_from_packs(
+   const char *pack_dir,
+   const char **pack_names, uint32_t nr_packs,
+   const char **midx_id)
+{
+   struct packed_git **packs;
+   const char **installed_pack_names;
+   uint32_t i, j, nr_install

[RFC PATCH 07/18] midx: teach midx --write to update midx-head

2018-01-07 Thread Derrick Stolee
There may be multiple MIDX files in a single pack directory. The primary
file is pointed to by a pointer file "midx-head" that contains an OID.
The MIDX file to load is then given by "midx-<OID>.midx".

This head file will be especially important when the MIDX files are
extended to be incremental and we expect multiple MIDX files at any
point.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
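Note (not part of the commit message): a minimal sketch of how the OID read
from midx-head maps to a MIDX filename. `midx_path_from_hex` is a
hypothetical helper using snprintf(); the series builds the same path with a
strbuf in get_midx_filename_oid().

```c
#include <stdio.h>

/*
 * Build "<pack_dir>/midx-<hex>.midx" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int midx_path_from_hex(char *buf, size_t len,
			      const char *pack_dir, const char *oid_hex)
{
	int n = snprintf(buf, len, "%s/midx-%s.midx", pack_dir, oid_hex);

	/* snprintf reports the length it wanted; reject truncation. */
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```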
 Documentation/git-midx.txt | 19 ++-
 builtin/midx.c | 32 ++--
 midx.c | 31 +++
 midx.h |  3 +++
 t/t5318-midx.sh| 33 ++---
 5 files changed, 104 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 17464222c1..01f79cbba5 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 
 [verse]
-'git midx' --write [--pack-dir ]
+'git midx' --write  [--pack-dir ]
 
 DESCRIPTION
 ---
@@ -26,15 +26,32 @@ OPTIONS
If specified, write a new midx file to the pack directory using
the packfiles present. Outputs the hash of the result midx file.
 
+--update-head::
+   If specified with --write, update the midx-head file to point to
+   the written midx file.
+
 EXAMPLES
 
 
+* Read the midx-head file and output the OID of the head MIDX file.
++
+
+$ git midx
+
+
 * Write a MIDX file for the packfiles in your local .git folder.
 +
 
 $ git midx --write
 
 
+* Write a MIDX file for the packfiles in your local .git folder and
+  update the midx-head file.
++
+
+$ git midx --write --update-head
+
+
 * Write a MIDX file for the packfiles in a different folder
 +
 -
diff --git a/builtin/midx.c b/builtin/midx.c
index 4aae14cf8e..84ce6588a2 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -9,13 +9,17 @@
 #include "midx.h"
 
 static char const * const builtin_midx_usage[] = {
-   N_("git midx --write [--pack-dir ]"),
+   N_("git midx [--pack-dir ]"),
+   N_("git midx --write [--update-head] [--pack-dir ]"),
NULL
 };
 
 static struct opts_midx {
const char *pack_dir;
int write;
+   int update_head;
+   int has_existing;
+   struct object_id old_midx_oid;
 } opts;
 
 static int build_midx_from_packs(
@@ -109,6 +113,22 @@ static int build_midx_from_packs(
return 0;
 }
 
+static void update_head_file(const char *pack_dir, const char *midx_id)
+{
+   int fd;
+   struct lock_file lk = LOCK_INIT;
+   char *head_path = get_midx_head_filename(pack_dir);
+
+   fd = hold_lock_file_for_update(&lk, head_path, LOCK_DIE_ON_ERROR);
+   FREE_AND_NULL(head_path);
+
+   if (fd < 0)
+   die_errno("unable to open midx-head");
+
+   write_in_full(fd, midx_id, GIT_MAX_HEXSZ);
+   commit_lock_file(&lk);
+}
+
 static int midx_write(void)
 {
const char **pack_names = NULL;
@@ -152,6 +172,9 @@ static int midx_write(void)
 
printf("%s\n", midx_id);
 
+   if (opts.update_head)
+   update_head_file(opts.pack_dir, midx_id);
+
 cleanup:
if (pack_names)
FREE_AND_NULL(pack_names);
@@ -166,6 +189,8 @@ int cmd_midx(int argc, const char **argv, const char 
*prefix)
N_("The pack directory containing set of packfile and 
pack-index pairs.") },
OPT_BOOL('w', "write", &opts.write,
N_("write midx file")),
+   OPT_BOOL('u', "update-head", &opts.update_head,
+   N_("update midx-head to written midx file")),
OPT_END(),
};
 
@@ -187,9 +212,12 @@ int cmd_midx(int argc, const char **argv, const char 
*prefix)
opts.pack_dir = strbuf_detach(, NULL);
}
 
+   opts.has_existing = !!get_midx_head_oid(opts.pack_dir, &opts.old_midx_oid);
+
if (opts.write)
return midx_write();
 
-   usage_with_options(builtin_midx_usage, builtin_midx_options);
+   if (opts.has_existing)
+   printf("%s\n", oid_to_hex(&opts.old_midx_oid));
return 0;
 }
diff --git a/midx.c b/midx.c
index 5c320726ed..f4178c1b81 100644
--- a/midx.c
+++ b/midx.c
@@ -34,6 +34,37 @@ char* get_midx_filename_oid(const char *pack_dir,
return strbuf_detach(&head_path, NULL);
 }
 
+char *get_midx_head_filename(const char *pack_dir)
+{
+   struct strbuf head_filename = 

[RFC PATCH 17/18] sha1_name: use midx for abbreviations

2018-01-07 Thread Derrick Stolee
Create unique_in_midx() to mimic behavior of unique_in_pack().

Create find_abbrev_len_for_midx() to mimic behavior of
find_abbrev_len_for_pack().

Consume these methods when interacting with abbreviations.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
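Note (not part of the commit message): a minimal sketch of why looking at
the sorted neighbours of a hash is enough to bound its abbreviation length.
`common_hex_prefix` and `min_abbrev_len` are hypothetical helpers, and the
sketch assumes the three hashes are distinct; the patch itself feeds the
neighbours to extend_abbrev_len() instead.

```c
#include <stddef.h>

#define HASH_LEN 20

/* Number of leading hex digits (nibbles) shared by two hashes. */
static unsigned common_hex_prefix(const unsigned char *a,
				  const unsigned char *b)
{
	unsigned i;

	for (i = 0; i < HASH_LEN; i++) {
		if (a[i] == b[i])
			continue;
		/* Same high nibble means one more shared hex digit. */
		return 2 * i + ((a[i] >> 4) == (b[i] >> 4));
	}
	return 2 * HASH_LEN;
}

/*
 * Shortest unambiguous hex prefix length of `hash` given its sorted
 * neighbours; either neighbour may be NULL.
 */
static unsigned min_abbrev_len(const unsigned char *hash,
			       const unsigned char *prev,
			       const unsigned char *next)
{
	unsigned len = 1, n;

	if (prev && (n = common_hex_prefix(hash, prev) + 1) > len)
		len = n;
	if (next && (n = common_hex_prefix(hash, next) + 1) > len)
		len = n;
	return len > 2 * HASH_LEN ? 2 * HASH_LEN : len;
}
```

In a sorted list, any object sharing a long prefix with `hash` sits next to
it, so only the immediate neighbours need to be inspected.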
 sha1_name.c | 70 +++--
 1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..2f426e136e 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -10,6 +10,7 @@
 #include "dir.h"
 #include "sha1-array.h"
 #include "packfile.h"
+#include "midx.h"
 
 static int get_oid_oneline(const char *, struct object_id *, struct 
commit_list *);
 
@@ -190,11 +191,40 @@ static void unique_in_pack(struct packed_git *p,
}
 }
 
+static void unique_in_midx(struct midxed_git *m,
+  struct disambiguate_state *ds)
+{
+   uint32_t num, i, first = 0;
+   const struct object_id *current = NULL;
+
+   if (!m->num_objects)
+   return;
+
+   num = m->num_objects;
+   bsearch_midx(m, ds->bin_pfx.hash, &first);
+
+   /*
+* At this point, "first" is the location of the lowest object
+* with an object name that could match "bin_pfx".  See if we have
+* 0, 1 or more objects that actually match(es).
+*/
+   for (i = first; i < num && !ds->ambiguous; i++) {
+   struct object_id oid;
+   current = nth_midxed_object_oid(&oid, m, i);
+   if (!match_sha(ds->len, ds->bin_pfx.hash, current->hash))
+   break;
+   update_candidates(ds, current);
+   }
+}
+
 static void find_short_packed_object(struct disambiguate_state *ds)
 {
struct packed_git *p;
+   struct midxed_git *m;
 
-   prepare_packed_git();
+   prepare_packed_git_internal(1);
+   for (m = midxed_git; m && !ds->ambiguous; m = m->next)
+   unique_in_midx(m, ds);
for (p = packed_git; p && !ds->ambiguous; p = p->next)
unique_in_pack(p, ds);
 }
@@ -508,6 +538,39 @@ static int extend_abbrev_len(const struct object_id *oid, 
void *cb_data)
return 0;
 }
 
+static void find_abbrev_len_for_midx(struct midxed_git *m,
+struct min_abbrev_data *mad)
+{
+   int match = 0;
+   uint32_t first = 0;
+   struct object_id oid;
+
+   if (!m->num_objects)
+   return;
+
+   match = bsearch_midx(m, mad->hash, &first);
+
+   /*
+* first is now the position in the packfile where we would insert
+* mad->hash if it does not exist (or the position of mad->hash if
+* it does exist). Hence, we consider a maximum of three objects
+* nearby for the abbreviation length.
+*/
+   mad->init_len = 0;
+   if (!match) {
+   nth_midxed_object_oid(&oid, m, first);
+   extend_abbrev_len(&oid, mad);
+   } else if (first < m->num_objects - 1) {
+   nth_midxed_object_oid(&oid, m, first + 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   if (first > 0) {
+   nth_midxed_object_oid(&oid, m, first - 1);
+   extend_abbrev_len(&oid, mad);
+   }
+   mad->init_len = mad->cur_len;
+}
+
 static void find_abbrev_len_for_pack(struct packed_git *p,
 struct min_abbrev_data *mad)
 {
@@ -563,8 +626,11 @@ static void find_abbrev_len_for_pack(struct packed_git *p,
 static void find_abbrev_len_packed(struct min_abbrev_data *mad)
 {
struct packed_git *p;
+   struct midxed_git *m;
 
-   prepare_packed_git();
+   prepare_packed_git_internal(1);
+   for (m = midxed_git; m; m = m->next)
+   find_abbrev_len_for_midx(m, mad);
for (p = packed_git; p; p = p->next)
find_abbrev_len_for_pack(p, mad);
 }
-- 
2.15.0



[RFC PATCH 14/18] midx: load midx files when loading packs

2018-01-07 Thread Derrick Stolee
Replace prepare_packed_git() with prepare_packed_git_internal(use_midx) to
give some consumers of prepare_packed_git() a way to load MIDX files.
Consumers should only use the new method if they are prepared to use the
midxed_git struct alongside the packed_git struct.

If a packfile is found that is not referenced by the current MIDX, then add
it to the packed_git struct. This is important to keep the MIDX useful after
adding packs due to "fetch" commands and when third-party tools (such as
libgit2) add packs directly to the repo.

If prepare_packed_git_internal is called with use_midx = 0, then unload the
MIDX file and reload the packfiles into the packed_git struct.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 57 +++
 midx.h |  6 --
 packfile.c | 64 +-
 packfile.h |  1 +
 4 files changed, 117 insertions(+), 11 deletions(-)

diff --git a/midx.c b/midx.c
index 3ce2b736ea..a66763b9e3 100644
--- a/midx.c
+++ b/midx.c
@@ -22,6 +22,9 @@
 
 #define MIDX_LARGE_OFFSET_NEEDED 0x80000000
 
+/* MIDX-git global storage */
+struct midxed_git *midxed_git = 0;
+
 char* get_midx_filename_oid(const char *pack_dir,
struct object_id *oid)
 {
@@ -197,6 +200,45 @@ struct midxed_git *get_midxed_git(const char *pack_dir, 
struct object_id *oid)
return m;
 }
 
+static char* get_midx_filename_dir(const char *pack_dir)
+{
+   struct object_id oid;
+   if (!get_midx_head_oid(pack_dir, &oid))
+   return 0;
+
+   return get_midx_filename_oid(pack_dir, &oid);
+}
+
+static int prepare_midxed_git_head(char *pack_dir, int local)
+{
+   struct midxed_git *m = midxed_git;
+   char *midx_head_path = get_midx_filename_dir(pack_dir);
+
+   if (!core_midx)
+   return 1;
+
+   if (midx_head_path) {
+   midxed_git = load_midxed_git_one(midx_head_path, pack_dir);
+   midxed_git->next = m;
+   FREE_AND_NULL(midx_head_path);
+   return 1;
+   }
+
+   return 0;
+}
+
+int prepare_midxed_git_objdir(char *obj_dir, int local)
+{
+   int ret;
+   struct strbuf pack_dir = STRBUF_INIT;
+   strbuf_addstr(&pack_dir, obj_dir);
+   strbuf_add(&pack_dir, "/pack", 5);
+
+   ret = prepare_midxed_git_head(pack_dir.buf, local);
+   strbuf_release(&pack_dir);
+   return ret;
+}
+
 struct pack_midx_details_internal {
uint32_t pack_int_id;
uint32_t internal_offset;
@@ -677,3 +719,18 @@ int close_midx(struct midxed_git *m)
 
return 1;
 }
+
+void close_all_midx(void)
+{
+   struct midxed_git *m = midxed_git;
+   struct midxed_git *next;
+
+   while (m) {
+   next = m->next;
+   close_midx(m);
+   free(m);
+   m = next;
+   }
+
+   midxed_git = 0;
+}
diff --git a/midx.h b/midx.h
index 27d48163e9..d8ede8121c 100644
--- a/midx.h
+++ b/midx.h
@@ -27,7 +27,7 @@ struct pack_midx_header {
uint32_t num_packs;
 };
 
-struct midxed_git {
+extern struct midxed_git {
struct midxed_git *next;
 
int midx_fd;
@@ -81,9 +81,10 @@ struct midxed_git {
 
/* something like ".git/objects/pack" */
char pack_dir[FLEX_ARRAY]; /* more */
-};
+} *midxed_git;
 
 extern struct midxed_git *get_midxed_git(const char *pack_dir, struct 
object_id *oid);
+extern int prepare_midxed_git_objdir(char *obj_dir, int local);
 
 struct pack_midx_details {
uint32_t pack_int_id;
@@ -118,5 +119,6 @@ extern const char *write_midx_file(const char *pack_dir,
   uint32_t nr_objects);
 
 extern int close_midx(struct midxed_git *m);
+extern void close_all_midx(void);
 
 #endif
diff --git a/packfile.c b/packfile.c
index c36420b33f..1c0822878b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -8,6 +8,7 @@
 #include "list.h"
 #include "streaming.h"
 #include "sha1-lookup.h"
+#include "midx.h"
 
 char *odb_pack_name(struct strbuf *buf,
const unsigned char *sha1,
@@ -309,10 +310,22 @@ void close_pack(struct packed_git *p)
 void close_all_packs(void)
 {
struct packed_git *p;
+   struct midxed_git *m;
+
+   for (m = midxed_git; m; m = m->next) {
+   int i;
+   for (i = 0; i < m->num_packs; i++) {
+   p = m->packs[i];
+   if (p && p->do_not_close)
+   BUG("want to close pack marked 'do-not-close'");
+   else if (p)
+   close_pack(p);
+   }
+   }
 
for (p = packed_git; p; p = p->next)
if (p->do_not_close)
-   die("BUG: want to close pack marked 'do-not-close'");
+   BUG("wa

[RFC PATCH 12/18] midx: teach git-midx to delete expired files

2018-01-07 Thread Derrick Stolee
As we write new MIDX files, the file previously pointed to by midx-head is
usually no longer needed. Supply the "--delete-expired" flag to remove it
during the "--write" subcommand.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-midx.txt |  4 
 builtin/midx.c | 15 ++-
 midx.c | 26 ++
 midx.h |  2 ++
 packfile.c |  2 +-
 packfile.h |  1 +
 t/t5318-midx.sh|  9 ++---
 7 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index c184d3a593..4635247d0d 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -43,6 +43,10 @@ OPTIONS
If specified with --write, update the midx-head file to point to
the written midx file.
 
+--delete-expired::
+   If specified with --write and --update-head, delete the midx file
+   previously pointed to by midx-head (if changed).
+
 EXAMPLES
 
 
diff --git a/builtin/midx.c b/builtin/midx.c
index b30ef36ff8..6f56f39390 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -10,7 +10,7 @@
 
 static char const * const builtin_midx_usage[] = {
N_("git midx [--pack-dir ]"),
-   N_("git midx --write [--update-head] [--pack-dir ]"),
+   N_("git midx --write [--update-head [--delete-expired]] [--pack-dir 
]"),
N_("git midx --clear [--pack-dir ]"),
NULL
 };
@@ -22,6 +22,7 @@ static struct opts_midx {
const char *midx_id;
int write;
int update_head;
+   int delete_expired;
int has_existing;
struct object_id old_midx_oid;
 } opts;
@@ -276,6 +277,16 @@ static int midx_write(void)
if (opts.update_head)
update_head_file(opts.pack_dir, midx_id);
 
+   if (opts.delete_expired && opts.update_head && opts.has_existing &&
+   strcmp(midx_id, oid_to_hex(&opts.old_midx_oid))) {
+   char *old_path = get_midx_filename_oid(opts.pack_dir, &opts.old_midx_oid);
+   close_midx(midx);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+
+   free(old_path);
+   }
+
 cleanup:
if (pack_names)
FREE_AND_NULL(pack_names);
@@ -300,6 +311,8 @@ int cmd_midx(int argc, const char **argv, const char 
*prefix)
N_("write midx file")),
OPT_BOOL('u', "update-head", &opts.update_head,
N_("update midx-head to written midx file")),
+   OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+   N_("delete expired head midx file")),
OPT_END(),
};
 
diff --git a/midx.c b/midx.c
index 53eb29dac3..3ce2b736ea 100644
--- a/midx.c
+++ b/midx.c
@@ -651,3 +651,29 @@ const char *write_midx_file(const char *pack_dir,
 
return final_hex;
 }
+
+int close_midx(struct midxed_git *m)
+{
+   int i;
+   if (m->midx_fd < 0)
+   return 0;
+
+   for (i = 0; i < m->num_packs; i++) {
+   if (m->packs[i]) {
+   close_pack(m->packs[i]);
+   free(m->packs[i]);
+   m->packs[i] = NULL;
+   }
+   }
+
+   munmap((void *)m->data, m->data_len);
+   m->data = 0;
+
+   close(m->midx_fd);
+   m->midx_fd = -1;
+
+   free(m->packs);
+   free(m->pack_names);
+
+   return 1;
+}
diff --git a/midx.h b/midx.h
index 1e7a94651c..27d48163e9 100644
--- a/midx.h
+++ b/midx.h
@@ -117,4 +117,6 @@ extern const char *write_midx_file(const char *pack_dir,
   struct pack_midx_entry **objects,
   uint32_t nr_objects);
 
+extern int close_midx(struct midxed_git *m);
+
 #endif
diff --git a/packfile.c b/packfile.c
index 4a5fe7ab18..c36420b33f 100644
--- a/packfile.c
+++ b/packfile.c
@@ -299,7 +299,7 @@ void close_pack_index(struct packed_git *p)
}
 }
 
-static void close_pack(struct packed_git *p)
+void close_pack(struct packed_git *p)
 {
close_pack_windows(p);
close_pack_fd(p);
diff --git a/packfile.h b/packfile.h
index 0cdeb54dcd..7cf4771029 100644
--- a/packfile.h
+++ b/packfile.h
@@ -61,6 +61,7 @@ extern void close_pack_index(struct packed_git *);
 
 extern unsigned char *use_pack(struct packed_git *, struct pack_window **, 
off_t, unsigned long *);
 extern void close_pack_windows(struct packed_git *);
+extern void close_pack(struct packed_git *p);
 extern void close_all_packs(void);
 extern void unuse_pack(struct pack_window **);
 extern void clear_delta_base_cache(void);
diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
index 9337355ab3..42d103c879 100755
--- a/t/t5318-midx.sh
+++ 

[RFC PATCH 10/18] midx: use existing midx when writing

2018-01-07 Thread Derrick Stolee
When writing a new MIDX file, it is faster to use an existing MIDX file
to load the object list and pack offsets and to only inspect pack-indexes
for packs not already covered by the MIDX file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
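Note (not part of the commit message): a minimal sketch of the "only inspect
packs not already covered" step. `packs_to_scan` and `cmp_name` are
hypothetical helpers built on the standard bsearch(); the patch uses its own
contains_pack() binary search. The sketch assumes the covered names are
sorted with strcmp() ordering.

```c
#include <stdlib.h>
#include <string.h>

/* bsearch() comparator: key is a name, elem points at an array slot. */
static int cmp_name(const void *key, const void *elem)
{
	return strcmp((const char *)key, *(const char *const *)elem);
}

/*
 * Copy into `todo` the names from `found` that are absent from the
 * sorted array `covered`. Returns the number of names copied.
 * `todo` must have room for nr_found entries.
 */
static size_t packs_to_scan(const char **found, size_t nr_found,
			    const char **covered, size_t nr_covered,
			    const char **todo)
{
	size_t i, nr = 0;

	for (i = 0; i < nr_found; i++)
		if (!bsearch(found[i], covered, nr_covered,
			     sizeof(*covered), cmp_name))
			todo[nr++] = found[i];
	return nr;
}
```

Only the names left in `todo` need their pack-indexes opened; objects for
the covered packs come from the existing MIDX file.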
 builtin/midx.c | 34 +++---
 midx.c | 23 +++
 midx.h |  2 ++
 3 files changed, 56 insertions(+), 3 deletions(-)

diff --git a/builtin/midx.c b/builtin/midx.c
index ee9234583d..aff2085771 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -73,7 +73,7 @@ static int midx_read(void)
 static int build_midx_from_packs(
const char *pack_dir,
const char **pack_names, uint32_t nr_packs,
-   const char **midx_id)
+   const char **midx_id, struct midxed_git *midx)
 {
struct packed_git **packs;
const char **installed_pack_names;
@@ -86,6 +86,9 @@ static int build_midx_from_packs(
struct strbuf pack_path = STRBUF_INIT;
int baselen;
 
+   if (midx)
+   nr_total_packs += midx->num_packs;
+
if (!nr_total_packs) {
*midx_id = NULL;
return 0;
@@ -94,6 +97,12 @@ static int build_midx_from_packs(
ALLOC_ARRAY(packs, nr_total_packs);
ALLOC_ARRAY(installed_pack_names, nr_total_packs);
 
+   if (midx) {
+   for (i = 0; i < midx->num_packs; i++)
+   installed_pack_names[nr_installed_packs++] = 
midx->pack_names[i];
+   pack_offset = midx->num_packs;
+   }
+
strbuf_addstr(&pack_path, pack_dir);
strbuf_addch(&pack_path, '/');
baselen = pack_path.len;
@@ -101,6 +110,9 @@ static int build_midx_from_packs(
strbuf_setlen(&pack_path, baselen);
strbuf_addstr(&pack_path, pack_names[i]);
 
+   if (midx && contains_pack(midx, pack_names[i]))
+   continue;
+
strbuf_strip_suffix(&pack_path, ".pack");
strbuf_addstr(&pack_path, ".idx");
 
@@ -120,13 +132,24 @@ static int build_midx_from_packs(
if (!nr_objects || !nr_installed_packs) {
FREE_AND_NULL(packs);
FREE_AND_NULL(installed_pack_names);
-   *midx_id = NULL;
+
+   if (opts.has_existing)
+   *midx_id = oid_to_hex(&opts.old_midx_oid);
+   else
+   *midx_id = NULL;
+
return 0;
}
 
+   if (midx)
+   nr_objects += midx->num_objects;
+
ALLOC_ARRAY(objects, nr_objects);
nr_objects = 0;
 
+   for (i = 0; midx && i < midx->num_objects; i++)
+   nth_midxed_object_entry(midx, i, &objects[nr_objects++]);
+
for (i = pack_offset; i < nr_installed_packs; i++) {
struct packed_git *p = packs[i];
 
@@ -184,6 +207,10 @@ static int midx_write(void)
const char *midx_id = 0;
DIR *dir;
struct dirent *de;
+   struct midxed_git *midx = NULL;
+
+   if (opts.has_existing)
+   midx = get_midxed_git(opts.pack_dir, &opts.old_midx_oid);
 
dir = opendir(opts.pack_dir);
if (!dir) {
@@ -212,7 +239,8 @@ static int midx_write(void)
if (!nr_packs)
goto cleanup;
 
-   if (build_midx_from_packs(opts.pack_dir, pack_names, nr_packs, &midx_id))
+   if (build_midx_from_packs(opts.pack_dir, pack_names,
+ nr_packs, &midx_id, midx))
die("failed to build MIDX");
 
if (midx_id == NULL)
diff --git a/midx.c b/midx.c
index 4e0df0285a..53eb29dac3 100644
--- a/midx.c
+++ b/midx.c
@@ -257,6 +257,29 @@ const struct object_id *nth_midxed_object_oid(struct 
object_id *oid,
return oid;
 }
 
+int contains_pack(struct midxed_git *m, const char *pack_name)
+{
+   uint32_t first = 0, last = m->num_packs;
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const char *current;
+   int cmp;
+
+   current = m->pack_names[mid];
+   cmp = strcmp(pack_name, current);
+   if (!cmp)
+   return 1;
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   return 0;
+}
+
 static int midx_sha1_compare(const void *_a, const void *_b)
 {
struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
diff --git a/midx.h b/midx.h
index 9255909ae8..1e7a94651c 100644
--- a/midx.h
+++ b/midx.h
@@ -100,6 +100,8 @@ extern const struct object_id *nth_midxed_object_oid(struct 
object_id *oid,
 struct midxed_git *m,
 uint32_t n);
 
+extern int contains_pack(struct midxed_git *m, const char *pack_name);
+
 /*
  * Write a single MI

[RFC PATCH 09/18] midx: find details of nth object in midx

2018-01-07 Thread Derrick Stolee
The MIDX file stores pack offset information for a list of objects. The
nth_midxed_object_* methods provide ways to extract this information.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
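Note (not part of the commit message): a minimal sketch of decoding one
entry of the object-offsets chunk, including the indirection into the
large-offsets chunk. `get_be32`, `nth_offset`, and `LARGE_OFFSET_FLAG` are
hypothetical names; the flag value assumes the most significant bit of the
32-bit offset field marks a large offset. Reading byte by byte avoids
depending on the alignment of the mapped file.

```c
#include <stddef.h>
#include <stdint.h>

#define LARGE_OFFSET_FLAG 0x80000000u /* assumed: MSB of the 32-bit field */

/* Read a 32-bit big-endian value without alignment requirements. */
static uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode entry n of an object-offsets table (8 bytes per entry:
 * pack id, then offset). If the flag bit is set and a large-offset
 * table is present, the remaining bits index a 64-bit entry there.
 */
static uint64_t nth_offset(const unsigned char *offsets,
			   const unsigned char *large, /* may be NULL */
			   uint32_t n, uint32_t *pack_id)
{
	const unsigned char *e = offsets + 8 * (size_t)n;
	uint32_t off = get_be32(e + 4);

	*pack_id = get_be32(e);
	if (large && (off & LARGE_OFFSET_FLAG)) {
		const unsigned char *l =
			large + 8 * (size_t)(off ^ LARGE_OFFSET_FLAG);
		return ((uint64_t)get_be32(l) << 32) | get_be32(l + 4);
	}
	return off;
}
```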
 midx.c | 55 +++
 midx.h | 15 +++
 2 files changed, 70 insertions(+)

diff --git a/midx.c b/midx.c
index c631be451f..4e0df0285a 100644
--- a/midx.c
+++ b/midx.c
@@ -202,6 +202,61 @@ struct pack_midx_details_internal {
uint32_t internal_offset;
 };
 
+struct pack_midx_details *nth_midxed_object_details(struct midxed_git *m,
+   uint32_t n,
+   struct pack_midx_details *d)
+{
+   struct pack_midx_details_internal *d_internal;
+   const unsigned char *details = m->chunk_object_offsets;
+
+   if (n >= m->num_objects)
+   return NULL;
+
+   d_internal = (struct pack_midx_details_internal*)(details + 8 * n);
+   d->pack_int_id = ntohl(d_internal->pack_int_id);
+   d->offset = ntohl(d_internal->internal_offset);
+
+   if (m->chunk_large_offsets && d->offset & MIDX_LARGE_OFFSET_NEEDED) {
+   uint32_t large_offset = d->offset ^ MIDX_LARGE_OFFSET_NEEDED;
+   const unsigned char *large_offsets = m->chunk_large_offsets + 8 * large_offset;
+
+   d->offset = (((uint64_t)ntohl(*((uint32_t *)(large_offsets + 0)))) << 32) |
+ntohl(*((uint32_t *)(large_offsets + 4)));
+   }
+
+   return d;
+}
+
+struct pack_midx_entry *nth_midxed_object_entry(struct midxed_git *m,
+   uint32_t n,
+   struct pack_midx_entry *e)
+{
+   struct pack_midx_details details;
+   const unsigned char *index = m->chunk_oid_lookup;
+
+   if (!nth_midxed_object_details(m, n, &details))
+   return NULL;
+
+   memcpy(e->oid.hash, index + m->hdr->hash_len * n, m->hdr->hash_len);
+   e->pack_int_id = details.pack_int_id;
+   e->offset = details.offset;
+
+   return e;
+}
+
+const struct object_id *nth_midxed_object_oid(struct object_id *oid,
+ struct midxed_git *m,
+ uint32_t n)
+{
+   struct pack_midx_entry e;
+
+   if (!nth_midxed_object_entry(m, n, &e))
+   return 0;
+
+   hashcpy(oid->hash, e.oid.hash);
+   return oid;
+}
+
 static int midx_sha1_compare(const void *_a, const void *_b)
 {
struct pack_midx_entry *a = *(struct pack_midx_entry **)_a;
diff --git a/midx.h b/midx.h
index 92b74e49db..9255909ae8 100644
--- a/midx.h
+++ b/midx.h
@@ -85,6 +85,21 @@ struct midxed_git {
 
 extern struct midxed_git *get_midxed_git(const char *pack_dir, struct 
object_id *oid);
 
+struct pack_midx_details {
+   uint32_t pack_int_id;
+   off_t offset;
+};
+
+extern struct pack_midx_details *nth_midxed_object_details(struct midxed_git 
*m,
+  uint32_t n,
+  struct 
pack_midx_details *d);
+extern struct pack_midx_entry *nth_midxed_object_entry(struct midxed_git *m,
+  uint32_t n,
+  struct pack_midx_entry 
*e);
+extern const struct object_id *nth_midxed_object_oid(struct object_id *oid,
+struct midxed_git *m,
+uint32_t n);
+
 /*
  * Write a single MIDX file storing the given entries for the
  * given list of packfiles. If midx_name is null, then a temp
-- 
2.15.0



[RFC PATCH 08/18] midx: teach git-midx to read midx file details

2018-01-07 Thread Derrick Stolee
Commentary: I included the pack directory of the MIDX file as a FLEX_ARRAY
at the end of the midxed_git struct, similar to how the pack name appears
at the end of the packed_git struct. A colleague mentioned this pattern is
confusing and possibly dangerous, so I should consider changing it. If there
is no strong reason to keep the FLEX_ARRAY, I will modify the struct before
the v1 patch to use a char*.

-- >8 --

Add a "--read" subcommand to the midx builtin to report summary information
on the head MIDX file or a MIDX file specified by the supplied "--midx-id"
parameter.

This subcommand is used by t5318-midx.sh to verify the indexed objects are
as expected.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-midx.txt |  23 +++-
 builtin/midx.c |  59 
 midx.c | 132 +
 midx.h |  58 
 t/t5318-midx.sh|  79 +++
 5 files changed, 328 insertions(+), 23 deletions(-)

diff --git a/Documentation/git-midx.txt b/Documentation/git-midx.txt
index 01f79cbba5..3eeed1d969 100644
--- a/Documentation/git-midx.txt
+++ b/Documentation/git-midx.txt
@@ -9,7 +9,7 @@ git-midx - Write and verify multi-pack-indexes (MIDX files).
 SYNOPSIS
 
 [verse]
-'git midx' --write  [--pack-dir ]
+'git midx' [--write|--read]  [--pack-dir ]
 
 DESCRIPTION
 ---
@@ -22,9 +22,18 @@ OPTIONS
Use given directory for the location of packfiles, pack-indexes,
and MIDX files.
 
+--read::
+   If specified, read a midx file specified by the midx-head file
+   and output basic details about the midx file. (Cannot be combined
+   with --write.)
+
+--midx-id ::
+   If specified with --read, use the given oid to read midx-[oid].midx
+   instead of using midx-head.
 --write::
If specified, write a new midx file to the pack directory using
the packfiles present. Outputs the hash of the result midx file.
+   (Cannot be combined with --read.)
 
 --update-head::
If specified with --write, update the midx-head file to point to
@@ -58,6 +67,18 @@ $ git midx --write --update-head
 $ git midx --write --pack-dir ../../alt/pack/
 -
 
+* Read the current midx-head.
++
+---
+$ git midx --read
+---
+
+* Read a specific MIDX file in the local .git folder.
++
+
+$ git midx --read --midx-id 3e50d982a2257168c7fd0ff12ffe5cf6af38c74e
+
+
 CONFIGURATION
 -
 
diff --git a/builtin/midx.c b/builtin/midx.c
index 84ce6588a2..ee9234583d 100644
--- a/builtin/midx.c
+++ b/builtin/midx.c
@@ -16,12 +16,60 @@ static char const * const builtin_midx_usage[] = {
 
 static struct opts_midx {
const char *pack_dir;
+   int read;
+   const char *midx_id;
int write;
int update_head;
int has_existing;
struct object_id old_midx_oid;
 } opts;
 
+static int midx_read(void)
+{
+   struct object_id midx_oid;
+   struct midxed_git *midx;
+   uint32_t i;
+
+   if (opts.midx_id && strlen(opts.midx_id) == GIT_MAX_HEXSZ)
+   get_oid_hex(opts.midx_id, &midx_oid);
+   else if (!get_midx_head_oid(opts.pack_dir, &midx_oid))
+   die("No midx-head exists.");
+
+   midx = get_midxed_git(opts.pack_dir, &midx_oid);
+
+   printf("header: %08x %x %d %d %d %d %d\n",
+   ntohl(midx->hdr->midx_signature),
+   ntohl(midx->hdr->midx_version),
+   midx->hdr->hash_version,
+   midx->hdr->hash_len,
+   midx->hdr->num_base_midx,
+   midx->hdr->num_chunks,
+   ntohl(midx->hdr->num_packs));
+   printf("num_objects: %d\n", midx->num_objects);
+   printf("chunks:");
+
+   if (midx->chunk_pack_lookup)
+   printf(" pack_lookup");
+   if (midx->chunk_pack_names)
+   printf(" pack_names");
+   if (midx->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (midx->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (midx->chunk_object_offsets)
+   printf(" object_offsets");
+   if (midx->chunk_large_offsets)
+   printf(" large_offsets");
+   printf("\n");
+
+   printf("pack_names:\n");
+   for (i = 0; i < midx->num_packs; i++)
+   printf("%s\n", midx->pack_names[i]);
+
+   printf("pack_dir: %s\n", midx->pack_dir);
+   return 0;
+}
+
 st

[RFC PATCH 18/18] packfile: use midx for object loads

2018-01-07 Thread Derrick Stolee
When looking for a packed object, first check the MIDX for that object.
This reduces thrashing in the MRU list of packfiles.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 84 ++
 midx.h |  3 +++
 packfile.c |  5 +++-
 3 files changed, 91 insertions(+), 1 deletion(-)

diff --git a/midx.c b/midx.c
index 8c643caa92..4b2398b3ee 100644
--- a/midx.c
+++ b/midx.c
@@ -329,6 +329,90 @@ int bsearch_midx(struct midxed_git *m, const unsigned char 
*sha1, uint32_t *pos)
return 0;
 }
 
+static int prepare_midx_pack(struct midxed_git *m, uint32_t pack_int_id)
+{
+   struct strbuf pack_name = STRBUF_INIT;
+
+   if (pack_int_id >= m->hdr->num_packs)
+   return 1;
+
+   if (m->packs[pack_int_id])
+   return 0;
+
+   strbuf_addstr(&pack_name, m->pack_dir);
+   strbuf_addstr(&pack_name, "/");
+   strbuf_addstr(&pack_name, m->pack_names[pack_int_id]);
+   strbuf_strip_suffix(&pack_name, ".pack");
+   strbuf_addstr(&pack_name, ".idx");
+
+   m->packs[pack_int_id] = add_packed_git(pack_name.buf, pack_name.len, 1);
+   strbuf_release(&pack_name);
+   return !m->packs[pack_int_id];
+}
+
+static int find_pack_entry_midx(const unsigned char *sha1,
+   struct midxed_git *m,
+   struct packed_git **p,
+   off_t *offset)
+{
+   uint32_t pos;
+   struct pack_midx_details d;
+
+   if (!bsearch_midx(m, sha1, &pos) ||
+   !nth_midxed_object_details(m, pos, &d))
+   return 0;
+
+   if (d.pack_int_id >= m->num_packs)
+   die(_("bad pack-int-id %d"), d.pack_int_id);
+
+   /* load packfile, if necessary */
+   if (prepare_midx_pack(m, d.pack_int_id))
+   return 0;
+
+   *p = m->packs[d.pack_int_id];
+   *offset = d.offset;
+
+   return 1;
+}
+
+int fill_pack_entry_midx(const unsigned char *sha1,
+struct pack_entry *e)
+{
+   struct packed_git *p;
+   struct midxed_git *m;
+
+   if (!core_midx)
+   return 0;
+
+   m = midxed_git;
+   while (m)
+   {
+   off_t offset;
+   if (!find_pack_entry_midx(sha1, m, &p, &offset)) {
+   m = m->next;
+   continue;
+   }
+
+   /*
+   * We are about to tell the caller where they can locate the
+   * requested object.  We better make sure the packfile is
+   * still here and can be accessed before supplying that
+   * answer, as it may have been deleted since the MIDX was
+   * loaded!
+   */
+   if (!is_pack_valid(p))
+   return 0;
+
+   e->offset = offset;
+   e->p = p;
+   hashcpy(e->sha1, sha1);
+
+   return 1;
+   }
+
+   return 0;
+}
+
 int contains_pack(struct midxed_git *m, const char *pack_name)
 {
uint32_t first = 0, last = m->num_packs;
diff --git a/midx.h b/midx.h
index 5598799189..b7e8b15fe4 100644
--- a/midx.h
+++ b/midx.h
@@ -11,6 +11,9 @@ extern char *get_midx_head_filename(const char *pack_dir);
 
 extern struct object_id *get_midx_head_oid(const char *pack_dir, struct 
object_id *oid);
 
+extern int fill_pack_entry_midx(const unsigned char *sha1,
+   struct pack_entry *e);
+
 struct pack_midx_entry {
struct object_id oid;
uint32_t pack_int_id;
diff --git a/packfile.c b/packfile.c
index 866a1f30dd..9ec39a83e9 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1883,7 +1883,10 @@ int find_pack_entry(const unsigned char *sha1, struct 
pack_entry *e)
 {
struct mru_entry *p;
 
-   prepare_packed_git();
+   prepare_packed_git_internal(1);
+   if (fill_pack_entry_midx(sha1, e))
+   return 1;
+
if (!packed_git)
return 0;
 
-- 
2.15.0



[RFC PATCH 06/18] midx: add t5318-midx.sh test script

2018-01-07 Thread Derrick Stolee
Test interactions between the midx builtin and other Git operations.

Use both a full repo and a bare repo to ensure the pack directory
redirection works correctly.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 t/t5318-midx.sh | 100 
 1 file changed, 100 insertions(+)
 create mode 100755 t/t5318-midx.sh

diff --git a/t/t5318-midx.sh b/t/t5318-midx.sh
new file mode 100755
index 00..869bbea29c
--- /dev/null
+++ b/t/t5318-midx.sh
@@ -0,0 +1,100 @@
+#!/bin/sh
+
+test_description='multi-pack indexes'
+. ./test-lib.sh
+
+test_expect_success 'config' \
+'rm -rf .git &&
+ mkdir full &&
+ cd full &&
+ git init &&
+ git config core.midx true &&
+ git config pack.threads 1 &&
+ packdir=.git/objects/pack'
+
+test_expect_success 'write-midx with no packs' \
+'midx0=$(git midx --write) &&
+ test "a$midx0" = "a"'
+
+test_expect_success 'create objects' \
+'for i in $(test_seq 100)
+ do
+ echo $i >file-1-$i
+ done &&
+ git add file-* &&
+ test_tick &&
+ git commit -m "test data 1" &&
+ git branch commit1 HEAD'
+
+test_expect_success 'write-midx from index version 1' \
+'pack1=$(git rev-list --all --objects | git pack-objects --index-version=1 
${packdir}/test-1) &&
+ midx1=$(git midx --write) &&
+ test_path_is_file ${packdir}/midx-${midx1}.midx'
+
+test_expect_success 'write-midx from index version 2' \
+'rm "${packdir}/test-1-${pack1}.pack" &&
+ pack2=$(git rev-list --all --objects | git pack-objects --index-version=2 
${packdir}/test-2) &&
+ midx2=$(git midx --write) &&
+ test_path_is_file ${packdir}/midx-${midx2}.midx'
+
+test_expect_success 'Create more objects' \
+'for i in $(test_seq 100)
+ do
+ echo $i >file-2-$i
+ done &&
+ git add file-* &&
+ test_tick &&
+ git commit -m "test data 2" &&
+ git branch commit2 HEAD'
+
+test_expect_success 'write-midx with two packs' \
+'pack3=$(git rev-list --objects commit2 ^commit1 | git pack-objects 
--index-version=2 ${packdir}/test-3) &&
+ midx3=$(git midx --write) &&
+ test_path_is_file ${packdir}/midx-${midx3}.midx'
+
+test_expect_success 'Add more packs' \
+'for j in $(test_seq 10)
+ do
+ jjj=$(printf '%03i' $j)
+ test-genrandom "bar" 200 > wide_delta_$jjj &&
+ test-genrandom "baz $jjj" 50 >> wide_delta_$jjj &&
+ test-genrandom "foo"$j 100 > deep_delta_$jjj &&
+ test-genrandom "foo"$(expr $j + 1) 100 >> deep_delta_$jjj &&
+ test-genrandom "foo"$(expr $j + 2) 100 >> deep_delta_$jjj &&
+ echo $jjj >file_$jjj &&
+ test-genrandom "$jjj" 8192 >>file_$jjj &&
+ git update-index --add file_$jjj deep_delta_$jjj wide_delta_$jjj &&
+ { echo 101 && test-genrandom 100 8192; } >file_101 &&
+ git update-index --add file_101 &&
+ commit=$(git commit-tree $EMPTY_TREE -p HEADobj-list &&
+ echo commit_packs_$j = $commit &&
+git branch commit_packs_$j $commit &&
+ git update-ref HEAD $commit &&
+ git pack-objects --index-version=2 ${packdir}/test-pack 

[RFC PATCH 16/18] midx: nth_midxed_object_oid() and bsearch_midx()

2018-01-07 Thread Derrick Stolee
Using a binary search, we can navigate to the position n within a
MIDX file where an object appears in the ordered list of objects.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 midx.c | 30 ++
 midx.h |  9 +
 2 files changed, 39 insertions(+)

diff --git a/midx.c b/midx.c
index a66763b9e3..8c643caa92 100644
--- a/midx.c
+++ b/midx.c
@@ -299,6 +299,36 @@ const struct object_id *nth_midxed_object_oid(struct 
object_id *oid,
return oid;
 }
 
+int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, uint32_t 
*pos)
+{
+   uint32_t last, first = 0;
+
+   if (sha1[0])
+   first = ntohl(*(uint32_t*)(m->chunk_oid_fanout + 4 * (sha1[0] - 
1)));
+   last = ntohl(*(uint32_t*)(m->chunk_oid_fanout + 4 * sha1[0]));
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = m->chunk_oid_lookup + m->hdr->hash_len * mid;
+   cmp = hashcmp(sha1, current);
+   if (!cmp) {
+   *pos = mid;
+   return 1;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   *pos = first;
+   return 0;
+}
+
 int contains_pack(struct midxed_git *m, const char *pack_name)
 {
uint32_t first = 0, last = m->num_packs;
diff --git a/midx.h b/midx.h
index d8ede8121c..5598799189 100644
--- a/midx.h
+++ b/midx.h
@@ -101,6 +101,15 @@ extern const struct object_id 
*nth_midxed_object_oid(struct object_id *oid,
 struct midxed_git *m,
 uint32_t n);
 
+/*
+ * Perform a binary search on the object list in a MIDX file for the given 
sha1.
+ *
+ * If the object exists, then return 1 and set *pos to the position of the 
sha1.
+ * Otherwise, return 0 and set *pos to the position of the lex-first object 
greater
+ * than the given sha1.
+ */
+extern int bsearch_midx(struct midxed_git *m, const unsigned char *sha1, 
uint32_t *pos);
+
 extern int contains_pack(struct midxed_git *m, const char *pack_name);
 
 /*
-- 
2.15.0
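
For readers following along outside the Git tree, the fanout-bounded binary
search described in this patch can be sketched as standalone C. This is an
illustration under assumptions, not the patch's code: the names
toy_build_fanout() and toy_bsearch() are hypothetical, the fanout table is a
host-order array rather than big-endian data read with ntohl() from the mapped
file, and memcmp() stands in for hashcmp().

```c
#include <stdint.h>
#include <string.h>

/*
 * Build a 256-entry fanout table for a sorted table of fixed-length ids.
 * fanout[b] is the number of ids whose first byte is <= b.
 * (Hypothetical helper; the real file stores this table on disk.)
 */
static void toy_build_fanout(const unsigned char *table, size_t hash_len,
			     uint32_t nr, uint32_t fanout[256])
{
	uint32_t i = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (i < nr && table[hash_len * i] <= b)
			i++;
		fanout[b] = i;
	}
}

/*
 * Return 1 and set *pos to the index of `want` if it is present.
 * Otherwise return 0 and set *pos to the index where it would be inserted.
 * The fanout table narrows the search to ids sharing the first byte.
 */
static int toy_bsearch(const unsigned char *table, size_t hash_len,
		       const uint32_t fanout[256],
		       const unsigned char *want, uint32_t *pos)
{
	uint32_t first = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t last = fanout[want[0]];

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(want, table + hash_len * mid, hash_len);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```

The fanout lookup costs two array reads and typically shrinks the search
window by a factor of about 256 before the first comparison, which is the
reason the first byte gets special treatment.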



Re: merge-base --is-ancestor A B is unreasonably slow with unrelated history B

2018-01-09 Thread Derrick Stolee

On 1/9/2018 10:17 AM, Ævar Arnfjörð Bjarmason wrote:

This is a pathological case I don't have time to dig into right now:

 git branch -D orphan;
 git checkout --orphan orphan &&
 git reset --hard &&
 touch foo &&
 git add foo &&
 git commit -m"foo" &&
 time git merge-base --is-ancestor master orphan

This takes around 5 seconds on linux.git to return 1. Which is around
the same time it takes to run current master against the first commit in
linux.git:

 git merge-base --is-ancestor 1da177e4c3f4 master

This is obviously a pathological case, but maybe we should work slightly
harder on the RHS of and discover that it itself is an orphan commit.

I ran into this while writing a hook where we'd like to do:

 git diff $master...topic

Or not, depending on if the topic is an orphan or just something
recently branched off, figured I could use --is-ancestor as an
optimization, and then discovered it's not much of an optimization.


Ævar,

This is the same performance problem that we are trying to work around 
with Jeff's "Add --no-ahead-behind to status" patch [1]. For commits 
that are far apart, many commits need to be parsed. I think the right 
solution is to create a serialized commit graph that stores the 
adjacency information of the commits and can create commit structs 
quickly. This requires storing the commit id, commit date, parents, and 
root tree id to satisfy the needs of parse_commit_gently(). Once the 
framework for this data is constructed, it is simple to add generation 
numbers to that data and start consuming them in other algorithms (by 
adding the field to 'struct commit').


I'm working on such a patch right now, but it will be a few weeks before 
I'm ready.


Thanks,
-Stolee

[1] v5 of --no-ahead-behind 
https://public-inbox.org/git/20180109185018.69164-1-...@jeffhostetler.com/T/#t


[2] v4 of --no-ahead-behind 
https://public-inbox.org/git/nycvar.qro.7.76.6.1801091744540...@zvavag-6oxh6da.rhebcr.pbec.zvpebfbsg.pbz/T/#t





Re: [PATCH v4 0/4] Add --no-ahead-behind to status

2018-01-09 Thread Derrick Stolee

On 1/9/2018 8:15 AM, Johannes Schindelin wrote:

Hi Peff,

On Tue, 9 Jan 2018, Jeff King wrote:


On Mon, Jan 08, 2018 at 03:04:20PM -0500, Jeff Hostetler wrote:


I was thinking about something similar to the logic we use today
about whether to start reporting progress on other long commands.
That would mean you could still get the ahead/behind values if you
aren't that far behind but would only get "different" if that
calculation gets too expensive (which implies the actual value isn't
going to be all that helpful anyway).

After an off-line conversation with the others I'm going to look into
a version that is limited to n commits rather than be completely on or
off.  I think if you are for example less than 100 a/b then those numbers
have value; if you are past n, then they have much less value.

I'd rather do it by a fixed limit than by time to ensure that output
is predictable on graph shape and not on system load.

I like this direction a lot. I had hoped we could say "100+ commits
ahead",

How about "100+ commits apart" instead?


Unfortunately, we can _never_ guarantee more than 1 commit ahead/behind 
unless we walk to the merge base (or have generation numbers). For 
example, suppose the 101st commit in each history has a parent that is in 
the recent history of the other commit. (There must be merge commits to 
make this work without creating cycles, but the ahead/behind counts 
could be much lower than the number of walked commits.)





but I don't think we can do so accurately without generation numbers.

Even with generation numbers, it is not possible to say whether two given
commits reflect diverging branches or have an ancestor-descendant
relationship (or in graph speak: whether they are comparable).


If you walk commits using a priority queue where the priority is the 
generation number, then you can know that you have walked all reachable 
commits with generation greater than X, so you know among those commits 
which are comparable or not.


For this to work accurately, you must walk from both tips to a 
generation lower than each. It does not help the case where one branch 
is 100,000+ commits ahead, where most of those commits have higher 
generation number than the behind commit.



It could potentially make it possible to cut off the commit traversal, but
I do not even see how that would be possible.

The only thing you could say for sure is that two different commits with
the same generation number are for sure "uncomparable", i.e. reflect
diverging branches.


E.g., the case I mentioned at the bottom of this mail:

   https://public-inbox.org/git/20171224143318.gc23...@sigill.intra.peff.net/

I haven't thought too hard on it, but I suspect with generation numbers
you could bound it and at least say "100+ ahead" or "100+ behind".

If you have walked 100 commits and still have not found a merge base,
there is no telling whether one start point is the ancestor of the other.
All you can say is that there are more than 100 commits in the
"difference".

You would not even be able to say that the *shortest* path between those
two start points is longer than 100 commits, you can construct
pathological DAGs pretty easily.

Even if you had generation numbers, and one commit's generation number
was, say, 17, and the other one's was 17,171, you could not necessarily
assume that the 17 one is the ancestor of the 17,171 one, all you can say
is that it is not possible the other way round.


This is why we cannot _always_ use generation numbers, but they do help 
in some cases.
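
To make the "help in some cases" point concrete, here is a minimal sketch of
an ancestry check that prunes by generation number. Everything in it is
hypothetical (struct toy_commit, toy_is_ancestor(), commits stored as array
indices); it assumes a generation number defined as one more than the largest
parent generation, with roots at 1. It is not how any Git command is
implemented.

```c
#include <stdlib.h>

struct toy_commit {
	unsigned generation;	/* 1 + max(parent generations); roots are 1 */
	int parent[2];		/* indices into the commit array, -1 if unused */
	int seen;		/* scratch flag, reset on every call */
};

/*
 * Return 1 if commit `anc` is reachable from commit `tip`, 0 if not,
 * and -1 on allocation failure. *walked counts the commits visited.
 *
 * A commit whose generation is <= the generation of `anc` cannot have
 * `anc` as a proper ancestor, so its parents are never explored.
 */
static int toy_is_ancestor(struct toy_commit *c, size_t nr,
			   int anc, int tip, size_t *walked)
{
	int *stack = malloc(nr * sizeof(*stack));
	size_t top = 0, i;
	int found = 0;

	*walked = 0;
	if (!stack)
		return -1;
	for (i = 0; i < nr; i++)
		c[i].seen = 0;

	c[tip].seen = 1;
	stack[top++] = tip;
	while (top) {
		int cur = stack[--top];

		(*walked)++;
		if (cur == anc) {
			found = 1;
			break;
		}
		if (c[cur].generation <= c[anc].generation)
			continue;	/* too low to reach anc: prune */
		for (i = 0; i < 2; i++) {
			int p = c[cur].parent[i];

			if (p >= 0 && !c[p].seen) {
				c[p].seen = 1;
				stack[top++] = p;
			}
		}
	}

	free(stack);
	return found;
}
```

With a long chain plus one orphan commit, asking whether the chain's tip is an
ancestor of the orphan stops after one commit, because the orphan's generation
is already too low. Asking the reverse question still walks the whole chain,
which is the case where generation numbers do not help.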



But I don't think you can approximate both ahead and behind together
without finding the actual merge base.

But even still, finding small answers quickly and accurately and punting
to "really far, I didn't bother to compute it" on the big ones would be
an improvement over always punting.

Indeed. The longer I think about it, the more I like the "100+ commits
apart" idea.



Again, I strongly suggest we drop this approach because it will be more 
pain than it is worth.


Thanks,
-Stolee


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-07 Thread Derrick Stolee

On 1/7/2018 5:42 PM, Ævar Arnfjörð Bjarmason wrote:


On Sun, Jan 07 2018, Derrick Stolee jotted:


 git log --oneline --raw --parents

Num Packs | Before MIDX | After MIDX |  Rel % | 1 pack %
----------+-------------+------------+--------+----------
        1 |     35.64 s |    35.28 s |  -1.0% |    -1.0%
       24 |     90.81 s |    40.06 s | -55.9% |   +12.4%
      127 |    257.97 s |    42.25 s | -83.6% |   +18.6%

The last column is the relative difference between the MIDX-enabled repo
and the single-pack repo. The goal of the MIDX feature is to present the
ODB as if it was fully repacked, so there is still room for improvement.

Changing the command to

 git log --oneline --raw --parents --abbrev=40

has no observable difference (sub 1% change in all cases). This is likely
due to the repack I used putting commits and trees in a small number of
packfiles so the MRU cache works very well. On more naturally-created
lists of packfiles, there can be up to 20% improvement on this command.

We are using a version of this patch with an upcoming release of GVFS.
This feature is particularly important in that space since GVFS performs
a "prefetch" step that downloads a pack of commits and trees on a daily
basis. These packfiles are placed in an alternate that is shared by all
enlistments. Some users have 150+ packfiles and the MRU misses and
abbreviation computations are significant. Now, GVFS manages the MIDX file
after adding new prefetch packfiles using the following command:

 git midx --write --update-head --delete-expired --pack-dir=


(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.


The point of including the --abbrev=40 is to point out that object 
lookups do not get slower with the MIDX feature. Using these "git log" 
options is a good way to balance object lookups and abbreviations with 
object parsing and diff machinery. And while the public data shape I 
shared did not show a difference, our private testing of the Windows 
repository did show a valuable improvement when isolating to object 
lookups and ignoring abbreviation calculations.



The reason we do that at all is because it makes for a prettier UI.


We tried setting core.abbrev=40 on GVFS enlistments to speed up 
performance and the users rebelled against the hideous output. They 
would rather have slower speeds than long hashes.



Are there things that both want the pretty SHA-1 and also care about the
throughput? I'd have expected machine parsing to just use
--no-abbrev-commit.


The --raw flag outputs blob hashes, so the --abbrev=40 covers all hashes.


If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambiguous after the
next commit.


Unfortunately, with the way the abbreviation algorithms work, using 
--abbrev=20 will have similar performance problems because you still 
need to inspect all packfiles to ensure there isn't a collision in the 
first 20 hex characters.
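
The point above, that a larger minimum length still requires probing every
pack index, can be sketched as follows. This is a toy model with hypothetical
names (struct toy_pack, toy_unique_abbrev()): each "pack" is a sorted array of
equal-length hex strings with no duplicates inside one pack, and the real
abbreviation code is more involved.

```c
#include <string.h>

struct toy_pack {
	const char **ids;	/* sorted, equal-length hex object names */
	size_t nr;
};

/* Length of the common prefix of two hex strings. */
static size_t toy_common_prefix(const char *a, const char *b)
{
	size_t i = 0;

	while (a[i] && a[i] == b[i])
		i++;
	return i;
}

/*
 * Return the shortest prefix length >= min_len that distinguishes `hex`
 * from every other object in every pack. *packs_probed counts how many
 * pack indexes had to be searched: always all of them, whatever min_len is.
 */
static size_t toy_unique_abbrev(const struct toy_pack *packs, size_t nr_packs,
				const char *hex, size_t min_len,
				size_t *packs_probed)
{
	size_t len = min_len, i;

	*packs_probed = 0;
	for (i = 0; i < nr_packs; i++) {
		const struct toy_pack *p = &packs[i];
		size_t lo = 0, hi = p->nr, need;

		/* find the first id that sorts at or after hex */
		while (lo < hi) {
			size_t mid = lo + (hi - lo) / 2;

			if (strcmp(p->ids[mid], hex) < 0)
				lo = mid + 1;
			else
				hi = mid;
		}
		(*packs_probed)++;

		/* the closest other ids are the neighbours of that slot */
		if (lo < p->nr && !strcmp(p->ids[lo], hex)) {
			if (lo + 1 < p->nr) {
				need = toy_common_prefix(p->ids[lo + 1], hex) + 1;
				if (need > len)
					len = need;
			}
		} else if (lo < p->nr) {
			need = toy_common_prefix(p->ids[lo], hex) + 1;
			if (need > len)
				len = need;
		}
		if (lo > 0) {
			need = toy_common_prefix(p->ids[lo - 1], hex) + 1;
			if (need > len)
				len = need;
		}
	}
	return len;
}
```

A single combined index replaces the outer loop with one search, which is
where a multi-pack index would save time for abbreviations.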


Thanks,
-Stolee




Re: [RFC PATCH 01/18] docs: Multi-Pack Index (MIDX) Design Notes

2018-01-08 Thread Derrick Stolee

On 1/8/2018 2:32 PM, Jonathan Tan wrote:

On Sun,  7 Jan 2018 13:14:42 -0500
Derrick Stolee <sto...@gmail.com> wrote:


+Design Details
+--
+
+- The MIDX file refers only to packfiles in the same directory
+  as the MIDX file.
+
+- A special file, 'midx-head', stores the hash of the latest
+  MIDX file so we can load the file without performing a dirstat.
+  This file is especially important with incremental MIDX files,
+  pointing to the newest file.

I presume that the actual MIDX files are named by hash? (You might have
written this somewhere that I somehow missed.)

Also, I notice that in the "Future Work" section, the possibility of
multiple MIDX files is raised. Could this 'midx-head' file be allowed to
store multiple such files? That way, we avoid a bit of file format
churn (in that we won't need to define a new "chunk" in the future).


I hadn't considered this idea, and I like it. I'm not sure this is a 
robust solution, since isolated MIDX files don't contain information 
that they could use other MIDX files, or what order they should be in. I 
think the "order" of incremental MIDX files is important in a few ways 
(such as the "stable object order" idea).


I will revisit this idea when I come back with the incremental MIDX 
feature. For now, the only reference to "number of base MIDX files" is 
in one byte of the MIDX header. We should consider changing that byte 
for this patch.



+- If a packfile exists in the pack directory but is not referenced
+  by the MIDX file, then the packfile is loaded into the packed_git
+  list and Git can access the objects as usual. This behavior is
+  necessary since other tools could add packfiles to the pack
+  directory without notifying Git.
+
+- The MIDX file should be only a supplemental structure. If a
+  user downgrades or disables the `core.midx` config setting,
+  then the existing .idx and .pack files should be sufficient
+  to operate correctly.

Let me try to summarize: so, at this point, there are no
backwards-incompatible changes to the repo disk format. Unupdated code
paths (and old versions of Git) can just read the .idx and .pack files,
as always. Updated code paths will look at the .midx and .idx files, and
will sort them as follows:
  - .midx files go into a data structure
  - .idx files not referenced by any .midx files go into the
existing packed_git data structure

A writer can either merely write a new packfile (like old versions of
Git) or write a packfile and update the .midx file, and everything above
will still work. In the event that a writer deletes an existing packfile
referenced by a .midx (for example, old versions of Git during a
repack), we will lose the advantages of the .midx file - we will detect
that the .midx no longer works when attempting to read an object given
its information, but in this case, we can recover by dropping the .midx
file and loading all the .idx files it references that still exist.

As a reviewer, I think this is a very good approach, and this does make
things easier to review (as opposed to, say, an approach where a lot of
the code must be aware of .midx files).


Thanks! That is certainly the idea. If you know about MIDX, then you can 
benefit from it. If you do not, then you have all the same data 
available to you to do your work. Having a MIDX file will not break 
other tools (libgit2, JGit, etc.).


One thing I'd like to determine before this patch goes to v1 is how much 
we should make the other packfile-aware commands also midx-aware. My gut 
reaction right now is to have git-repack call 'git midx --clear' if 
core.midx=true and a packfile was deleted. However, this could easily be 
changed with 'git midx --clear' followed by 'git midx --write 
--update-head' if midx-head exists.


Thanks,
-Stolee


Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-10 Thread Derrick Stolee

On 1/10/2018 1:25 PM, Martin Fick wrote:

On Sunday, January 07, 2018 01:14:41 PM Derrick Stolee
wrote:

This RFC includes a new way to index the objects in
multiple packs using one file, called the multi-pack
index (MIDX).

...

The main goals of this RFC are:

* Determine interest in this feature.

* Find other use cases for the MIDX feature.

My interest in this feature would be to speed up fetches
when there is more than one large pack-file with many of the
same objects that are in other pack-files.   What does your
MIDX design do when it encounters multiple copies of the
same object in different pack files?  Does it index them all,
or does it keep a single copy?


The MIDX currently keeps only one reference to each object. Duplicates 
are dropped during writing. (See the care taken in commit 04/18 to avoid 
duplicates.) Since midx_sha1_compare() does not use anything other than 
the OID to order the objects, there is no decision being made about 
which pack is "better". The MIDX writes the first copy it finds and 
discards the others.


It would not be difficult to include a check in midx_sha1_compare() to 
favor one packfile over another based on some measurement (size? 
mtime?). Since this would be a heuristic at best, I left it out of the 
current patch.
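
As a rough sketch of the write-time behavior described above (sort all
entries by OID, keep one copy of each), consider the following. The names are
hypothetical and the hash length is shortened for readability. One deliberate
difference from the patch is labeled in the code: the comparison breaks ties
on pack_int_id so that the surviving copy is deterministic, which is one
possible form of the "favor one packfile" heuristic and not something the
current patch does.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_HASH_LEN 4	/* shortened for the example; SHA-1 uses 20 bytes */

struct toy_entry {
	unsigned char oid[TOY_HASH_LEN];
	uint32_t pack_int_id;
	uint64_t offset;
};

/*
 * Order by OID first. ASSUMPTION: ties are broken by pack_int_id so the
 * lowest-numbered pack wins; an OID-only comparison leaves the winner
 * up to the sort implementation.
 */
static int toy_entry_cmp(const void *va, const void *vb)
{
	const struct toy_entry *a = va, *b = vb;
	int cmp = memcmp(a->oid, b->oid, TOY_HASH_LEN);

	if (cmp)
		return cmp;
	return (a->pack_int_id > b->pack_int_id) -
	       (a->pack_int_id < b->pack_int_id);
}

/* Sort entries and drop duplicate OIDs in place; return the new count. */
static size_t toy_dedupe(struct toy_entry *e, size_t nr)
{
	size_t i, out = 0;

	qsort(e, nr, sizeof(*e), toy_entry_cmp);
	for (i = 0; i < nr; i++) {
		if (out && !memcmp(e[out - 1].oid, e[i].oid, TOY_HASH_LEN))
			continue;	/* same object as the one just kept */
		e[out++] = e[i];
	}
	return out;
}
```

Swapping the tie-break for pack size or mtime would only change
toy_entry_cmp(); the deduplication loop stays the same.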



In our Gerrit instance (Gerrit uses jgit), we have multiple
copies of the linux kernel repos linked together via the
alternatives file mechanism.


GVFS also uses alternates for sharing packfiles across multiple copies 
of the repo. The MIDX is designed to cover all packfiles in the same 
directory, but is not designed to cover packfiles in multiple 
alternates; currently, each alternate would need its own MIDX file. Does 
that cause issues with your setup?



   These repos have many different
references (mostly Gerrit change references), but they share
most of the common objects from the mainline.  I have found
that during a large fetch such as a clone, jgit spends a
significant amount of extra time by having the extra large
pack-files from the other repos visible to it, usually around
an extra minute per instance of these (without them, the
clone takes around 7mins).  This adds up easily with a few
extra repos, it can almost double the time.

My investigations have shown that this is due to jgit
searching each of these pack files to decide which version of
each object to send.  I don't fully understand its selection
criteria, however if I shortcut it to just pick the first
copy of an object that it finds, I regain my lost time.  I
don't know if git suffers from a similar problem?  If git
doesn't suffer from this then it likely just uses the first
copy of an object it finds (which may not be the best object
to send?)

It would be nice if this use case could be improved with
MIDX.  To do so, it seems that MIDX would need either to
store only "the best" version of an object (i.e. pre-select
which one to use), or to include the extra information that
makes selecting which copy to use (perhaps based on the
operation being performed) fast.


I'm not sure if there is sufficient value in storing multiple references 
to the same object stored in multiple packfiles. There could be value in 
carefully deciding which copy is "best" during the MIDX write, but 
during read is not a good time to make such a decision. It also 
increases the size of the file to store multiple copies.



This also leads me to ask, what other additional information
(bitmaps?) for other operations, besides object location,
might suddenly be valuable in an index that potentially
points to multiple copies of objects?  Would such
information be appropriate in MIDX, or would it be better in
another index?


For applications to bitmaps, it is probably best that we only include 
one copy of each object. Otherwise, we need to include extra bits in the 
bitmaps for those copies (when asking "is this object in the bitmap?").


Thanks for the context with Gerrit's duplicate object problem. I'll try 
to incorporate it into the design document (commit 01/18) for the v1 patch.


Thanks,
-Stolee



Re: [RFC PATCH 00/18] Multi-pack index (MIDX)

2018-01-08 Thread Derrick Stolee

On 1/8/2018 5:20 AM, Jeff King wrote:

On Sun, Jan 07, 2018 at 07:08:54PM -0500, Derrick Stolee wrote:


(Not a critique of this, just a (stupid) question)

What's the practical use-case for this feature? Since it doesn't help
with --abbrev=40 the speedup is all in the part that ensures we don't
show an ambiguous SHA-1.

The point of including the --abbrev=40 is to point out that object lookups
do not get slower with the MIDX feature. Using these "git log" options is a
good way to balance object lookups and abbreviations with object parsing and
diff machinery. And while the public data shape I shared did not show a
difference, our private testing of the Windows repository did show a
valuable improvement when isolating to object lookups and ignoring
abbreviation calculations.

Just to make sure I'm parsing this correctly: normal lookups do get faster
when you have a single index, given the right setup?

I'm curious what that setup looked like. Is it just tons and tons of
packs? Is it ones where the packs do not follow the mru patterns very
well?


The way I repacked the Linux repo creates an artificially good set of 
packs for the MRU cache. When the packfiles are partitioned instead by 
the time the objects were pushed to a remote, the MRU cache performs 
poorly. Improving these object lookups is a primary reason for the MIDX 
feature, and almost all commands improve because of it. 'git log' is 
just the simplest to use for demonstration.



I think it's worth thinking a bit about, because...


If something cares about both throughput and e.g. is saving the
abbreviated SHA-1s isn't it better off picking some arbitrary size
(e.g. --abbrev=20), after all the default abbreviation is going to show
something as small as possible, which may soon become ambiguous after the
next commit.

Unfortunately, with the way the abbreviation algorithms work, using
--abbrev=20 will have similar performance problems because you still need to
inspect all packfiles to ensure there isn't a collision in the first 20 hex
characters.

...if what we primarily care about speeding up is abbreviations, is it
crazy to consider disabling the disambiguation step entirely?

The results of find_unique_abbrev are already a bit of a probability
game. They're guaranteed at the moment of generation, but as more
objects are added, ambiguities may be introduced. Likewise, what's
unambiguous for you may not be for somebody else you're communicating
with, if they have their own clone.

Since we now scale the default abbreviation with the size of the repo,
that gives us a bounded and pretty reasonable probability that we won't
hit a collision at all[1].

I.e., what if we did something like this:

diff --git a/sha1_name.c b/sha1_name.c
index 611c7d24dd..04c661ba85 100644
--- a/sha1_name.c
+++ b/sha1_name.c
@@ -600,6 +600,15 @@ int find_unique_abbrev_r(char *hex, const unsigned char 
*sha1, int len)
if (len == GIT_SHA1_HEXSZ || !len)
return GIT_SHA1_HEXSZ;
  
+	/*

+* A default length of 10 implies a repository big enough that it's
+* getting expensive to double check the ambiguity of each object,
+* and the chance that any particular object of interest has a
+* collision is low.
+*/
+   if (len >= 10)
+   return len;
+
mad.init_len = len;
mad.cur_len = len;
mad.hex = hex;

If I repack my linux.git with "--max-pack-size=64m", I get 67 packs. The
patch above drops "git log --oneline --raw" on the resulting repo from
~150s to ~30s.

With a single pack, it goes from ~33s to ~29s. Less impressive, but there's
still some benefit.

There may be other reasons to want MIDX or something like it, but I just
wonder if we can do this much simpler thing to cover the abbreviation
case. I guess the question is whether somebody is going to be annoyed in
the off chance that they hit a collision.


Not only are users going to be annoyed when they hit collisions after 
copy-pasting an abbreviated hash, there are also a large number of tools 
that people build that use abbreviated hashes (either for presenting to 
users or because they didn't turn off abbreviations).


Abbreviations cause performance issues in other commands, too (like 
'fetch'!), so whatever short-circuit you put in, it would need to be 
global. A flag on one builtin would not suffice.



-Peff

[1] I'd have to check the numbers, but IIRC we've set the scaling so
 that the chance of having a _single_ collision in the repository is
 less than 50%, and rounding to the conservative side (since each hex
 char gives us 4 bits). And indeed, "git log --oneline --raw" on
 linux.git does not seem to have any collisions at its default of 12
 characters, at least in my clone.

 We could also consider switching core.disambiguate to "commit",
 which makes even a collision less likely to annoy the user.
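
The footnote's reasoning can be sketched as a birthday-style bound. This is only an illustration, not the scaling code Git ships: `min_abbrev_len()` is a hypothetical helper, and it assumes object names behave like uniformly random hex strings. With n objects and len hex characters there are n*(n-1)/2 pairs, each colliding with probability 1/16^len, so requiring n*(n-1) < 16^len keeps the expected number of colliding pairs (and therefore the chance of any collision) below 1/2.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from Git): smallest abbreviation length, in
 * hex characters, for which the expected number of colliding pairs
 * among n uniformly random object names stays below 1/2.
 * Assumes n < 2^30 so that n * (n - 1) fits comfortably in 64 bits.
 */
static int min_abbrev_len(uint64_t n)
{
	uint64_t twice_pairs = n ? n * (n - 1) : 0;
	uint64_t space = 16;	/* 16^len, starting at len = 1 */
	int len = 1;

	/* Each extra hex character adds 4 bits, i.e. multiplies by 16. */
	while (len < 15 && space <= twice_pairs) {
		space *= 16;
		len++;
	}
	return len;
}
```

For an assumed repository of six million objects this yields 12 characters, which is in line with the default mentioned for linux.git above.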





Re: [PATCH 03/14] packed-graph: create git-graph builtin

2018-01-26 Thread Derrick Stolee

On 1/25/2018 6:01 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


Teach Git the 'graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a check that the core.graph setting is enabled
and a '--pack-dir' option.

Just to set my expectation straight.

Is it fair to say that in the ideal endgame state, this will be like
"git pack-objects" in that end users won't have to know about it,
but would serve as a crucial building block that is invoked by other
front-end commands that are more familiar to end users (just like
pack-objects are used behind the scenes by repack, push, etc.)?


That is my hope. Leaving that integration for later, after this feature 
has proven itself.


Re: [PATCH 04/14] packed-graph: add format document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 5:07 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:

Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/technical/graph-format.txt | 88 

So this is different from Documentation/technical/packed-graph.txt,
which gives high level design and this gives the details on how
to set bits.


  1 file changed, 88 insertions(+)
  create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt 
b/Documentation/technical/graph-format.txt
new file mode 100644
index 00..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+===
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'C', 'G', 'P', 'H'}
+
+   1-byte version number:
+   Currently, the only valid version is 1.
+
+   1-byte Object Id Version (1 = SHA-1)
+
+   1-byte Object Id Length (H)

   This is 20 or 40 for sha1 ? (binary or text representation?)


20 for binary.




+   1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes listing the table of contents for the chunks:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.

... offset [in bytes/words/4k blocks?] in ...


bytes.
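
To make the header and chunk lookup layout concrete, here is a minimal sketch of a reader for the format as quoted above. The names (`graph_header`, `parse_graph_header`, `find_chunk`) are hypothetical and not taken from the series. It assumes the header is exactly 8 bytes (4-byte signature plus four 1-byte fields) and that the 8-byte chunk offsets are stored in network order like the 4-byte numbers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the 8-byte header. */
struct graph_header {
	uint8_t version;
	uint8_t oid_version;
	uint8_t oid_len;	/* H: 20 for binary SHA-1 */
	uint8_t num_chunks;	/* C */
};

static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/* Returns 0 on success, -1 if the buffer does not look like a graph file. */
static int parse_graph_header(const unsigned char *data, size_t len,
			      struct graph_header *out)
{
	if (len < 8 || memcmp(data, "CGPH", 4))
		return -1;
	out->version = data[4];
	out->oid_version = data[5];
	out->oid_len = data[6];
	out->num_chunks = data[7];
	if (out->version != 1)
		return -1;
	/* The table of contents has (C + 1) entries of 12 bytes each. */
	if (len < 8 + ((size_t)out->num_chunks + 1) * 12)
		return -1;
	return 0;
}

/* Returns the byte offset of the chunk with the given id, or 0 if absent. */
static uint64_t find_chunk(const unsigned char *data,
			   const struct graph_header *h, uint32_t id)
{
	const unsigned char *toc = data + 8;
	unsigned int i;

	for (i = 0; i < h->num_chunks; i++) {
		const unsigned char *entry = toc + (size_t)i * 12;
		if (read_be32(entry) == id)
			return read_be64(entry + 4);
	}
	return 0;
}
```

A chunk id such as {'O', 'I', 'D', 'F'} would be passed as the 32-bit value 0x4f494446 in this sketch.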




+   (Chunks are ordered contiguously in the file, so you can infer
+   the length using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of commits (N).

So F[0] > 0 for git.git for example.

Or another way: To lookup a 01xxx, I need to look at
entry(F[00] + 1 )...entry(F[01]).

Makes sense.
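
The lookup described here can be sketched as follows. This is an illustration with hypothetical names, assuming the fanout has already been converted to host byte order and that positions are 0-based, so OIDs with first byte b occupy positions F[b-1] up to (but not including) F[b].

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical lookup: fanout[i] holds the number of OIDs whose first
 * byte is at most i, and oids is the sorted OID Lookup table.
 * Returns 1 and sets *pos if want is present, 0 otherwise.
 */
static int find_oid(const uint32_t fanout[256], const unsigned char *oids,
		    size_t oid_len, const unsigned char *want, uint32_t *pos)
{
	/* The fanout narrows the search to OIDs sharing want's first byte. */
	uint32_t lo = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t hi = fanout[want[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(want, oids + (size_t)mid * oid_len, oid_len);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```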


+
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all commits in the graph.

... sorted ascending.



+   Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+   * The first H bytes are for the OID of the root tree.
+   * The next 8 bytes are for the int-ids of the first two parents of
+ the ith commit. Stores value 0x if no parent in that position.
+ If there are more than two parents, the second value has its most-
+ significant bit on and the other bits store an offset into the Large
+ Edge List chunk.

s/an offset into/position in/ ? (otherwise offset in bytes?)


+   * The next 8 bytes store the generation number of the commit and the
+ commit time in seconds since EPOCH. The generation number uses the
+ higher 30 bits of the first 4 bytes, while the commit time uses the
+ 32 bits of the second 4 bytes, along with the lowest 2 bits of the
+ lowest byte, storing the 33rd and 34th bit of the commit time.

This allows for a maximum generation number of
1.073.741.823 (2^30 -1) = 1 billion,
and a max time stamp of later than 2100.
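
One reading of that 8-byte layout, as a sketch: the first big-endian word carries the generation number in its upper 30 bits and bits 33-34 of the commit time in its lowest 2 bits, and the second word carries the low 32 bits of the commit time. The function name is hypothetical, and the sketch treats the time as unsigned, which is an assumption.

```c
#include <stdint.h>

static uint32_t be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Hypothetical decoder for the last 8 bytes of a Commit Data record,
 * following the layout described in the format document above.
 */
static void decode_gen_and_date(const unsigned char *p,
				uint32_t *generation, uint64_t *commit_time)
{
	uint32_t first = be32(p);
	uint32_t second = be32(p + 4);

	/* Upper 30 bits: generation number (at most 2^30 - 1). */
	*generation = first >> 2;
	/* Lowest 2 bits extend the 32-bit time to 34 bits. */
	*commit_time = ((uint64_t)(first & 0x3) << 32) | second;
}
```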

Do you allow negative time stamps?



+
+   [Optional] Large Edge List (ID: {'E', 'D', 'G', 'E'})
+   This list of 4-byte values store the second through nth parents for
+   

Re: [PATCH 01/14] graph: add packed graph design document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 4:14 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/technical/packed-graph.txt | 185 +++
  1 file changed, 185 insertions(+)
  create mode 100644 Documentation/technical/packed-graph.txt

I really wanted to like having this patch at the beginning, but
unfortunately I didn't see the actual file format description,
which was a bit disappointing.  An example of the things that I was
curious about was how the "integer ID" is used to access into the
file.  If we could somehow use "integer ID" as an index into an
array of fixed size elements, it would be ideal to gain "fast
lookups", but because of the "list of parents" thing, it needs some
trickery to do so, and that was among the things that I wanted to
see how much thought went into the design, for example.


There is definitely a chicken-or-the-egg situation here. I'm happy to 
start with the format before the design document.


I can try to expand this "integer ID" concept, but you can see how I use 
it in the following method from patch 11/14:


+int parse_packed_commit(struct commit *item)
+{
+    if (!core_graph)
+    return 0;
+    if (item->object.parsed)
+    return 1;
+
+    prepare_packed_graph();
+    if (packed_graph) {
+    uint32_t pos;
+    int found;
+    if (item->graphId != 0x) {
+    pos = item->graphId;
+    found = 1;
+    } else {
+    found = bsearch_graph(packed_graph, &(item->object.oid), &pos);

+    }
+
+    if (found)
+    return fill_packed_commit(item, packed_graph, pos);
+    }
+
+    return 0;
+}

Note that if item->graphId has a "real" value (not 0x which in 
hindsight should be a macro) then we navigate directly to that position 
in the graph. Otherwise, we use binary search to query the graph's 
commit list to find the position (if the commit is packed).
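
Stripped of the Git specifics, the two lookup paths look roughly like this. Everything here is a toy: `toy_graph`, `toy_commit` and `GRAPH_POS_UNKNOWN` are hypothetical names, the sentinel value is an assumption, and caching the position after a successful search is my guess at the intent rather than something shown in the method above.

```c
#include <stdint.h>
#include <string.h>

#define OID_LEN 20
#define GRAPH_POS_UNKNOWN UINT32_MAX	/* hypothetical sentinel value */

struct toy_graph {
	const unsigned char *oids;	/* nr sorted OIDs, OID_LEN bytes each */
	uint32_t nr;
};

struct toy_commit {
	unsigned char oid[OID_LEN];
	uint32_t graph_pos;		/* GRAPH_POS_UNKNOWN until located */
};

/* Plain binary search over the sorted OID list. */
static int toy_bsearch(const struct toy_graph *g, const unsigned char *oid,
		       uint32_t *pos)
{
	uint32_t lo = 0, hi = g->nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, g->oids + (size_t)mid * OID_LEN, OID_LEN);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

/* Use the cached position if there is one; otherwise search and cache. */
static int locate_commit(const struct toy_graph *g, struct toy_commit *c,
			 uint32_t *pos)
{
	if (c->graph_pos != GRAPH_POS_UNKNOWN) {
		*pos = c->graph_pos;	/* O(1): no search needed */
		return 1;
	}
	if (toy_bsearch(g, c->oid, pos)) {
		c->graph_pos = *pos;
		return 1;
	}
	return 0;			/* commit is not in the graph */
}
```

Since parents are stored as positions, a walk that starts from one located commit can reach its ancestors without any further binary searches.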



diff --git a/Documentation/technical/packed-graph.txt 
b/Documentation/technical/packed-graph.txt
new file mode 100644
index 00..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:

s/data shape/history shape/ may make it even clearer.


+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.

Commits created after a packed graph file is built may of course not
appear in a packed graph file, but that is OK because they never need
to be listed as parents of commits in the file.  So "list of parents"
can always refer to the parents using the "integer ID for fast lookup".


One thing I need to test locally is what happens with boundary commits 
of a shallow clone. The commit's parents are not in the repo, so they 
will not be in the graph. I think that parse_commit_buffer() drops the 
parents, so the graph will treat them as root commits.



Makes sense.  Item 2. might want to say "The list of parents, using
the fast lookup integer ID (see 5.) as reference instead of OID",
though.


That will be more specific, thanks.


+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit.

When a commit A can reach roots X and Y, and Y is further than X,
the distance between Y and A becomes A's generation number.  "One
more than the length of the path from the commit to the furthest
root commit it can reach", in other words.
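
As a small worked example of the recursive definition, here is a sketch over a toy DAG. The `gen_node` struct and `compute_generation()` are hypothetical and limited to two parents for brevity. A stored value of 0 means "not computed", matching the reserved value in the format document.

```c
#include <stdint.h>

/* Toy commit: parents are indices into the same array. */
struct gen_node {
	int nr_parents;
	int parents[2];
	uint32_t generation;	/* 0 means "not computed yet" */
};

/*
 * Memoized version of the recursive definition: a root commit gets 1,
 * any other commit gets 1 + the largest generation among its parents.
 * Recursion depth equals the length of the history, so a real
 * implementation would likely want an explicit stack instead.
 */
static uint32_t compute_generation(struct gen_node *nodes, int i)
{
	struct gen_node *n = &nodes[i];
	uint32_t max = 0;
	int p;

	if (n->generation)
		return n->generation;
	for (p = 0; p < n->nr_parents; p++) {
		uint32_t g = compute_generation(nodes, n->parents[p]);
		if (g > max)
			max = g;
	}
	n->generation = max + 1;
	return n->generation;
}
```

With roots X and Y, a chain Y <- B <- C, and a merge A of X and C, A gets generation 4: one more than the length (3) of the path to its furthest root Y.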


My "Equivalently,..." sentence 

Re: [PATCH 03/14] packed-graph: create git-graph builtin

2018-01-26 Thread Derrick Stolee

On 1/25/2018 4:45 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:

Teach Git the 'graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a check that the core.graph setting is enabled
and a '--pack-dir' option.

I wonder if this builtin should not respect the boolean core graph,
as this new builtin command's whole existence
is to deal with these new files?

As you assume this builtin as a plumbing command, I would
expect it to pay less attention to config rather than more.


My thought was to alert the caller "This graph isn't going to be good 
for anything!" and fail quickly before doing work. You do have a good 
point, and I think we can remove that condition here. When we integrate 
with other commands ('repack', 'fetch', 'clone') we will want a 
different setting that signals automatically writing the graph and we 
don't want those to fail because they are not aware of a second config 
setting.





@@ -408,6 +408,7 @@ static struct cmd_struct commands[] = {
 { "fsck-objects", cmd_fsck, RUN_SETUP },
 { "gc", cmd_gc, RUN_SETUP },
 { "get-tar-commit-id", cmd_get_tar_commit_id },
+   { "graph", cmd_graph, RUN_SETUP_GENTLY },

Why gently, though?

 From reading the docs (and assumptions on further patches)
we'd want to abort if there is no .git dir to be found?

Or is a future patch having manual logic? (e.g. if pack-dir is
given, the command may be invoked from outside a git dir)


You are right. I inherited this from my MIDX patch which can operate on 
a list of IDX files without a .git folder. The commit graph operations 
need an ODB.


Thanks,
-Stolee



Re: [PATCH 06/14] packed-graph: implement git-graph --write

2018-01-26 Thread Derrick Stolee

On 1/25/2018 6:28 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:


+
+$ git midx --write

midx?


Looks like I missed some replacements as I was building this. Now you 
see how I hope the feedback from this patch will inform the MIDX patch. ;)





+test_done

The tests basically test that there is no segfault?
Makes sense.


Also checks that files are written based on the output hash. The next 
commits give inspection ability.


Thanks,
-Stolee


Re: [PATCH 02/14] packed-graph: add core.graph setting

2018-01-26 Thread Derrick Stolee

On 1/25/2018 4:43 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/config.txt | 3 +++
  cache.h  | 1 +
  config.c | 5 +
  environment.c| 1 +
  4 files changed, 10 insertions(+)

Before you get too married to the name "graph", is it reasonable to
assume that the commit ancestry graph is the primary "graph" that
should come to users' minds when a simple word "graph" is used in
the context of discussing Git?  I suspect not.

Let's not just call this "core.graph" and "packed-graph", and in
addition give some adjective before "graph".


I was so focused on wanting the word "graph" that, since "graph.c" already 
existed in the source root, I came up with "packed-graph.c" just to have 
a separate filename. Clearly, "commit-graph" should be used instead. In 
v2, I'll use "/commit-graph.c" and "/builtin/commit-graph.c".


Thanks,
-Stolee


Re: [PATCH 04/14] packed-graph: add format document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 5:06 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/technical/graph-format.txt | 88 
  1 file changed, 88 insertions(+)
  create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt 
b/Documentation/technical/graph-format.txt
new file mode 100644
index 00..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+===

Good that this is not saying "graph format" but is explicit that it
is about "commit".  Do the same for the previous steps.  Especially,
builtin/graph.c that does not have much to do with graph.c is not a
good way forward ;-)


:+1:


I do like the fact that later parents of octopus merges are moved
out of way to make the majority of records fixed length, but I am
not sure if the "up to two parents are recorded in line" is truly
the best arrangement.  Aren't majority of commits single-parent,
thereby wasting 4 bytes almost always?

Will 32-bit stay to be enough for everybody?  Wouldn't it make sense
to at least define them to be indices into arrays (i.e. scaled to
element size), not "offsets", to recover a few lost bits?


I incorrectly used the word "offset" when I meant "array position" for 
the edge values.



What's the point of storing object id length?  If you do not
understand the object ID scheme, knowing only the length would not
do you much good anyway, no?  And if you know the hashing scheme
specified by Object ID version, you already know the length, no?


I'll go read the OID transition document to learn more, but I didn't 
know if there were plans for things like "Use SHA3 but with different 
hash lengths depending on user requirements". One side benefit is that 
we immediately know the width of our commit and tree references within 
the commit graph file without needing to consult a table of hash 
definitions.



On 1/25/2018 5:18 PM, Stefan Beller wrote:

git.git has ~37k non-merge commits and ~12k merge commits,
(35 of them have 3 or more parents).

So 75% would waste the 4 bytes of the second parent.

However the first parent is still there, so any operation that only needs
the first parent (git bisect --first-parent?) would still be fast.
Not sure if that is common.


The current API boundary does not support this, as parse_commit_gently() 
is not aware of the --first-parent option. The benefits of injecting 
that information are probably not worth the complication.


On 1/25/2018 5:29 PM, Junio C Hamano wrote:

Stefan Beller <sbel...@google.com> writes:


The downside of just having one parent or pointer into the edge list
would be to penalize 25% of the commit lookups with an indirection
compared to ~0% (the 35 octopus'). I'd rather want to optimize for
speed than disk size? (4 bytes for 37k is 145kB for git.git, which I
find is not a lot).

My comment is not about disk size but is about the size of working
set (or "size of array element").


I do want to optimize for speed over space, at least for two-parent 
commits. Hopefully my clarification about offset/array position 
addresses Junio's concerns here.


Thanks,
-Stolee



Re: [PATCH 00/14] Serialized Commit Graph

2018-01-26 Thread Derrick Stolee

On 1/25/2018 6:06 PM, Ævar Arnfjörð Bjarmason wrote:

On Thu, Jan 25 2018, Derrick Stolee jotted:

Oops! This is my mistake. The correct command should be:

  git show-ref -s | git graph --write --update-head --stdin-commits

Without "--stdin-commits" the command will walk all packed objects
to look for commits and then build the graph. That's why it's taking
so long. That method takes several minutes on the Linux repo, but with
--stdin-commits it should take as long as "git log >/dev/null".

Thanks, it took around 15m to finish with the command I initially ran on
my test repo.

Then the `merge-base --is-ancestor` performance problem I was
complaining about in
https://public-inbox.org/git/87608bawoa@evledraar.gmail.com/ takes
around 1s with your series, 5s without it. Nice.


Thanks for testing this! May I ask how many commits are in your repo? 
One way to find out is to run 'git graph --read' and it will tell you 
how many commits are in the serialized graph.


Thanks,
-Stolee


Re: [PATCH 01/14] graph: add packed graph design document

2018-01-26 Thread Derrick Stolee

On 1/25/2018 3:04 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:

Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/technical/packed-graph.txt | 185 +++
  1 file changed, 185 insertions(+)
  create mode 100644 Documentation/technical/packed-graph.txt

diff --git a/Documentation/technical/packed-graph.txt 
b/Documentation/technical/packed-graph.txt
new file mode 100644
index 00..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.

How did you come up with that specific number? (Is it platform dependent?)
I'd avoid a specific number to not derail the reader here into wondering
how this got measured.


Using a specific number was a mistake. Git can walk ~100K commits per 
second by parsing commits, in my tests on my machine. I'll instead say 
"commit count grows."



+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:

status needs the walk for the ahead/behind computation which is (1)?
(I forget how status would need to compute a merge-base)


'status' computes the ahead/behind counts using paint_down_to_common(). 
This is a more robust method than simply computing merge bases, but the 
possible merge bases are found as a result.



fetch is a networked command, which traditionally in Git is understood as
"can be slow" because you might be in Australia, or the connection is slow
otherwise. So giving this as an example it is not obvious that the DAG walking
is the bottleneck. Maybe git-merge or "git show --remerge-diff" [1] are better
examples for walk-intensive commands?

[1] https://public-inbox.org/git/cover.1409860234.git...@thomasrast.ch/
 never landed, so maybe that is a bad example. But for me that command
 is more obviously dependent on cheap walking the DAG compared to fetch.
 So, take my comments with a grain of salt!


Actually, a 'fetch' command does the same ahead/behind calculation as 
'status', and in GVFS repos we have seen that walk take 30s per branch 
when comparing local and remote copies of a fast-moving branch. Yes, there 
are other (usually) more expensive things in 'fetch' so I'll drop that 
reference.



+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().


This new format is specifically removing the cost of decompression and parsing
(1) completely, whereas (2) we still have to walk the entire graph for now as
the generation numbers are not fully used as of yet, but provided.


A major goal of this work is to provide a place to store computed 
generation numbers so we can avoid walking the entire graph. I mention this 
here because 'git log -' is O(n) (due to commit-date heuristics that 
prevent walking the entire graph) while 'git log --topo-order -' is 
O(T) where T is the total number of reachable commits.
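
One way stored generation numbers could cut a walk short, as a hedged sketch rather than anything in this series: a commit can only reach commits of strictly smaller generation, so a reachability walk can stop descending as soon as the generation drops to that of the target or below. The `gnode` struct and `can_reach()` are hypothetical.

```c
#include <stdint.h>

/* Toy commit with a precomputed generation number. */
struct gnode {
	int nr_parents;
	int parents[2];
	uint32_t generation;
};

/*
 * Returns 1 if "to" is reachable from "from" by following parents.
 * "seen" must point to one zeroed byte per node.
 */
static int can_reach(const struct gnode *nodes, int from, int to,
		     unsigned char *seen)
{
	int p;

	if (from == to)
		return 1;
	/* Ancestors always have a strictly smaller generation: prune. */
	if (nodes[from].generation <= nodes[to].generation)
		return 0;
	if (seen[from])
		return 0;
	seen[from] = 1;
	for (p = 0; p < nodes[from].nr_parents; p++)
		if (can_reach(nodes, nodes[from].parents[p], to, seen))
			return 1;
	return 0;
}
```

Without the generation check, a negative answer would require visiting everything reachable from the starting commit.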



+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.

Does this mean we decrease the pressure on fast lookups in
packfiles/loose objects?


Yes, we do. In fact, when profiling 'git log --topo-order -1000', I 
noticed that 30-50% of the time (after this patch) is spent in 
lookup_tree(). If we can prevent checking the ODB for the existence of 
these trees until they are needed, we can get additional speedups. It is 
a bit wasteful that we are loading these trees even when we will never 
use them (such as computing merge bases).





+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a roo

[PATCH] packfile: use get_be64() for large offsets

2018-01-17 Thread Derrick Stolee
The pack-index version 2 format uses two 4-byte integers in network-byte order 
to represent one 8-byte value. The current implementation has several code 
clones for stitching these integers together.

Use get_be64() to create an 8-byte integer from two 4-byte integers represented 
this way.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 pack-revindex.c | 6 ++
 packfile.c  | 3 +--
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/pack-revindex.c b/pack-revindex.c
index 1b7ebd8d7e..ff5f62c033 100644
--- a/pack-revindex.c
+++ b/pack-revindex.c
@@ -134,10 +134,8 @@ static void create_pack_revindex(struct packed_git *p)
if (!(off & 0x8000)) {
p->revindex[i].offset = off;
} else {
-   p->revindex[i].offset =
-   ((uint64_t)ntohl(*off_64++)) << 32;
-   p->revindex[i].offset |=
-   ntohl(*off_64++);
+   p->revindex[i].offset = get_be64(off_64);
+   off_64 += 2;
}
p->revindex[i].nr = i;
}
diff --git a/packfile.c b/packfile.c
index 4a5fe7ab18..228ed0d59a 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1702,8 +1702,7 @@ off_t nth_packed_object_offset(const struct packed_git 
*p, uint32_t n)
return off;
index += p->num_objects * 4 + (off & 0x7fff) * 8;
check_pack_index_ptr(p, index);
-   return (((uint64_t)ntohl(*((uint32_t *)(index + 0 << 32) |
-  ntohl(*((uint32_t *)(index + 4)));
+   return get_be64(index);
}
 }
 
-- 
2.15.0
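
For readers unfamiliar with the helper, this is roughly what reading a big-endian 64-bit value from two 4-byte halves amounts to. The `toy_` functions below are illustrative stand-ins written byte by byte, not Git's actual get_be64() implementation.

```c
#include <stdint.h>

/* Illustrative stand-in: read a 4-byte network-order integer. */
static uint32_t toy_get_be32(const void *ptr)
{
	const unsigned char *p = ptr;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Illustrative stand-in: stitch two consecutive 4-byte network-order
 * integers into one 8-byte value, high word first.
 */
static uint64_t toy_get_be64(const void *ptr)
{
	const unsigned char *p = ptr;

	return ((uint64_t)toy_get_be32(p) << 32) | toy_get_be32(p + 4);
}
```

Reading byte by byte also avoids any alignment assumptions about the pointer, unlike casting it to a uint32_t pointer.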



Re: [PATCH] sha1_file: remove static strbuf from sha1_file_name()

2018-01-16 Thread Derrick Stolee

On 1/16/2018 2:18 AM, Christian Couder wrote:

Using a static buffer in sha1_file_name() is error prone
and the performance improvements it gives are not needed
in most of the callers.

So let's get rid of this static buffer and, if necessary
or helpful, let's use one in the caller.


First: this is a good change for preventing bugs in the future. Do not 
let my next thought deter you from making this change.


Second: I wonder if there is any perf hit now that we are allocating 
buffers much more often. Also, how often does get_object_directory() 
change, so in some cases we could cache the buffer and only append the 
parts for the loose object (and not reallocate because the filenames 
will have equal length).


I'm concerned about the perf implications when inspecting many loose 
objects (100k+) but these code paths seem to be involved with more 
substantial work, such as opening and parsing the objects, so keeping a 
buffer in-memory is probably unnecessary.
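
To illustrate why the static buffer is error prone, here is a toy comparison. The functions are hypothetical and use a plain char array rather than struct strbuf: the static variant hands back the same storage on every call, so an earlier result is silently overwritten, while the caller-owned variant keeps results independent.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy version of the old interface: the result lives in a static buffer. */
static const char *toy_name_static(unsigned int id)
{
	static char buf[32];

	snprintf(buf, sizeof(buf), "objects/%02x", id & 0xff);
	return buf;
}

/* Toy version of the new interface: the caller provides the storage. */
static void toy_name_buf(char *buf, size_t len, unsigned int id)
{
	snprintf(buf, len, "objects/%02x", id & 0xff);
}
```

Calling toy_name_static() twice and keeping both pointers leaves both pointing at the second name; with toy_name_buf() each caller decides how long its buffer lives and whether to reuse it.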



---
  cache.h   |  8 +++-
  http-walker.c |  6 --
  http.c| 16 ++--
  sha1_file.c   | 38 +-
  4 files changed, 42 insertions(+), 26 deletions(-)

diff --git a/cache.h b/cache.h
index d8b975a571..6db565408e 100644
--- a/cache.h
+++ b/cache.h
@@ -957,12 +957,10 @@ extern void check_repository_format(void);
  #define TYPE_CHANGED0x0040
  
  /*

- * Return the name of the file in the local object database that would
- * be used to store a loose object with the specified sha1.  The
- * return value is a pointer to a statically allocated buffer that is
- * overwritten each time the function is called.
+ * Put in `buf` the name of the file in the local object database that
+ * would be used to store a loose object with the specified sha1.
   */
-extern const char *sha1_file_name(const unsigned char *sha1);
+extern void sha1_file_name(struct strbuf *buf, const unsigned char *sha1);
  
  /*

   * Return an abbreviated sha1 unique within this repository's object database.
diff --git a/http-walker.c b/http-walker.c
index 1ae8363de2..07c2b1af82 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -544,8 +544,10 @@ static int fetch_object(struct walker *walker, unsigned 
char *sha1)
} else if (hashcmp(obj_req->sha1, req->real_sha1)) {
ret = error("File %s has bad hash", hex);
} else if (req->rename < 0) {
-   ret = error("unable to write sha1 filename %s",
-   sha1_file_name(req->sha1));
+   struct strbuf buf = STRBUF_INIT;
+   sha1_file_name(&buf, req->sha1);
+   ret = error("unable to write sha1 filename %s", buf.buf);
+   strbuf_release(&buf);
}
  
  	release_http_object_request(req);

diff --git a/http.c b/http.c
index 5977712712..5979305bc9 100644
--- a/http.c
+++ b/http.c
@@ -2168,7 +2168,7 @@ struct http_object_request *new_http_object_request(const 
char *base_url,
unsigned char *sha1)
  {
char *hex = sha1_to_hex(sha1);
-   const char *filename;
+   struct strbuf filename = STRBUF_INIT;
char prevfile[PATH_MAX];
int prevlocal;
char prev_buf[PREV_BUF_SIZE];
@@ -2180,14 +2180,15 @@ struct http_object_request 
*new_http_object_request(const char *base_url,
hashcpy(freq->sha1, sha1);
freq->localfile = -1;
  
-	filename = sha1_file_name(sha1);

+   sha1_file_name(&filename, sha1);
snprintf(freq->tmpfile, sizeof(freq->tmpfile),
-"%s.temp", filename);
+"%s.temp", filename.buf);
  
-	snprintf(prevfile, sizeof(prevfile), "%s.prev", filename);

+   snprintf(prevfile, sizeof(prevfile), "%s.prev", filename.buf);
unlink_or_warn(prevfile);
rename(freq->tmpfile, prevfile);
unlink_or_warn(freq->tmpfile);
+   strbuf_release(&filename);
  
  	if (freq->localfile != -1)

error("fd leakage in start: %d", freq->localfile);
@@ -2302,6 +2303,7 @@ void process_http_object_request(struct 
http_object_request *freq)
  int finish_http_object_request(struct http_object_request *freq)
  {
struct stat st;
+   struct strbuf filename = STRBUF_INIT;
  
  	close(freq->localfile);

freq->localfile = -1;
@@ -2327,8 +2329,10 @@ int finish_http_object_request(struct 
http_object_request *freq)
unlink_or_warn(freq->tmpfile);
return -1;
}
-   freq->rename =
-   finalize_object_file(freq->tmpfile, sha1_file_name(freq->sha1));
+
+   sha1_file_name(&filename, freq->sha1);
+   freq->rename = finalize_object_file(freq->tmpfile, filename.buf);
+   strbuf_release(&filename);
  
  	return freq->rename;

  }
diff --git a/sha1_file.c b/sha1_file.c
index 3da70ac650..f66c21b2da 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -321,15 +321,11 @@ static void fill_sha1_path(struct strbuf *buf, const 
unsigned char *sha1)
}
  }
  
-const char *sha1_file_name(const unsigned char *sha1)

+void sha1_file_name(struct 

Re: [PATCH] describe: use strbuf_add_unique_abbrev() for adding short hashes

2018-01-16 Thread Derrick Stolee

On 1/15/2018 12:10 PM, René Scharfe wrote:

Call strbuf_add_unique_abbrev() to add an abbreviated hash to a strbuf
instead of taking a detour through find_unique_abbrev() and its static
buffer.  This is shorter and a bit more efficient.

Patch generated by Coccinelle (and contrib/coccinelle/strbuf.cocci).

Signed-off-by: Rene Scharfe 
---
The changed line was added by 4dbc59a4cc (builtin/describe.c: factor
out describe_commit).

"make coccicheck" doesn't propose any other changes for current master.

  builtin/describe.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/describe.c b/builtin/describe.c
index 3b0b204b1e..21e37f5dae 100644
--- a/builtin/describe.c
+++ b/builtin/describe.c
@@ -380,7 +380,7 @@ static void describe_commit(struct object_id *oid, struct 
strbuf *dst)
if (!match_cnt) {
struct object_id *cmit_oid = &cmit->object.oid;
if (always) {
-   strbuf_addstr(dst, find_unique_abbrev(cmit_oid->hash, 
abbrev));
+   strbuf_add_unique_abbrev(dst, cmit_oid->hash, abbrev);
if (suffix)
strbuf_addstr(dst, suffix);
return;


René,

Thanks for this cleanup! I just learned about strbuf_add_unique_abbrev() 
and like that it uses the reentrant find_unique_abbrev_r() instead.


Looks good to me.
-Stolee



Re: [PATCH 02/14] packed-graph: add core.graph setting

2018-01-25 Thread Derrick Stolee

On 1/25/2018 3:17 PM, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 6:02 AM, Derrick Stolee <sto...@gmail.com> wrote:

The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
  extern int fsync_object_files;
  extern int core_preload_index;
  extern int core_apply_sparse_checkout;
+extern int core_graph;

Putting it here instead of say the_repository makes sense as you'd want
to use this feature globally. However you can still have the config
different per repository  (e.g. version number of the graph setting,
as one might be optimized for speed and the other for file size of
the .graph file or such).

So not sure if we'd rather want to put this into the repository struct.
But then again the other core settings aren't there either and this
feature sounds like it is repository specific only in the experimental
phase; later it is expected to be on everywhere?


I do think that more things should go in the repository struct. 
Unfortunately, that is not the world we live in.


However, to make things clearer I'm following the pattern currently in 
master. You'll see the same with the global 'packed_graph' pointer, 
similar to 'packed_git'. I think these should be paired together until 
the repository absorbs them.


If other 'core_*' variables move to the repository, I'm happy to move 
core_graph.

If 'packed_git' moves to the repository, I'm happy to move 'packed_graph'.

However, if there is significant interest in moving all new state to the 
repository, then I'll move these values there. Let's have that 
discussion here instead of spread around the rest of the patch.


Thanks,
-Stolee



Re: [PATCH 00/14] Serialized Commit Graph

2018-01-25 Thread Derrick Stolee

On 1/25/2018 10:46 AM, Ævar Arnfjörð Bjarmason wrote:

On Thu, Jan 25 2018, Derrick Stolee jotted:


* 'git log --topo-order -1000' walks all reachable commits to avoid
   incorrect topological orders, but only needs the commit message for
   the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
   boundary between the commits reachable from A and those reachable
   from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
   compared to their upstream remote branches. This is essentially as
   hard as computing merge bases for each.

This is great, spotted / questions so far:

* git graph --blah says you need to enable the config, should say
   "unknown option --blah ". I.e. overzealous config guard.


This is a good point.


* On a big repo (git show-ref -s | ~/g/git/git-graph --write
   --update-head) is as of writing this still hanging for me, but strace
   shows it's brk()-ing. Presumably just still busy, a progress bar would
   be very nice.


Oops! This is my mistake. The correct command should be:

    git show-ref -s | git graph --write --update-head --stdin-commits

Without "--stdin-commits" the command will walk all packed objects
to look for commits and then build the graph. That's why it's taking
so long. That method takes several minutes on the Linux repo, but with
--stdin-commits it should take as long as "git log >/dev/null".


* Shouldn't there be a pack.useGraph option so this gets auto-updated on
   repack? I understand this series is a WIP, so that's more a "is that
   the UI" than "it needs now".


This will definitely be part of a follow-up patch.

Thanks,
-Stolee


[PATCH 00/14] Serialized Commit Graph

2018-01-25 Thread Derrick Stolee
As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.
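
As a rough illustration of what generation numbers buy, here is a minimal sketch using the definition from the design and format documents in this series: a root commit has generation number 1, and any other commit has one more than the largest generation number among its parents. The toy_commit struct, the two-parent limit, and the function names are assumptions made for the example; none of this is the code in the series.

```c
#include <assert.h>
#include <stddef.h>

#define NO_PARENT (-1)

/* Toy commit: parents are indexes into an array of commits. */
struct toy_commit {
	int parent[2];        /* NO_PARENT when the slot is unused */
	unsigned generation;  /* 0 means "not computed" */
};

/* Compute (and cache) the generation number of commit i. */
unsigned toy_generation(struct toy_commit *c, int i)
{
	unsigned max = 0;
	int p;

	assert(c != NULL);
	if (c[i].generation)
		return c[i].generation;
	for (p = 0; p < 2; p++) {
		int parent = c[i].parent[p];
		if (parent != NO_PARENT) {
			unsigned g = toy_generation(c, parent);
			if (g > max)
				max = g;
		}
	}
	c[i].generation = max + 1;
	return c[i].generation;
}

/*
 * Return 1 if 'to' is reachable from 'from' by following parents.
 * A commit cannot reach a different commit whose generation number
 * is greater than or equal to its own, so those walks stop at once.
 */
int toy_can_reach(struct toy_commit *c, int from, int to)
{
	int p;

	if (from == to)
		return 1;
	if (toy_generation(c, from) <= toy_generation(c, to))
		return 0;
	for (p = 0; p < 2; p++) {
		int parent = c[from].parent[p];
		if (parent != NO_PARENT && toy_can_reach(c, parent, to))
			return 1;
	}
	return 0;
}
```

The recursive walk has no visited set and is only meant for tiny graphs; a real walk would use a priority queue ordered by generation number, as the design document describes.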

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

git config core.graph true
git show-ref -s | git graph --write --update-head

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.graph' setting.

[1] 
https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

P.S. I'm sending this patch from my gmail address to avoid Outlook
munging the URLs included in the design document.

Derrick Stolee (14):
  graph: add packed graph design document
  packed-graph: add core.graph setting
  packed-graph: create git-graph builtin
  packed-graph: add format document
  packed-graph: implement construct_graph()
  packed-graph: implement git-graph --write
  packed-graph: implement git-graph --read
  graph: implement git-graph --update-head
  packed-graph: implement git-graph --clear
  packed-graph: teach git-graph --delete-expired
  commit: integrate packed graph with commit parsing
  packed-graph: read only from specific pack-indexes
  packed-graph: close under reachability
  packed-graph: teach git-graph to read commits

 Documentation/config.txt |   3 +
 Documentation/git-graph.txt  | 102 
 Documentation/technical/graph-format.txt |  88 
 Documentation/technical/packed-graph.txt | 185 +++
 Makefile |   2 +
 alloc.c  |   1 +
 builtin.h|   1 +
 builtin/graph.c  | 231 +
 cache.h  |   1 +
 command-list.txt |   1 +
 commit.c |  20 +-
 commit.h |   2 +
 config.c |   5 +
 environment.c|   1 +
 git.c|   1 +
 log-tree.c   |   3 +-
 packed-graph.c   | 840 +++
 packed-graph.h   |  65 +++
 packfile.c   |   4 +-
 packfile.h   |   2 +
 t/t5319-graph.sh | 271 ++
 21 files changed, 1822 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/git-graph.txt
 create mode 100644 Documentation/technical/graph-format.txt
 create mode 100644 Documentation/technical/packed-graph.txt
 create mode 100644 builtin/graph.c
 create mode 100644 packed-graph.c
 create mode 100644 packed-graph.h
 create mode 100755 t/t5319-graph.sh

[PATCH 04/14] packed-graph: add format document

2018-01-25 Thread Derrick Stolee
Add document specifying the binary format for packed graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/graph-format.txt | 88 
 1 file changed, 88 insertions(+)
 create mode 100644 Documentation/technical/graph-format.txt

diff --git a/Documentation/technical/graph-format.txt b/Documentation/technical/graph-format.txt
new file mode 100644
index 0000000000..a15e1036d7
--- /dev/null
+++ b/Documentation/technical/graph-format.txt
@@ -0,0 +1,88 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; a commit with parents has generation number
+  one more than the maximum generation number of its parents. Zero
+  is reserved as a special value that marks a generation number as
+  invalid or "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+   4-byte signature:
+   The signature is: {'C', 'G', 'P', 'H'}
+
+   1-byte version number:
+   Currently, the only valid version is 1.
+
+   1-byte Object Id Version (1 = SHA-1)
+
+   1-byte Object Id Length (H)
+
+   1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+   (C + 1) * 12 bytes listing the table of contents for the chunks:
+   First 4 bytes describe chunk id. Value 0 is a terminating label.
+   Other 8 bytes provide offset in current file for chunk to start.
+   (Chunks are ordered contiguously in the file, so you can infer
+   the length using the next chunk position if necessary.)
+
+   The remaining data in the body is described one chunk at a time, and
+   these chunks may be given in any order. Chunks are required unless
+   otherwise specified.
+
+CHUNK DATA:
+
+   OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+   The ith entry, F[i], stores the number of OIDs with first
+   byte at most i. Thus F[255] stores the total
+   number of commits (N).
+
+   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+   The OIDs for all commits in the graph.
+
+   Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+   * The first H bytes are for the OID of the root tree.
+   * The next 8 bytes are for the int-ids of the first two parents of
+ the ith commit. Stores value 0x if no parent in that position.
+ If there are more than two parents, the second value has its most-
+ significant bit on and the other bits store an offset into the Large
+ Edge List chunk.
+   * The next 8 bytes store the generation number of the commit and the
+ commit time in seconds since EPOCH. The generation number uses the
+ higher 30 bits of the first 4 bytes, while the commit time uses the
+ 32 bits of the second 4 bytes, along with the lowest 2 bits of the
+ lowest byte, storing the 33rd and 34th bit of the commit time.
+
+   [Optional] Large Edge List (ID: {'E', 'D', 'G', 'E'})
+   This list of 4-byte values stores the second through nth parents for
+   all octopus merges. The second parent value in the commit data is a
+   negative number pointing into this list. Then iterate through this
+   list starting at that position until reaching a value with the most-
+   significant bit on. The other bits correspond to the int-id of the
+   last parent.
+
+TRAILER:
+
+   H-byte HASH-checksum of all of the above.
-- 
2.16.0
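
The packing of the generation number and commit time in the Commit Data chunk is easy to misread, so a small decoding sketch may help. It assumes one reading of the text above: the first 4-byte word holds the generation number in its upper 30 bits and bits 33 and 34 of the commit time in its lowest 2 bits, and the second 4-byte word holds the low 32 bits of the commit time, both in network order. The function names are hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

/* Read a 4-byte value stored in network (big-endian) order. */
uint32_t toy_get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the 8 bytes that follow the two parent int-ids in a
 * Commit Data entry, under the reading described in the lead-in.
 */
void toy_decode_gen_date(const unsigned char *p,
			 uint32_t *generation, uint64_t *commit_time)
{
	uint32_t first = toy_get_be32(p);
	uint32_t second = toy_get_be32(p + 4);

	assert(generation && commit_time);
	/* Upper 30 bits of the first word: generation number. */
	*generation = first >> 2;
	/* Low 2 bits of the first word extend the 32-bit time to 34 bits. */
	*commit_time = ((uint64_t)(first & 0x3) << 32) | second;
}
```

For example, the bytes 00 00 00 16 00 00 00 01 decode to generation number 5 and a commit time of 2^33 + 1 seconds, since 0x16 is binary 10110: the upper bits give 5 and the low two bits give 2.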



[PATCH 02/14] packed-graph: add core.graph setting

2018-01-25 Thread Derrick Stolee
The packed graph feature is controlled by the new core.graph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.graph is that a user can always stop checking
for or parsing packed graph files if core.graph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..e7b98fa14f 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.graph::
+   Enable git commit graph feature. Allows writing and reading from .graph files.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d8b975a571..655a81ac90 100644
--- a/cache.h
+++ b/cache.h
@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_graph;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index e617c2018d..fee90912d8 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.graph")) {
+   core_graph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..0c56a3d869 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+int core_graph;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
-- 
2.16.0



[PATCH 01/14] graph: add packed graph design document

2018-01-25 Thread Derrick Stolee
Add Documentation/technical/packed-graph.txt with details of the planned
packed graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/packed-graph.txt | 185 +++
 1 file changed, 185 insertions(+)
 create mode 100644 Documentation/technical/packed-graph.txt

diff --git a/Documentation/technical/packed-graph.txt b/Documentation/technical/packed-graph.txt
new file mode 100644
index 0000000000..fcc0c83874
--- /dev/null
+++ b/Documentation/technical/packed-graph.txt
@@ -0,0 +1,185 @@
+Git Packed Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows above 100K.
+The merge base calculation shows up in many user-facing commands, such
+as 'status' and 'fetch' and can take minutes to compute depending on
+data shape. There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The packed graph is a file that stores the commit graph structure along
+with some extra metadata to speed up graph walks. This format allows a
+consumer to load the following info for a commit:
+
+1. The commit OID.
+2. The list of parents.
+3. The commit date.
+4. The root tree OID.
+5. An integer ID for fast lookups in the graph.
+6. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+By providing an integer ID we can avoid lookups in the graph as we walk
+commits. Specifically, we need to provide the integer ID of the parent
+commits so we navigate directly to their information on request.
+
+Define the "generation number" of a commit recursively as follows:
+ * A commit with no parents (a root commit) has generation number 1.
+ * A commit with at least one parent has generation number 1 more than
+   the largest generation number among its parents.
+Equivalently, the generation number is one more than the length of a
+longest path from the commit to a root commit. The recursive definition
+is easier to use for computation and the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file OID is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-.graph' file in the same pack
+  directory.
+
+- The core.graph config setting must be on to create or consume graph files.
+
+- The graph file is only a supplemental structure. If a user downgrades
+  or disables the 'core.graph' config setting, then the existing ODB is
+  sufficient.
+
+- The file format includes parameters for the object id length
+  and hash algorithm, so a future change of hash algorithm does
+  not require a change in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer ID to
+  seek into the single graph file. It is possible to extend the model
+  for multiple graph files, but that is currently not part of the design.
+
+- .graph files are managed only by the 'graph' builtin. These are not
+  updated automatically during clone or fetch.
+
+- There is no '--verify' option for the 'graph' builtin to verify the
+  contents of the graph file.
+
+- The graph only considers commits existing in packfiles and does not
+  walk to fill in reachable commits. [Small]
+
+- When rewriting the graph, we do not check for a commit still existing
+  in

[PATCH 03/14] packed-graph: create git-graph builtin

2018-01-25 Thread Derrick Stolee
Teach Git the 'graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a check that the core.graph setting is enabled
and a '--pack-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt |  7 +++
 Makefile|  1 +
 builtin.h   |  1 +
 builtin/graph.c | 36 
 command-list.txt|  1 +
 git.c   |  1 +
 6 files changed, 47 insertions(+)
 create mode 100644 Documentation/git-graph.txt
 create mode 100644 builtin/graph.c

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
new file mode 100644
index 0000000000..de5a3c07e6
--- /dev/null
+++ b/Documentation/git-graph.txt
@@ -0,0 +1,7 @@
+git-graph(1)
+============
+
+NAME
+----
+git-graph - Write and verify Git commit graphs (.graph files)
+
diff --git a/Makefile b/Makefile
index 1a9b23b679..d8b0d0457a 100644
--- a/Makefile
+++ b/Makefile
@@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o
 BUILTIN_OBJS += builtin/fsck.o
 BUILTIN_OBJS += builtin/gc.o
 BUILTIN_OBJS += builtin/get-tar-commit-id.o
+BUILTIN_OBJS += builtin/graph.o
 BUILTIN_OBJS += builtin/grep.o
 BUILTIN_OBJS += builtin/hash-object.o
 BUILTIN_OBJS += builtin/help.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..ae7e816908 100644
--- a/builtin.h
+++ b/builtin.h
@@ -168,6 +168,7 @@ extern int cmd_format_patch(int argc, const char **argv, const char *prefix);
 extern int cmd_fsck(int argc, const char **argv, const char *prefix);
 extern int cmd_gc(int argc, const char **argv, const char *prefix);
 extern int cmd_get_tar_commit_id(int argc, const char **argv, const char *prefix);
+extern int cmd_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_grep(int argc, const char **argv, const char *prefix);
 extern int cmd_hash_object(int argc, const char **argv, const char *prefix);
 extern int cmd_help(int argc, const char **argv, const char *prefix);
diff --git a/builtin/graph.c b/builtin/graph.c
new file mode 100644
index 0000000000..a902dc8646
--- /dev/null
+++ b/builtin/graph.c
@@ -0,0 +1,36 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "packfile.h"
+#include "parse-options.h"
+
+static char const * const builtin_graph_usage[] ={
+   N_("git graph [--pack-dir ]"),
+   NULL
+};
+
+static struct opts_graph {
+   const char *pack_dir;
+} opts;
+
+int cmd_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_graph_options[] = {
+   { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+   N_("dir"),
+   N_("The pack directory to store the graph") },
+   OPT_END(),
+   };
+
+   if (!core_graph)
+   die("core.graph is false");
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_graph_usage, builtin_graph_options);
+
+   return 0;
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..d9c17cb9f8 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -61,6 +61,7 @@ git-format-patchmainporcelain
 git-fsckancillaryinterrogators
 git-gc  mainporcelain
 git-get-tar-commit-id   ancillaryinterrogators
+git-graph   plumbingmanipulators
 git-grepmainporcelain   info
 git-gui mainporcelain
 git-hash-object plumbingmanipulators
diff --git a/git.c b/git.c
index c870b9719c..29f8b6e7dd 100644
--- a/git.c
+++ b/git.c
@@ -408,6 +408,7 @@ static struct cmd_struct commands[] = {
{ "fsck-objects", cmd_fsck, RUN_SETUP },
{ "gc", cmd_gc, RUN_SETUP },
{ "get-tar-commit-id", cmd_get_tar_commit_id },
+   { "graph", cmd_graph, RUN_SETUP_GENTLY },
{ "grep", cmd_grep, RUN_SETUP_GENTLY },
{ "hash-object", cmd_hash_object },
{ "help", cmd_help },
-- 
2.16.0



[PATCH 09/14] packed-graph: implement git-graph --clear

2018-01-25 Thread Derrick Stolee
Teach Git to delete the current 'graph-head' file and the packed graph
it references. This is a good safety valve if somehow the file is
corrupted and needs to be recalculated. Since the packed graph is a
summary of contents already in the ODB, it can be regenerated.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt | 16 ++--
 builtin/graph.c | 31 ++-
 t/t5319-graph.sh|  7 ++-
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index ac20aa67a9..f690699570 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -11,6 +11,7 @@ SYNOPSIS
 [verse]
 'git graph' --write  [--pack-dir ]
 'git graph' --read  [--pack-dir ]
+'git graph' --clear [--pack-dir ]
 
 OPTIONS
 -------
@@ -18,16 +19,21 @@ OPTIONS
Use given directory for the location of packfiles, graph-head,
and graph files.
 
+--clear::
+   Delete the graph-head file and the graph file it references.
+   (Cannot be combined with --read or --write.)
+
 --read::
Read a graph file given by the graph-head file and output basic
-   details about the graph file. (Cannot be combined with --write.)
+   details about the graph file. (Cannot be combined with --clear
+   or --write.)
 
 --graph-id::
When used with --read, consider the graph file graph-.graph.
 
 --write::
Write a new graph file to the pack directory. (Cannot be combined
-   with --read.)
+   with --clear or --read.)
 
 --update-head::
When used with --write, update the graph-head file to point to
@@ -61,6 +67,12 @@ $ git graph --write --update-head
 $ git graph --read --graph-id=
 
 
+* Delete /graph-head and the file it references.
++
+
+$ git graph --clear --pack-dir=
+
+
 CONFIGURATION
 -------------
 
diff --git a/builtin/graph.c b/builtin/graph.c
index 0760d99f43..ac15febc46 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -10,6 +10,7 @@
 
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
+   N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
N_("git graph --write [--pack-dir ] [--update-head]"),
NULL
@@ -17,6 +18,7 @@ static char const * const builtin_graph_usage[] ={
 
 static struct opts_graph {
const char *pack_dir;
+   int clear;
int read;
const char *graph_id;
int write;
@@ -25,6 +27,29 @@ static struct opts_graph {
struct object_id old_graph_oid;
 } opts;
 
+static int graph_clear(void)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   char *old_path;
+
+   if (!opts.has_existing)
+   return 0;
+
+   strbuf_addstr(&head_path, opts.pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "graph-head");
+   if (remove_path(head_path.buf))
+   die("failed to remove path %s", head_path.buf);
+   strbuf_release(&head_path);
+
+   old_path = get_graph_filename_oid(opts.pack_dir, &opts.old_graph_oid);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+   free(old_path);
+
+   return 0;
+}
+
 static int graph_read(void)
 {
struct object_id graph_oid;
@@ -105,6 +130,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('c', "clear", &opts.clear,
+   N_("clear graph file and graph-head")),
OPT_BOOL('r', "read", &opts.read,
N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
@@ -129,7 +156,7 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
 builtin_graph_options,
 builtin_graph_usage, 0);
 
-   if (opts.write + opts.read > 1)
+   if (opts.write + opts.read + opts.clear > 1)
usage_with_options(builtin_graph_usage, builtin_graph_options);
 
if (!opts.pack_dir) {
@@ -141,6 +168,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
 
opts.has_existing = !!get_graph_head_oid(opts.pack_dir, &opts.old_graph_oid);
 
+   if (opts.clear)
+   return graph_clear();
if (opts.read)
return graph_read();
if (opts.write)
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
index 3919a3ad73..311fb9dd67 100755
--- a/t/t5319-graph.sh
+++ b/t/t5319-graph.sh
@@ -80,6 +80,11 @@ t

[PATCH 08/14] graph: implement git-graph --update-head

2018-01-25 Thread Derrick Stolee
It is possible to have multiple packed graph files in a pack directory,
but only one is important at a time. Use a 'graph-head' file to point
to the important file. Teach git-graph to write 'graph-head' upon
writing a new packed graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt | 38 --
 builtin/graph.c | 38 +++---
 packed-graph.c  | 25 +
 packed-graph.h  |  1 +
 t/t5319-graph.sh| 12 ++--
 5 files changed, 107 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index 0939c3f1be..ac20aa67a9 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -12,19 +12,53 @@ SYNOPSIS
 'git graph' --write  [--pack-dir ]
 'git graph' --read  [--pack-dir ]
 
+OPTIONS
+-------
+--pack-dir::
+   Use given directory for the location of packfiles, graph-head,
+   and graph files.
+
+--read::
+   Read a graph file given by the graph-head file and output basic
+   details about the graph file. (Cannot be combined with --write.)
+
+--graph-id::
+   When used with --read, consider the graph file graph-.graph.
+
+--write::
+   Write a new graph file to the pack directory. (Cannot be combined
+   with --read.)
+
+--update-head::
+   When used with --write, update the graph-head file to point to
+   the written graph file.
+
 EXAMPLES
 
 
+* Output the OID of the graph file pointed to by /graph-head.
++
+
+$ git graph --pack-dir=
+
+
 * Write a graph file for the packed commits in your local .git folder.
 +
 
-$ git midx --write
+$ git graph --write
+
+
+* Write a graph file for the packed commits in your local .git folder,
+* and update graph-head.
++
+
+$ git graph --write --update-head
 
 
 * Read basic information from a graph file.
 +
 
-$ git midx --read --graph-id=
+$ git graph --read --graph-id=
 
 
 CONFIGURATION
diff --git a/builtin/graph.c b/builtin/graph.c
index bc66722924..0760d99f43 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -11,7 +11,7 @@
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ]"),
+   N_("git graph --write [--pack-dir ] [--update-head]"),
NULL
 };
 
@@ -20,6 +20,9 @@ static struct opts_graph {
int read;
const char *graph_id;
int write;
+   int update_head;
+   int has_existing;
+   struct object_id old_graph_oid;
 } opts;
 
 static int graph_read(void)
@@ -30,8 +33,8 @@ static int graph_read(void)
 
if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ)
get_oid_hex(opts.graph_id, &graph_oid);
-   else
-   die("no graph id specified");
+   else if (!get_graph_head_oid(opts.pack_dir, &graph_oid))
+   die("no graph-head exists.");
 
graph_file = get_graph_filename_oid(opts.pack_dir, &graph_oid);
graph = load_packed_graph_one(graph_file, opts.pack_dir);
@@ -62,10 +65,33 @@ static int graph_read(void)
return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id *graph_id)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   int fd;
+   struct lock_file lk = LOCK_INIT;
+
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "graph-head");
+
+   fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR);
+   strbuf_release(&head_path);
+
+   if (fd < 0)
+   die_errno("unable to open graph-head");
+
+   write_in_full(fd, oid_to_hex(graph_id), GIT_MAX_HEXSZ);
+   commit_lock_file(&lk);
+}
+
 static int graph_write(void)
 {
struct object_id *graph_id = construct_graph(opts.pack_dir);
 
+   if (opts.update_head)
+   update_head_file(opts.pack_dir, graph_id);
+
if (graph_id)
printf("%s\n", oid_to_hex(graph_id));
 
@@ -83,6 +109,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
N_("write graph file")),
+   OPT_BOOL('u', "update-head", &opts.update_head,
+   N_("update graph-head to written graph file")),

[PATCH 06/14] packed-graph: implement git-graph --write

2018-01-25 Thread Derrick Stolee
Teach git-graph to write graph files. Create a new test script to verify
that this command succeeds.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt | 26 ++
 builtin/graph.c | 37 ++--
 t/t5319-graph.sh| 83 +
 3 files changed, 143 insertions(+), 3 deletions(-)
 create mode 100755 t/t5319-graph.sh

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index de5a3c07e6..be6bc38814 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -5,3 +5,29 @@ NAME
 
 git-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+--------
+[verse]
+'git graph' --write  [--pack-dir ]
+
+EXAMPLES
+--------
+
+* Write a graph file for the packed commits in your local .git folder.
++
+
+$ git midx --write
+
+
+CONFIGURATION
+-------------
+
+core.graph::
+   The graph command will fail if core.graph is false.
+   Also, the written graph files will be ignored by other commands
+   unless core.graph is true.
+
+GIT
+---
+Part of the linkgit:git[1] suite
\ No newline at end of file
diff --git a/builtin/graph.c b/builtin/graph.c
index a902dc8646..09f5552338 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -6,31 +6,62 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "parse-options.h"
+#include "packed-graph.h"
 
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
+   N_("git graph --write [--pack-dir ]"),
NULL
 };
 
 static struct opts_graph {
const char *pack_dir;
+   int write;
 } opts;
 
+static int graph_write(void)
+{
+   struct object_id *graph_id = construct_graph(opts.pack_dir);
+
+   if (graph_id)
+   printf("%s\n", oid_to_hex(graph_id));
+
+   free(graph_id);
+   return 0;
+}
+
 int cmd_graph(int argc, const char **argv, const char *prefix)
 {
static struct option builtin_graph_options[] = {
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('w', "write", &opts.write,
+   N_("write graph file")),
OPT_END(),
};
 
-   if (!core_graph)
-   die("core.graph is false");
-
if (argc == 2 && !strcmp(argv[1], "-h"))
usage_with_options(builtin_graph_usage, builtin_graph_options);
 
+   git_config(git_default_config, NULL);
+   if (!core_graph)
+   die("git-graph requires core.graph=true.");
+
+   argc = parse_options(argc, argv, prefix,
+builtin_graph_options,
+builtin_graph_usage, 0);
+
+   if (!opts.pack_dir) {
+   struct strbuf path = STRBUF_INIT;
strbuf_addstr(&path, get_object_directory());
strbuf_addstr(&path, "/pack");
opts.pack_dir = strbuf_detach(&path, NULL);
+   }
+
+   if (opts.write)
+   return graph_write();
+
return 0;
 }
 
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
new file mode 100755
index 0000000000..52e979dfd3
--- /dev/null
+++ b/t/t5319-graph.sh
@@ -0,0 +1,83 @@
+#!/bin/sh
+
+test_description='packed graph'
+. ./test-lib.sh
+
+test_expect_success \
+'setup full repo' \
+'rm -rf .git &&
+ mkdir full &&
+ cd full &&
+ git init &&
+ git config core.graph true &&
+ git config pack.threads 1 &&
+ packdir=".git/objects/pack"'
+
+test_expect_success \
+'write graph with no packs' \
+'git graph --write --pack-dir .'
+
+test_expect_success \
+'create commits and repack' \
+'for i in $(test_seq 5)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success \
+'write graph' \
+'graph1=$(git graph --write) &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph'
+
+test_expect_success \
+'Add more commits' \
+'git reset --hard commits/3 &&
+ for i in $(test_seq 6 10)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git reset --hard commits/7 &&
+ for i in $(test_seq 11 15)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i"

[PATCH 14/14] packed-graph: teach git-graph to read commits

2018-01-25 Thread Derrick Stolee
Teach git-graph to read commits from stdin when the --stdin-commits
flag is specified. Commits reachable from these commits are added to
the graph. This is a much faster way to construct the graph than
inspecting all packed objects, but is restricted to known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).
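
As a side note for readers of this archive (not part of the patch): each
line read for --stdin-commits is expected to be a full hex object id. The
sketch below shows one way such a line could be validated and converted to
raw bytes. The names parse_hex_oid, hex_value, OID_RAWSZ and OID_HEXSZ are
hypothetical and chosen only for this illustration; Git itself has its own
helper, get_oid_hex(), which the series uses elsewhere.

```c
#include <stddef.h>
#include <string.h>

#define OID_RAWSZ 20               /* SHA-1 length in bytes (assumed) */
#define OID_HEXSZ (2 * OID_RAWSZ)  /* the same id written as hex digits */

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	return -1;
}

/*
 * Parse exactly OID_HEXSZ hex digits from `line` into `out`.
 * Returns 0 on success and -1 if the line has the wrong length
 * or contains a character that is not a hex digit.
 */
static int parse_hex_oid(const char *line, unsigned char out[OID_RAWSZ])
{
	size_t i;

	if (strlen(line) != OID_HEXSZ)
		return -1;

	for (i = 0; i < OID_RAWSZ; i++) {
		int hi = hex_value(line[2 * i]);
		int lo = hex_value(line[2 * i + 1]);

		if (hi < 0 || lo < 0)
			return -1;
		out[i] = (unsigned char)((hi << 4) | lo);
	}
	return 0;
}
```

Rejecting malformed lines up front keeps a bad tip from silently shrinking
the set of commits the walk starts from.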

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/graph.c  | 33 +
 packed-graph.c   | 18 +++---
 packed-graph.h   |  3 ++-
 t/t5319-graph.sh | 18 ++
 4 files changed, 60 insertions(+), 12 deletions(-)

diff --git a/builtin/graph.c b/builtin/graph.c
index 3cace3a18c..708889677b 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"),
+   N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -25,6 +25,7 @@ static struct opts_graph {
int update_head;
int delete_expired;
int stdin_packs;
+   int stdin_commits;
int has_existing;
struct object_id old_graph_oid;
 } opts;
@@ -116,22 +117,36 @@ static int graph_write(void)
 {
struct object_id *graph_id;
char **pack_indexes = NULL;
+   char **commits = NULL;
int num_packs = 0;
-   int size_packs = 0;
+   int num_commits = 0;
+   char **lines = NULL;
+   int num_lines = 0;
+   int size_lines = 0;
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
-   size_packs = 128;
-   ALLOC_ARRAY(pack_indexes, size_packs);
+   size_lines = 128;
+   ALLOC_ARRAY(lines, size_lines);
 
while (strbuf_getline(&buf, stdin) != EOF) {
-   ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
-   pack_indexes[num_packs++] = buf.buf;
+   ALLOC_GROW(lines, num_lines + 1, size_lines);
+   lines[num_lines++] = buf.buf;
strbuf_detach(&buf, NULL);
}
+
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+   num_packs = num_lines;
+   }
+   if (opts.stdin_commits) {
+   commits = lines;
+   num_commits = num_lines;
+   }
}
 
-   graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs);
+   graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs,
+  commits, num_commits);
 
if (opts.update_head)
update_head_file(opts.pack_dir, graph_id);
@@ -170,6 +185,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
N_("delete expired head graph file")),
OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
N_("only scan packfiles listed by stdin")),
+   OPT_BOOL('C', "stdin-commits", &opts.stdin_commits,
+   N_("start walk at commits listed by stdin")),
{ OPTION_STRING, 'G', "graph-id", &opts.graph_id,
N_("oid"),
N_("An OID for a specific graph file in the pack-dir."),
diff --git a/packed-graph.c b/packed-graph.c
index c93515f18e..94e1a97000 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -662,7 +662,8 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
-struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs)
+struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs,
+ char **commit_hex, int nr_commits)
 {
// Find a list of oids, adding the pointer to a list.
struct packed_oid_list oids;
@@ -719,10 +720,21 @@ struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int
for_each_object_in_pack(p, if_packed_commit_add_to_list, &oids);
close_pack(p);
}
-   } else {
-   for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
}
 
+   if (commit_hex) {
+   for (i = 0; i < nr_commits; i++) {
+   const char *end;
+   ALLOC_GROW(oids.list, oids.num + 1, oids.size);
+   

[PATCH 05/14] packed-graph: implement construct_graph()

2018-01-25 Thread Derrick Stolee
Teach Git to write a packed graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.
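
As a side note for readers of this archive (not part of the patch): the same
commit can appear in more than one packfile, so the collected object ids
have to be sorted and de-duplicated before the graph is written. A minimal
sketch of that step follows. struct oid, oid_compare and sort_and_dedup are
hypothetical names for this illustration only; the series itself sorts with
QSORT() and counts distinct entries inside construct_graph().

```c
#include <stdlib.h>
#include <string.h>

#define OID_RAWSZ 20 /* SHA-1 length in bytes (assumed) */

struct oid {
	unsigned char hash[OID_RAWSZ];
};

/* qsort() comparator: order ids by their raw bytes. */
static int oid_compare(const void *a, const void *b)
{
	const struct oid *x = a;
	const struct oid *y = b;

	return memcmp(x->hash, y->hash, OID_RAWSZ);
}

/*
 * Sort `list` in place and squeeze out duplicates.
 * Returns the number of distinct ids left at the front of the array.
 */
static size_t sort_and_dedup(struct oid *list, size_t nr)
{
	size_t i, kept;

	if (!nr)
		return 0;

	qsort(list, nr, sizeof(*list), oid_compare);

	kept = 1;
	for (i = 1; i < nr; i++) {
		/* Keep an entry only if it differs from the last one kept. */
		if (memcmp(list[i].hash, list[kept - 1].hash, OID_RAWSZ))
			list[kept++] = list[i];
	}
	return kept;
}
```

Sorting first means duplicates are always adjacent, so a single linear pass
is enough to drop them, and the result is already in the order the OID
lookup table needs.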

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |   1 +
 packed-graph.c | 375 +
 packed-graph.h |  20 +++
 3 files changed, 396 insertions(+)
 create mode 100644 packed-graph.c
 create mode 100644 packed-graph.h

diff --git a/Makefile b/Makefile
index d8b0d0457a..59439e13a1 100644
--- a/Makefile
+++ b/Makefile
@@ -841,6 +841,7 @@ LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
 LIB_OBJS += oidmap.o
 LIB_OBJS += oidset.o
+LIB_OBJS += packed-graph.o
 LIB_OBJS += packfile.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-bitmap-write.o
diff --git a/packed-graph.c b/packed-graph.c
new file mode 100644
index 0000000000..9be9811667
--- /dev/null
+++ b/packed-graph.c
@@ -0,0 +1,375 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "packed-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 20
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_LARGE_EDGES_NEEDED 0x80000000
+#define GRAPH_PARENT_MISSING 0x7fffffff
+#define GRAPH_EDGE_LAST_MASK 0x7fffffff
+#define GRAPH_PARENT_NONE 0x70000000
+
+#define GRAPH_LAST_EDGE 0x80000000
+
+char* get_graph_filename_oid(const char *pack_dir,
+ struct object_id *oid)
+{
+   size_t len;
+   struct strbuf head_path = STRBUF_INIT;
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/graph-");
+   strbuf_addstr(&head_path, oid_to_hex(oid));
+   strbuf_addstr(&head_path, ".graph");
+
+   return strbuf_detach(&head_path, &len);
+}
+
+static void write_graph_chunk_fanout(
+   struct sha1file *f,
+   struct commit **commits, int nr_commits)
+{
+   uint32_t i, count = 0;
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+
+   /*
+   * Write the first-level table (the list is sorted,
+   * but we use a 256-entry lookup to be able to avoid
+   * having to do eight extra binary search iterations).
+   */
+   for (i = 0; i < 256; i++) {
+   uint32_t swap_count;
+
+   while (list < last) {
+   if ((*list)->object.oid.hash[0] != i)
+   break;
+   count++;
+   list++;
+   }
+
+   swap_count = htonl(count);
+   sha1write(f, &swap_count, 4);
+   }
+}
+
+static void write_graph_chunk_oids(
+   struct sha1file *f, int hash_len,
+   struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   uint32_t i;
+   for (i = 0; i < nr_commits; i++) {
+   sha1write(f, (*list)->object.oid.hash, (int)hash_len);
+   list++;
+   }
+}
+
+static int commit_pos(struct commit **commits, int nr_commits, const struct object_id *oid, uint32_t *pos)
+{
+   uint32_t first = 0, last = nr_commits;
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   struct object_id *current;
+   int cmp;
+
+   current = &(commits[mid]->object.oid);
+   cmp = oidcmp(oid, current);
+   if (!cmp) {
+   *pos = mid;
+   return 1;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   *pos = first;
+   return 0;
+}
+
+static void write_graph_chunk_data(
+   struct sha1file *f, int hash_len,
+   struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+   uint32_t num_large_edges = 0;
+
+   while (list < last) {
+   struct commit_list *parent;
+   uint32_t intId, swapIntId;
+   uint32_t packedDate[2];
+
+   parse_commit(*list);
+   sha1write(f, (*list)->tree->object.oid.hash, hash_len);
+
+   parent = (*list)->parents;
+
+   if (!parent)
+   sw

[PATCH 07/14] packed-graph: implement git-graph --read

2018-01-25 Thread Derrick Stolee
Teach git-graph to read packed graph files and summarize their contents.

Use the --read option to verify the contents of a graph file in the
graph tests.
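
As a side note for readers of this archive (not part of the patch): the
"header:" line printed by --read corresponds to the first eight bytes of the
file, a 4-byte signature in network order followed by four single-byte
fields. The sketch below decodes such a header from a memory buffer.
struct graph_header_fields, read_be32 and parse_graph_header are
hypothetical names used only here; the series reads the header through its
own struct in packed-graph.h.

```c
#include <stddef.h>
#include <stdint.h>

#define GRAPH_SIGNATURE 0x43475048u /* "CGPH" */

struct graph_header_fields {
	uint32_t signature;
	uint8_t version;
	uint8_t hash_version;
	uint8_t hash_len;
	uint8_t num_chunks;
};

/* Read a 4-byte big-endian (network order) value. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Decode the 8-byte header at the start of `buf`.
 * Returns 0 on success, -1 if the buffer is too short or the
 * signature does not match.
 */
static int parse_graph_header(const unsigned char *buf, size_t len,
			      struct graph_header_fields *hdr)
{
	if (len < 8)
		return -1;

	hdr->signature = read_be32(buf);
	if (hdr->signature != GRAPH_SIGNATURE)
		return -1;

	hdr->version = buf[4];
	hdr->hash_version = buf[5];
	hdr->hash_len = buf[6];
	hdr->num_chunks = buf[7];
	return 0;
}
```

Decoding byte by byte avoids any dependence on the host's endianness or on
struct padding.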

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt |   7 +++
 builtin/graph.c |  54 
 packed-graph.c  | 147 +++-
 packed-graph.h  |  25 
 t/t5319-graph.sh|  50 +--
 5 files changed, 260 insertions(+), 23 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index be6bc38814..0939c3f1be 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git graph' --write  [--pack-dir ]
+'git graph' --read  [--pack-dir ]
 
 EXAMPLES
 
@@ -20,6 +21,12 @@ EXAMPLES
 $ git midx --write
 
 
+* Read basic information from a graph file.
++
+
+$ git midx --read --graph-id=
+
+
 CONFIGURATION
 -
 
diff --git a/builtin/graph.c b/builtin/graph.c
index 09f5552338..bc66722924 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -10,15 +10,58 @@
 
 static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
+   N_("git graph --read [--graph-id=]"),
N_("git graph --write [--pack-dir ]"),
NULL
 };
 
 static struct opts_graph {
const char *pack_dir;
+   int read;
+   const char *graph_id;
int write;
 } opts;
 
+static int graph_read(void)
+{
+   struct object_id graph_oid;
+   struct packed_graph *graph = 0;
+   const char *graph_file;
+
+   if (opts.graph_id && strlen(opts.graph_id) == GIT_MAX_HEXSZ)
+   get_oid_hex(opts.graph_id, &graph_oid);
+   else
+   die("no graph id specified");
+
+   graph_file = get_graph_filename_oid(opts.pack_dir, &graph_oid);
+   graph = load_packed_graph_one(graph_file, opts.pack_dir);
+
+   if (!graph)
+   die("graph file %s does not exist.\n", graph_file);
+
+   printf("header: %08x %02x %02x %02x %02x\n",
+   ntohl(graph->hdr->graph_signature),
+   graph->hdr->graph_version,
+   graph->hdr->hash_version,
+   graph->hdr->hash_len,
+   graph->hdr->num_chunks);
+   printf("num_commits: %u\n", graph->num_commits);
+   printf("chunks:");
+
+   if (graph->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (graph->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (graph->chunk_commit_data)
+   printf(" commit_metadata");
+   if (graph->chunk_large_edges)
+   printf(" large_edges");
+   printf("\n");
+
+   printf("pack_dir: %s\n", graph->pack_dir);
+   return 0;
+}
+
 static int graph_write(void)
 {
struct object_id *graph_id = construct_graph(opts.pack_dir);
@@ -36,8 +79,14 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('r', "read", &opts.read,
+   N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
N_("write graph file")),
+   { OPTION_STRING, 'M', "graph-id", &opts.graph_id,
+   N_("oid"),
+   N_("An OID for a specific graph file in the pack-dir."),
+   PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
OPT_END(),
};
 
@@ -52,6 +101,9 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
 builtin_graph_options,
 builtin_graph_usage, 0);
 
+   if (opts.write + opts.read > 1)
+   usage_with_options(builtin_graph_usage, builtin_graph_options);
+
if (!opts.pack_dir) {
struct strbuf path = STRBUF_INIT;
strbuf_addstr(&path, get_object_directory());
@@ -59,6 +111,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
opts.pack_dir = strbuf_detach(&path, NULL);
}
 
+   if (opts.read)
+   return graph_read();
if (opts.write)
return graph_write();
 
diff --git a/packed-graph.c b/packed-graph.c
index 9be9811667..eaa656becb 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -30,6 +30,11 @@
 
 #define GRAPH_LAST_EDGE 0x80000000
 
+#define GRAP

[PATCH 12/14] packed-graph: read only from specific pack-indexes

2018-01-25 Thread Derrick Stolee
Teach git-graph to inspect the objects only in a certain list of
pack-indexes within the given pack directory. This allows updating
the graph iteratively, since we add all commits stored in a previous
packed graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt | 12 
 builtin/graph.c | 26 +++---
 packed-graph.c  | 27 +++
 packed-graph.h  |  2 +-
 packfile.c  |  4 ++--
 packfile.h  |  2 ++
 t/t5319-graph.sh| 10 ++
 7 files changed, 69 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index f4f1048d28..b68a61ddea 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -43,6 +43,11 @@ OPTIONS
When used with --write and --update-head, delete the graph file
previously referenced by graph-head.
 
+--stdin-packs::
+   When used with --write, generate the new graph by walking objects
+   only in the specified packfiles and any commits in the
+   existing graph-head.
+
 EXAMPLES
 
 
@@ -65,6 +70,13 @@ $ git graph --write
 $ git graph --write --update-head --delete-expired
 
 
+* Write a graph file, extending the current graph file using commits
+* in , update graph-head, and delete the old graph-.graph file.
++
+
+$ echo  | git graph --write --update-head --delete-expired --stdin-packs
+
+
 * Read basic information from a graph file.
 +
 
diff --git a/builtin/graph.c b/builtin/graph.c
index adf779b601..3cace3a18c 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired]"),
+   N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"),
NULL
 };
 
@@ -24,6 +24,7 @@ static struct opts_graph {
int write;
int update_head;
int delete_expired;
+   int stdin_packs;
int has_existing;
struct object_id old_graph_oid;
 } opts;
@@ -113,7 +114,24 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph
 
 static int graph_write(void)
 {
-   struct object_id *graph_id = construct_graph(opts.pack_dir);
+   struct object_id *graph_id;
+   char **pack_indexes = NULL;
+   int num_packs = 0;
+   int size_packs = 0;
+
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   size_packs = 128;
+   ALLOC_ARRAY(pack_indexes, size_packs);
+
+   while (strbuf_getline(&buf, stdin) != EOF) {
+   ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
+   pack_indexes[num_packs++] = buf.buf;
+   strbuf_detach(&buf, NULL);
+   }
+   }
+
+   graph_id = construct_graph(opts.pack_dir, pack_indexes, num_packs);
 
if (opts.update_head)
update_head_file(opts.pack_dir, graph_id);
@@ -150,7 +168,9 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
N_("update graph-head to written graph file")),
OPT_BOOL('d', "delete-expired", &opts.delete_expired,
N_("delete expired head graph file")),
-   { OPTION_STRING, 'M', "graph-id", &opts.graph_id,
+   OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
+   N_("only scan packfiles listed by stdin")),
+   { OPTION_STRING, 'G', "graph-id", &opts.graph_id,
N_("oid"),
N_("An OID for a specific graph file in the pack-dir."),
PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
diff --git a/packed-graph.c b/packed-graph.c
index 343b231973..0dc68a077e 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -401,7 +401,7 @@ static int fill_packed_commit(struct commit *item, struct packed_graph *g, uint3
  *  2. date
  *  3. parents.
  *
- * Returns 1 iff the commit was found in the packed graph.
+ * Returns 1 if and only if the commit was found in the packed graph.
  *
  * See parse_commit_buffer() for the fallback after this call.
  */
@@ -427,7 +427,7 @@ int parse_packed_commit(struct commit *item)
return fill_packed_commit(item, packed_graph, pos);
}
 
-   return parse_commit_internal(item, 0, 0);
+   return 0;
 }
 
 static void write_graph_chunk

[PATCH 13/14] packed-graph: close under reachability

2018-01-25 Thread Derrick Stolee
Teach construct_graph() to walk all parents from the commits discovered in
packfiles. This prevents gaps given by loose objects or previously-missed
packfiles.
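
As a side note for readers of this archive (not part of the patch): "closing
under reachability" means that whenever a commit is in the set, all of its
ancestors are too. The sketch below shows the idea on a toy graph where
commits are array indexes with at most two parents. It uses an explicit
stack instead of Git's revision walk, which is what the patch itself relies
on; struct toy_commit, NO_PARENT and close_reachable_toy are hypothetical
names for this illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

#define NO_PARENT (-1)

/* A toy commit: up to two parents, given as indexes into the array. */
struct toy_commit {
	int parent[2];
};

/*
 * Extend the set flagged in `in_set` so that it contains every ancestor
 * of every commit already in it. Returns the size of the closed set,
 * or 0 if `nr` is 0 or the work stack cannot be allocated.
 */
static size_t close_reachable_toy(const struct toy_commit *commits,
				  size_t nr, unsigned char *in_set)
{
	size_t *stack;
	size_t top = 0, count = 0, i;

	if (!nr)
		return 0;
	stack = malloc(nr * sizeof(*stack));
	if (!stack)
		return 0;

	/* Seed the walk with the commits we already know about. */
	for (i = 0; i < nr; i++) {
		if (in_set[i]) {
			stack[top++] = i;
			count++;
		}
	}

	/* Each commit is flagged before it is pushed, so it is pushed once. */
	while (top) {
		size_t cur = stack[--top];
		int k;

		for (k = 0; k < 2; k++) {
			int p = commits[cur].parent[k];

			if (p == NO_PARENT || in_set[p])
				continue;
			in_set[p] = 1;
			stack[top++] = (size_t)p;
			count++;
		}
	}

	free(stack);
	return count;
}
```

In the test case this mirrors, a commit found only in a new packfile pulls
in its ancestors even when they live in loose objects or in packs that were
not scanned.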

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 packed-graph.c   | 26 ++
 t/t5319-graph.sh | 14 ++
 2 files changed, 40 insertions(+)

diff --git a/packed-graph.c b/packed-graph.c
index 0dc68a077e..c93515f18e 100644
--- a/packed-graph.c
+++ b/packed-graph.c
@@ -5,6 +5,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "revision.h"
 #include "packed-graph.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
@@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct rev_info revs;
+   struct commit *commit;
+   init_revisions(&revs, NULL);
+
+   for (i = 0; i < oids->num; i++) {
+   commit = lookup_commit(oids->list[i]);
+   if (commit && !parse_commit(commit))
+   revs.commits = commit_list_insert(commit, &revs.commits);
+   }
+
+   if (prepare_revision_walk(&revs))
+   die(_("revision walk setup failed"));
+
+   while ((commit = get_revision(&revs)) != NULL) {
+   ALLOC_GROW(oids->list, oids->num + 1, oids->size);
+   oids->list[oids->num] = &(commit->object.oid);
+   (oids->num)++;
+   }
+}
+
 struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int nr_packs)
 {
// Find a list of oids, adding the pointer to a list.
@@ -698,6 +722,8 @@ struct object_id *construct_graph(const char *pack_dir, char **pack_indexes, int
} else {
for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
}
+
+   close_reachable(&oids);
QSORT(oids.list, oids.num, commit_compare);
 
count_distinct = 1;
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
index 969150cd21..8bf5a0c993 100755
--- a/t/t5319-graph.sh
+++ b/t/t5319-graph.sh
@@ -212,6 +212,20 @@ test_expect_success 'clear graph' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
+test_expect_success 'build graph from latest pack with closure' \
+'graph5=$(cat new-idx | git graph --write --update-head --stdin-packs) &&
+ test_path_is_file ${packdir}/graph-${graph5}.graph &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph &&
+ test_path_is_file ${packdir}/graph-head &&
+ echo ${graph5} >expect &&
+ cmp -n 40 expect ${packdir}/graph-head &&
+ git graph --read --graph-id=${graph5} >output &&
+ _graph_read_expect "21" "${packdir}" &&
+ cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
 'cd .. &&
  git clone --bare full bare &&
-- 
2.16.0



[PATCH 11/14] commit: integrate packed graph with commit parsing

2018-01-25 Thread Derrick Stolee
Teach Git to inspect a packed graph to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date. The only loosely-expected
condition is that the commit buffer is loaded into the cache. This
was checked in log-tree.c:show_log(), but the "return;" on failure
produced unexpected results (i.e. the message line was never terminated).
The new behavior of loading the buffer when needed prevents the
unexpected behavior.

If core.graph is false, then do not load the graph and behave as usual.
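
As a side note for readers of this archive (not part of the patch): the
control flow described above is "use the earlier parse if there is one, else
try the graph, else fall back to the commit buffer". The toy model below
captures only that ordering. It does not touch any Git data structure, and
struct toy_commit, enum parse_source and toy_parse_commit are hypothetical
names for this illustration only.

```c
/* Where the commit's fields came from in this toy model. */
enum parse_source {
	FROM_EARLIER_PARSE, /* commit was already parsed, nothing to do */
	FROM_GRAPH,         /* fields filled in from the graph file */
	FROM_BUFFER         /* fell back to reading the commit object */
};

struct toy_commit {
	int parsed;   /* non-zero once the commit has been parsed */
	int in_graph; /* non-zero if the graph file lists this commit */
};

/*
 * Mirror the order of checks: already parsed, then the graph
 * (only when core_graph is enabled), then the commit buffer.
 */
static enum parse_source toy_parse_commit(struct toy_commit *c, int core_graph)
{
	if (c->parsed)
		return FROM_EARLIER_PARSE;

	if (core_graph && c->in_graph) {
		c->parsed = 1;
		return FROM_GRAPH;
	}

	c->parsed = 1;
	return FROM_BUFFER;
}
```

With core_graph set to 0 the graph branch is never taken, which matches the
"behave as usual" wording above.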

In test script t5319-graph.sh, add output-matching conditions on read-
only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 704,766
reachable commits and is behind 'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c  |   1 +
 commit.c |  20 -
 commit.h |   2 +
 log-tree.c   |   3 +-
 packed-graph.c   | 242 +++
 packed-graph.h   |  18 +
 t/t5319-graph.sh | 114 --
 7 files changed, 387 insertions(+), 13 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadfacd..4a4dcfa2b7 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
+   c->graphId = 0xFFFFFFFF;
return c;
 }
 
diff --git a/commit.c b/commit.c
index cab8d4455b..253c102808 100644
--- a/commit.c
+++ b/commit.c
@@ -12,6 +12,7 @@
 #include "prio-queue.h"
 #include "sha1-lookup.h"
 #include "wt-status.h"
+#include "packed-graph.h"
 
 static struct commit_extra_header *read_commit_extra_header_lines(const char *buf, size_t len, const char **);
 
@@ -374,7 +375,7 @@ int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed)
 {
enum object_type type;
void *buffer;
@@ -383,19 +384,27 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
 
if (!item)
return -1;
+
+   // If we already parsed, but got it from the graph, then keep going!
if (item->object.parsed)
return 0;
+
+   if (check_packed && parse_packed_commit(item))
+   return 0;
+
buffer = read_sha1_file(item->object.oid.hash, &type, &size);
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-oid_to_hex(&item->object.oid));
+   oid_to_hex(&item->object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-oid_to_hex(&item->object.oid));
+   oid_to_hex(&item->object.oid));
}
+
ret = parse_commit_buffer(item, buffer, size);
+
if (save_commit_buffer && !ret) {
set_commit_buffer(item, buffer, size);
return 0;
@@ -404,6 +413,11 @@ int parse_commit_gently(struct commit *item, int quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index 8c68ca1a5a..02f5f2a182 100644
--- a/commit.h
+++ b/commit.h
@@ -21,6 +21,7 @@ struct commit {
timestamp_t date;
struct commit_list *parents;
struct tree *tree;
+   uint32_t graphId;
 };
 
 extern int save_commit_buffer;
@@ -60,6 +61,7 @@ struct commit *lookup_commit_reference_by_name(const char *name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char *ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long size);
+extern int parse_commit_internal(struct commit *item, int quiet_on_missing, int check_packed);
 int parse_commit_gently(struct com

[PATCH 10/14] packed-graph: teach git-graph --delete-expired

2018-01-25 Thread Derrick Stolee
Teach git-graph to delete the graph previously referenced by 'graph-head'
when writing a new graph file and updating 'graph-head'. This prevents
stale graph files from accumulating in the pack directory. Be careful not
to delete the graph if the file did not change.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-graph.txt |  8 ++--
 builtin/graph.c | 14 +-
 t/t5319-graph.sh| 37 +++--
 3 files changed, 54 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-graph.txt b/Documentation/git-graph.txt
index f690699570..f4f1048d28 100644
--- a/Documentation/git-graph.txt
+++ b/Documentation/git-graph.txt
@@ -39,6 +39,10 @@ OPTIONS
When used with --write, update the graph-head file to point to
the written graph file.
 
+--delete-expired::
+   When used with --write and --update-head, delete the graph file
+   previously referenced by graph-head.
+
 EXAMPLES
 
 
@@ -55,10 +59,10 @@ $ git graph --write
 
 
 * Write a graph file for the packed commits in your local .git folder,
-* and update graph-head.
+* update graph-head, and delete the old graph-.graph file.
 +
 
-$ git graph --write --update-head
+$ git graph --write --update-head --delete-expired
 
 
 * Read basic information from a graph file.
diff --git a/builtin/graph.c b/builtin/graph.c
index ac15febc46..adf779b601 100644
--- a/builtin/graph.c
+++ b/builtin/graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_graph_usage[] ={
N_("git graph [--pack-dir ]"),
N_("git graph --clear [--pack-dir ]"),
N_("git graph --read [--graph-id=]"),
-   N_("git graph --write [--pack-dir ] [--update-head]"),
+   N_("git graph --write [--pack-dir ] [--update-head] [--delete-expired]"),
NULL
 };
 
@@ -23,6 +23,7 @@ static struct opts_graph {
const char *graph_id;
int write;
int update_head;
+   int delete_expired;
int has_existing;
struct object_id old_graph_oid;
 } opts;
@@ -120,6 +121,15 @@ static int graph_write(void)
if (graph_id)
printf("%s\n", oid_to_hex(graph_id));
 
+   if (opts.delete_expired && opts.update_head && opts.has_existing &&
+   oidcmp(graph_id, &opts.old_graph_oid)) {
+   char *old_path = get_graph_filename_oid(opts.pack_dir, &opts.old_graph_oid);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+
+   free(old_path);
+   }
+
free(graph_id);
return 0;
 }
@@ -138,6 +148,8 @@ int cmd_graph(int argc, const char **argv, const char *prefix)
N_("write graph file")),
OPT_BOOL('u', "update-head", &opts.update_head,
N_("update graph-head to written graph file")),
+   OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+   N_("delete expired head graph file")),
{ OPTION_STRING, 'M', "graph-id", &opts.graph_id,
N_("oid"),
N_("An OID for a specific graph file in the pack-dir."),
diff --git a/t/t5319-graph.sh b/t/t5319-graph.sh
index 311fb9dd67..a70c7bbb02 100755
--- a/t/t5319-graph.sh
+++ b/t/t5319-graph.sh
@@ -80,9 +80,42 @@ test_expect_success 'write graph with merges' \
  _graph_read_expect "18" "${packdir}" &&
  cmp expect output'
 
+test_expect_success 'Add more commits' \
+'git reset --hard commits/3 &&
+ for i in $(test_seq 16 20)
+ do
+git commit --allow-empty -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success 'write graph with merges' \
+'graph3=$(git graph --write --update-head --delete-expired) &&
+ test_path_is_file ${packdir}/graph-${graph3}.graph &&
+ test_path_is_missing ${packdir}/graph-${graph2}.graph &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph &&
+ test_path_is_file ${packdir}/graph-head &&
+ echo ${graph3} >expect &&
+ cmp -n 40 expect ${packdir}/graph-head &&
+ git graph --read --graph-id=${graph3} >output &&
+ _graph_read_expect "23" "${packdir}" &&
+ cmp expect output'
+
+test_expect_success 'write graph with nothing new' \
+'graph4=$(git graph --write --update-head --delete-expired) &&
+ test_path_is_file ${packdir}/graph-${graph4}.graph &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph &&
+ test_path_

[PATCH v2 01/14] commit-graph: add format document

2018-01-30 Thread Derrick Stolee
Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.
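
As a side note for readers of this archive (not part of the patch): the
256-entry fanout narrows a lookup to the OIDs that share a first byte, and a
binary search then finishes the job inside that range. The sketch below
builds a fanout for a sorted in-memory table and uses it for lookups. It
keeps the fanout in host byte order for simplicity, whereas the file stores
it in network order. struct oid, build_fanout and lookup_oid are
hypothetical names for this illustration only.

```c
#include <stdint.h>
#include <string.h>

#define OID_RAWSZ 20 /* SHA-1 length in bytes (assumed) */

struct oid {
	unsigned char hash[OID_RAWSZ];
};

/*
 * Build the fanout for a sorted table of `nr` ids:
 * fanout[i] is the number of ids whose first byte is at most i.
 */
static void build_fanout(const struct oid *oids, uint32_t nr,
			 uint32_t fanout[256])
{
	uint32_t pos = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (pos < nr && oids[pos].hash[0] == b)
			pos++;
		fanout[b] = pos;
	}
}

/*
 * Find `target` in the sorted table. Returns 1 and stores its position
 * in *pos if present, otherwise returns 0 and leaves *pos untouched.
 */
static int lookup_oid(const struct oid *oids, const uint32_t fanout[256],
		      const struct oid *target, uint32_t *pos)
{
	unsigned char first = target->hash[0];
	uint32_t lo = first ? fanout[first - 1] : 0;
	uint32_t hi = fanout[first];

	/* Binary search restricted to ids that share the first byte. */
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(target->hash, oids[mid].hash, OID_RAWSZ);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}
```

Because the last entry counts every id, fanout[255] doubles as the total
number of commits in the table.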

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)
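
As a side note for readers of this archive (not part of the patch): the
trade-off above can be seen in a small decoder. With two inline parent
slots, roots, ordinary commits and two-parent merges are decoded without
touching a second table; only octopus merges follow the indirection. The
sentinel value and the list-termination rule below are assumptions made for
this sketch, and PARENT_NONE, LARGE_EDGE_FLAG, LAST_EDGE_FLAG,
EDGE_VALUE_MASK and decode_parents are hypothetical names, not the patch's
definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define PARENT_NONE     0xffffffffu /* assumed "no parent" sentinel */
#define LARGE_EDGE_FLAG 0x80000000u /* second slot points into the edge list */
#define LAST_EDGE_FLAG  0x80000000u /* assumed: marks the final list entry */
#define EDGE_VALUE_MASK 0x7fffffffu

/*
 * Decode the parent positions of one commit into `out` (at most
 * `max_out` entries are written). Returns the number written.
 */
static size_t decode_parents(uint32_t first, uint32_t second,
			     const uint32_t *large_edges,
			     uint32_t *out, size_t max_out)
{
	size_t n = 0;
	uint32_t i;

	if (first == PARENT_NONE)
		return 0; /* root commit */
	if (n < max_out)
		out[n++] = first;

	if (second == PARENT_NONE)
		return n; /* single parent */

	if (!(second & LARGE_EDGE_FLAG)) {
		if (n < max_out)
			out[n++] = second; /* ordinary two-parent merge */
		return n;
	}

	/* Octopus merge: remaining parents live in the large edge list. */
	for (i = second & EDGE_VALUE_MASK; ; i++) {
		uint32_t edge = large_edges[i];

		if (n < max_out)
			out[n++] = edge & EDGE_VALUE_MASK;
		if (edge & LAST_EDGE_FLAG)
			break;
	}
	return n;
}
```

Only the last branch reads from large_edges, which is the extra level of
indirection the message says octopus merges pay for.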

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 89 +
 1 file changed, 89 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 0000000000..8a987c7aa9
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,89 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of their parents. We
+  reserve zero as special; it can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+  The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+  Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte Object Id Length (H)
+
+  1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide offset in current file for chunk to start.
+  (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'})
+  This list of 4-byte values stores the second through nth parents for
+  all octopus merges. The second parent value in the commit data is a
+  negative number pointing into this list. Then iterate through this
+  list starting at that position until reaching a value with the most-
+  significant bit on. The other bits correspond to the int-id of the
+  last parent. This chunk should always be present, but may be empty.
+
+TRAILER:
+
+   H-byte HASH-checksum of all of the above.
-- 
2.16.0.15.g9c3cf44.dirty

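As a reading aid for the HEADER and CHUNK LOOKUP sections of the format document, here is a minimal sketch of parsing the 8-byte header and sizing the chunk table. The struct and function names (`graph_header_view`, `parse_graph_header`, `chunk_lookup_size`) are hypothetical and are not taken from the patch series; only the byte layout comes from the document.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the 8-byte header described above. */
struct graph_header_view {
	uint8_t version;     /* the only valid value is currently 1 */
	uint8_t oid_version; /* 1 = SHA-1 */
	uint8_t oid_len;     /* H */
	uint8_t num_chunks;  /* C */
};

/* Returns 0 on success, -1 if the buffer is too short or malformed. */
static int parse_graph_header(const unsigned char *buf, size_t len,
			      struct graph_header_view *out)
{
	static const unsigned char signature[4] = { 'C', 'G', 'P', 'H' };

	if (len < 8 || memcmp(buf, signature, 4))
		return -1;
	out->version = buf[4];
	out->oid_version = buf[5];
	out->oid_len = buf[6];
	out->num_chunks = buf[7];
	if (out->version != 1)
		return -1;
	return 0;
}

/* The chunk lookup holds (C + 1) entries of 12 bytes each. */
static size_t chunk_lookup_size(const struct graph_header_view *hdr)
{
	return ((size_t)hdr->num_chunks + 1) * 12;
}
```

With four chunks, the lookup table occupies 60 bytes, and the last 12-byte entry carries the terminating chunk id 0.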

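The packing of generation number and commit time in the Commit Data chunk can be sketched as follows. This assumes the "lowest 2 bits" mentioned in the document are the low two bits of the first 4-byte word; the helper names (`get_be32_sketch`, `decode_gen_and_date`) are hypothetical.

```c
#include <stdint.h>

/* Read a 4-byte value stored in network (big-endian) order. */
static uint32_t get_be32_sketch(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

struct gen_and_date {
	uint32_t generation;  /* upper 30 bits of the first word */
	uint64_t commit_time; /* 34-bit value, seconds since the epoch */
};

/* `p` points at the last 8 bytes of one Commit Data entry. */
static struct gen_and_date decode_gen_and_date(const unsigned char *p)
{
	uint32_t first = get_be32_sketch(p);
	uint32_t second = get_be32_sketch(p + 4);
	struct gen_and_date out;

	out.generation = first >> 2;
	/* Assumption: the low 2 bits of the first word are bits 33 and 34. */
	out.commit_time = ((uint64_t)(first & 0x3) << 32) | second;
	return out;
}
```

Under this reading, commit times fit in 34 bits, so dates past the 32-bit limit remain representable.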

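The walk over the Large Edge List described in the format document might look like the sketch below. It assumes the list has already been converted from network order into host-order `uint32_t` values; `read_large_edges` and `EDGE_LAST_MASK` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* The most-significant bit marks the final parent of an octopus merge. */
#define EDGE_LAST_MASK 0x80000000u

/*
 * Collect parent int-ids starting at `pos` in the edge list.
 * Returns the number of parents written to `out`, or -1 if the list
 * ends before a terminating entry or `out` is too small.
 */
static int read_large_edges(const uint32_t *edges, size_t nr_edges,
			    size_t pos, uint32_t *out, size_t out_cap)
{
	size_t n = 0;

	for (; pos < nr_edges; pos++) {
		if (n == out_cap)
			return -1;
		out[n++] = edges[pos] & ~EDGE_LAST_MASK;
		if (edges[pos] & EDGE_LAST_MASK)
			return (int)n;
	}
	return -1;
}
```

A caller would first strip the most-significant bit from the second parent value in the commit data to obtain `pos`.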
[PATCH v2 11/14] commit: integrate commit graph with commit parsing

2018-01-30 Thread Derrick Stolee
Teach Git to inspect a commit graph file to supply the contents of a
struct commit when calling parse_commit_gently(). This implementation
satisfies all post-conditions on the struct commit, including loading
parents, the root tree, and the commit date. The only loosely-expected
condition is that the commit buffer is loaded into the cache. This
was checked in log-tree.c:show_log(), but the "return;" on failure
produced unexpected results (i.e. the message line was never terminated).
The new behavior of loading the buffer when needed prevents the
unexpected behavior.

If core.commitgraph is false, then do not check graph files.

In test script t5319-commit-graph.sh, add output-matching conditions on
read-only graph operations.

By loading commits from the graph instead of parsing commit buffers, we
save a lot of time on long commit walks. Here are some performance
results for a copy of the Linux repository where 'master' has 704,766
reachable commits and is behind 'origin/master' by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c |   1 +
 commit-graph.c  | 237 
 commit-graph.h  |  20 +++-
 commit.c|  10 +-
 commit.h|   4 +
 log-tree.c  |   3 +-
 t/t5318-commit-graph.sh |  47 ++
 7 files changed, 318 insertions(+), 4 deletions(-)

diff --git a/alloc.c b/alloc.c
index 12afadfacd..cf4f8b61e1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -93,6 +93,7 @@ void *alloc_commit_node(void)
struct commit *c = alloc_node(&commit_state, sizeof(struct commit));
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
+   c->graph_pos = COMMIT_NOT_FROM_GRAPH;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 764e016ddb..fc816533c6 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -35,6 +35,9 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
GRAPH_OID_LEN + sizeof(struct commit_graph_header))
 
+/* global storage */
+struct commit_graph *commit_graph = 0;
+
 struct object_id *get_graph_head_hash(const char *pack_dir, struct object_id 
*hash)
 {
struct strbuf head_filename = STRBUF_INIT;
@@ -209,6 +212,220 @@ struct commit_graph *load_commit_graph_one(const char 
*graph_file, const char *p
return graph;
 }
 
+static void prepare_commit_graph_one(const char *obj_dir)
+{
+   char *graph_file;
+   struct object_id oid;
+   struct strbuf pack_dir = STRBUF_INIT;
+   strbuf_addstr(&pack_dir, obj_dir);
+   strbuf_add(&pack_dir, "/pack", 5);
+
+   if (!get_graph_head_hash(pack_dir.buf, &oid))
+   return;
+
+   graph_file = get_commit_graph_filename_hash(pack_dir.buf, &oid);
+
+   commit_graph = load_commit_graph_one(graph_file, pack_dir.buf);
+   strbuf_release(&pack_dir);
+}
+
+static int prepare_commit_graph_run_once = 0;
+void prepare_commit_graph(void)
+{
+   struct alternate_object_database *alt;
+   char *obj_dir;
+
+   if (prepare_commit_graph_run_once)
+   return;
+   prepare_commit_graph_run_once = 1;
+
+   obj_dir = get_object_directory();
+   prepare_commit_graph_one(obj_dir);
+   prepare_alt_odb();
+   for (alt = alt_odb_list; !commit_graph && alt; alt = alt->next)
+   prepare_commit_graph_one(alt->path);
+}
+
+static int bsearch_graph(struct commit_graph *g, struct object_id *oid, 
uint32_t *pos)
+{
+   uint32_t last, first = 0;
+
+   if (oid->hash[0])
+   first = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * 
(oid->hash[0] - 1)));
+   last = ntohl(*(uint32_t*)(g->chunk_oid_fanout + 4 * oid->hash[0]));
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   const unsigned char *current;
+   int cmp;
+
+   current = g->chunk_oid_lookup + g->hdr->hash_len * mid;
+   cmp = hashcmp(oid->hash, current);
+   if (!cmp) {
+   *pos = mid;
+   return 1;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   *pos = first;
+   return 0;
+}
+
+struct object_id *get_nth_commit_oid(struct commit_graph *g,
+uint32_t n,
+struct object_id *oid)
+{
+   

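The fanout-bounded binary search that bsearch_graph() performs in this patch can be sketched in a self-contained form. This is an illustration and not the patch's code: it takes the fanout as host-order counts and plain byte arrays instead of `struct commit_graph`, and `fanout_bsearch` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * fanout[i] is the number of OIDs whose first byte is at most i.
 * oids holds the OIDs back to back, sorted ascending, oid_len bytes each.
 * Returns 1 and sets *pos when `want` is found; otherwise returns 0 and
 * sets *pos to the position where `want` would be inserted.
 */
static int fanout_bsearch(const uint32_t fanout[256],
			  const unsigned char *oids, size_t oid_len,
			  const unsigned char *want, uint32_t *pos)
{
	/* The fanout limits the search to OIDs sharing the first byte. */
	uint32_t first = want[0] ? fanout[want[0] - 1] : 0;
	uint32_t last = fanout[want[0]];

	while (first < last) {
		uint32_t mid = first + (last - first) / 2;
		int cmp = memcmp(want, oids + (size_t)mid * oid_len, oid_len);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp > 0)
			first = mid + 1;
		else
			last = mid;
	}

	*pos = first;
	return 0;
}
```

Because the fanout narrows the range before any comparison, each lookup only compares against OIDs that share the first byte of the one being sought.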
[PATCH v2 06/14] commit-graph: implement git-commit-graph --read

2018-01-30 Thread Derrick Stolee
Teach git-commit-graph to read commit graph files and summarize their contents.

Use the --read option to verify the contents of a commit graph file in the
tests.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |   7 ++
 builtin/commit-graph.c |  55 +++
 commit-graph.c | 138 -
 commit-graph.h |  25 +++
 t/t5318-commit-graph.sh|  28 ++--
 5 files changed, 247 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 3f3790d9a8..09aeaf6c82 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -10,6 +10,7 @@ SYNOPSIS
 
 [verse]
 'git commit-graph' --write  [--pack-dir ]
+'git commit-graph' --read  [--pack-dir ]
 
 EXAMPLES
 
@@ -20,6 +21,12 @@ EXAMPLES
 $ git commit-graph --write
 
 
+* Read basic information from a graph file.
++
+
+$ git commit-graph --read --graph-hash=
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 7affd512f1..218740b1f8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,15 +10,58 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
+   N_("git commit-graph --read [--graph-hash=]"),
N_("git commit-graph --write [--pack-dir ]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *pack_dir;
+   int read;
+   const char *graph_hash;
int write;
 } opts;
 
+static int graph_read(void)
+{
+   struct object_id graph_hash;
+   struct commit_graph *graph = 0;
+   const char *graph_file;
+
+   if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
+   get_oid_hex(opts.graph_hash, &graph_hash);
+   else
+   die("no graph hash specified");
+
+   graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
+   graph = load_commit_graph_one(graph_file, opts.pack_dir);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_file);
+
+   printf("header: %08x %02x %02x %02x %02x\n",
+   ntohl(graph->hdr->graph_signature),
+   graph->hdr->graph_version,
+   graph->hdr->hash_version,
+   graph->hdr->hash_len,
+   graph->hdr->num_chunks);
+   printf("num_commits: %u\n", graph->num_commits);
+   printf("chunks:");
+
+   if (graph->chunk_oid_fanout)
+   printf(" oid_fanout");
+   if (graph->chunk_oid_lookup)
+   printf(" oid_lookup");
+   if (graph->chunk_commit_data)
+   printf(" commit_metadata");
+   if (graph->chunk_large_edges)
+   printf(" large_edges");
+   printf("\n");
+
+   printf("pack_dir: %s\n", graph->pack_dir);
+   return 0;
+}
+
 static int graph_write(void)
 {
struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
@@ -36,8 +79,14 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('r', "read", &opts.read,
+   N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
N_("write commit graph file")),
+   { OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
+   N_("hash"),
+   N_("A hash for a specific graph file in the pack-dir."),
+   PARSE_OPT_OPTARG, NULL, (intptr_t) "" },
OPT_END(),
};
 
@@ -49,6 +98,10 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
 builtin_commit_graph_options,
 builtin_commit_graph_usage, 0);
 
+   if (opts.write + opts.read > 1)
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
if (!opts.pack_dir) {
struct strbuf path = STRBUF_INIT;
strbuf_addstr(&path, get_object_directory());
@@ -56,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char 
*prefix)
opts.pack_dir = strbuf_detach(&path, NULL);
}
 
+   if (opts.read)
+ 

[PATCH v2 08/14] commit-graph: implement git-commit-graph --clear

2018-01-30 Thread Derrick Stolee
Teach Git to delete the current 'graph_head' file and the commit graph
it references. This is a good safety valve if somehow the file is
corrupted and needs to be recalculated. Since the commit graph is a
summary of contents already in the ODB, it can be regenerated.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 16 ++--
 builtin/commit-graph.c | 32 +++-
 t/t5318-commit-graph.sh|  7 ++-
 3 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 99ced16ddc..33d6567f11 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -11,6 +11,7 @@ SYNOPSIS
 [verse]
 'git commit-graph' --write  [--pack-dir ]
 'git commit-graph' --read  [--pack-dir ]
+'git commit-graph' --clear [--pack-dir ]
 
 OPTIONS
 ---
@@ -18,16 +19,21 @@ OPTIONS
Use given directory for the location of packfiles, graph-head,
and graph files.
 
+--clear::
+   Delete the graph-head file and the graph file it references.
+   (Cannot be combined with --read or --write.)
+
 --read::
Read a graph file given by the graph-head file and output basic
-   details about the graph file. (Cannot be combined with --write.)
+   details about the graph file. (Cannot be combined with --clear
+   or --write.)
 
 --graph-id::
When used with --read, consider the graph file graph-.graph.
 
 --write::
Write a new graph file to the pack directory. (Cannot be combined
-   with --read.)
+   with --clear or --read.)
 
 --update-head::
When used with --write, update the graph-head file to point to
@@ -61,6 +67,12 @@ $ git commit-graph --write --update-head
 $ git commit-graph --read --graph-hash=
 
 
+* Delete /graph-head and the file it references.
++
+
+$ git commit-graph --clear --pack-dir=
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index d73cbc907d..4970dec133 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -10,6 +10,7 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
+   N_("git commit-graph --clear [--pack-dir ]"),
N_("git commit-graph --read [--graph-hash=]"),
N_("git commit-graph --write [--pack-dir ] [--update-head]"),
NULL
@@ -17,6 +18,7 @@ static char const * const builtin_commit_graph_usage[] = {
 
 static struct opts_commit_graph {
const char *pack_dir;
+   int clear;
int read;
const char *graph_hash;
int write;
@@ -25,6 +27,30 @@ static struct opts_commit_graph {
struct object_id old_graph_hash;
 } opts;
 
+static int graph_clear(void)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   char *old_path;
+
+   if (!opts.has_existing)
+   return 0;
+
+   strbuf_addstr(&head_path, opts.pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "graph-head");
+   if (remove_path(head_path.buf))
+   die("failed to remove path %s", head_path.buf);
+   strbuf_release(&head_path);
+
+   old_path = get_commit_graph_filename_hash(opts.pack_dir,
+ &opts.old_graph_hash);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+   free(old_path);
+
+   return 0;
+}
+
 static int graph_read(void)
 {
struct object_id graph_hash;
@@ -105,6 +131,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('c', "clear", &opts.clear,
+   N_("clear graph file and graph-head")),
OPT_BOOL('r', "read", &opts.read,
N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
@@ -126,7 +154,7 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 builtin_commit_graph_options,
 builtin_commit_graph_usage, 0);
 
-   if (opts.write + opts.read > 1)
+   if (opts.write + opts.read + opts.clear > 1)
usage_with_options(builtin_commit_graph_usage,
   builtin_commit_graph_options);
 
@@ -139,6 +167,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 
opts.has_existing = !!get_graph_head_hash(opts.pack_dir, 

[PATCH v2 03/14] commit-graph: create git-commit-graph builtin

2018-01-30 Thread Derrick Stolee
Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a '--pack-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore |  1 +
 Documentation/git-commit-graph.txt |  7 +++
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/commit-graph.c | 33 +
 command-list.txt   |  1 +
 git.c  |  1 +
 7 files changed, 45 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
new file mode 100644
index 00..c8ea548dfb
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,7 @@
+git-commit-graph(1)
+
+
+NAME
+
+git-commit-graph - Write and verify Git commit graphs (.graph files)
+
diff --git a/Makefile b/Makefile
index 1a9b23b679..aee5d3f7b9 100644
--- a/Makefile
+++ b/Makefile
@@ -965,6 +965,7 @@ BUILTIN_OBJS += builtin/for-each-ref.o
 BUILTIN_OBJS += builtin/fsck.o
 BUILTIN_OBJS += builtin/gc.o
 BUILTIN_OBJS += builtin/get-tar-commit-id.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/grep.o
 BUILTIN_OBJS += builtin/hash-object.o
 BUILTIN_OBJS += builtin/help.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const 
char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 00..2104550d25
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,33 @@
+#include "builtin.h"
+#include "cache.h"
+#include "config.h"
+#include "dir.h"
+#include "git-compat-util.h"
+#include "lockfile.h"
+#include "packfile.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+   N_("git commit-graph [--pack-dir ]"),
+   NULL
+};
+
+static struct opts_commit_graph {
+   const char *pack_dir;
+} opts;
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_commit_graph_options[] = {
+   { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+   N_("dir"),
+   N_("The pack directory to store the graph") },
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
+   return 0;
+}
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean   mainporcelain
 git-clone   mainporcelain   init
 git-column  purehelpers
 git-commit  mainporcelain   history
+git-commit-graphplumbingmanipulators
 git-commit-tree plumbingmanipulators
 git-config  ancillarymanipulators
 git-count-objects   ancillaryinterrogators
diff --git a/git.c b/git.c
index c870b9719c..c7b5adae7b 100644
--- a/git.c
+++ b/git.c
@@ -388,6 +388,7 @@ static struct cmd_struct commands[] = {
{ "clone", cmd_clone },
{ "column", cmd_column, RUN_SETUP_GENTLY },
{ "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE },
+   { "commit-graph", cmd_commit_graph, RUN_SETUP },
{ "commit-tree", cmd_commit_tree, RUN_SETUP },
{ "config", cmd_config, RUN_SETUP_GENTLY },
{ "count-objects", cmd_count_objects, RUN_SETUP },
-- 
2.16.0.15.g9c3cf44.dirty



[PATCH v2 00/14] Serialized Git Commit Graph

2018-01-30 Thread Derrick Stolee
Thanks to everyone who gave comments on v1. I tried my best to respond to
all of the feedback, but may have missed some while I was doing several
renames, including:

* builtin/graph.c -> builtin/commit-graph.c
* packed-graph.[c|h] -> commit-graph.[c|h]
* t/t5319-graph.sh -> t/t5318-commit-graph.sh

Because of these renames (and several type/function renames) the diff
is too large to conveniently share here.

Some issues that came up and are addressed:

* Use  instead of  when referring to the graph-.graph
  filenames and the contents of graph-head.
* 32-bit timestamps will not cause undefined behavior.
* timestamp_t is unsigned, so they are never negative.
* The config setting "core.commitgraph" now only controls consuming the
  graph during normal operations and will not block the commit-graph
  plumbing command.
* The --stdin-commits option is better about sanitizing the input for strings
  that do not parse to OIDs or are OIDs for non-commit objects.

One unresolved comment that I would like consensus on is the use of
globals to store the config setting and the graph state. I'm currently
using the pattern from packed_git instead of putting these values in
the_repository. However, we want to eventually remove globals like
packed_git. Should I deviate from the pattern _now_ in order to keep
the problem from growing, or should I keep to the known pattern?

Finally, I tried to clean up my incorrect style as I was recreating
these commits. Feel free to be merciless in style feedback now that the
architecture is more stable.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.
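The check described above follows a "try the graph first, fall back to the object buffer" pattern. The sketch below illustrates only that control flow. It does not reproduce parse_commit_gently(); `struct commit_sketch`, `fill_fn`, and the filler functions are hypothetical stand-ins.

```c
/* Hypothetical stand-in for struct commit; only what the sketch needs. */
struct commit_sketch {
	int parsed;     /* nonzero once metadata has been filled in */
	int from_graph; /* nonzero if the graph file supplied it */
};

/* A filler returns 0 on success, nonzero if it cannot supply the data. */
typedef int (*fill_fn)(struct commit_sketch *);

/* Toy fillers standing in for a graph lookup and a buffer parse. */
static int fill_from_graph_hit(struct commit_sketch *c)  { (void)c; return 0; }
static int fill_from_graph_miss(struct commit_sketch *c) { (void)c; return 1; }
static int fill_from_buffer(struct commit_sketch *c)     { (void)c; return 0; }

static int parse_commit_sketch(struct commit_sketch *c, int use_graph,
			       fill_fn from_graph, fill_fn from_buffer)
{
	if (c->parsed)
		return 0;
	/* Fast path: consult the graph only when the setting is enabled. */
	if (use_graph && from_graph(c) == 0) {
		c->parsed = 1;
		c->from_graph = 1;
		return 0;
	}
	/* Slow path: parse the commit object as before. */
	if (from_buffer(c) != 0)
		return -1;
	c->parsed = 1;
	return 0;
}
```

Here `use_graph` plays the role of the core.commitgraph setting: when it is off, the graph filler is never called.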

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.
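For readers who want a concrete picture of generation numbers before that follow-up patch exists, here is a minimal sketch based on the recursive definition in the design document (root commits get 1, other commits get one more than the maximum among their parents, and 0 means "not computed"). The types and names are hypothetical, the sketch is limited to two parents per commit, and it uses plain recursion, which would be too deep for long histories.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 2
#define NO_PARENT UINT32_MAX /* hypothetical marker for an empty slot */

struct node {
	uint32_t parents[MAX_PARENTS]; /* positions of parents, or NO_PARENT */
	uint32_t generation;           /* 0 = not computed yet */
};

/* Memoized evaluation of the recursive definition. */
static uint32_t compute_generation(struct node *nodes, uint32_t i)
{
	uint32_t max = 0;
	size_t k;

	if (nodes[i].generation)
		return nodes[i].generation;
	for (k = 0; k < MAX_PARENTS; k++) {
		uint32_t p = nodes[i].parents[k];
		if (p != NO_PARENT) {
			uint32_t g = compute_generation(nodes, p);
			if (g > max)
				max = g;
		}
	}
	nodes[i].generation = max + 1;
	return nodes[i].generation;
}

/*
 * Returns 1 when generation numbers alone prove that `a` cannot reach
 * `b`, and 0 when a walk is still needed. Both must be computed already.
 */
static int cannot_reach(const struct node *nodes, uint32_t a, uint32_t b)
{
	return a != b && nodes[a].generation <= nodes[b].generation;
}
```

A walk can use `cannot_reach()` to prune candidates without loading any commit data, which is the saving the design document anticipates.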

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitgraph true
  git show-ref -s | git commit-graph --write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitgraph' setting.

[1] 
https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement construct_commit_graph()
  commit-graph: implement git-commit-graph --write
  commit-graph: implement git-commit-graph --read
  commit-graph: implement git-commit-graph --update-head
  commit-graph: implement git-commit-graph --clear
  commit-graph: teach git-commit-graph --delete-expired
  commit-graph: add core.commitgraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: read only from specific pack-indexes
  commit-graph: close under

[PATCH v2 10/14] commit-graph: add core.commitgraph setting

2018-01-30 Thread Derrick Stolee
The commit graph feature is controlled by the new core.commitgraph config
setting. This defaults to 0, so the feature is opt-in.

The intention of core.commitgraph is that a user can always stop checking
for or parsing commit graph files if core.commitgraph=0.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/config.txt | 3 +++
 cache.h  | 1 +
 config.c | 5 +
 environment.c| 1 +
 4 files changed, 10 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..5b63559a2b 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -898,6 +898,9 @@ core.notesRef::
 This setting defaults to "refs/notes/commits", and it can be overridden by
 the `GIT_NOTES_REF` environment variable.  See linkgit:git-notes[1].
 
+core.commitgraph::
+   Enable git commit graph feature. Allows reading from .graph files.
+
 core.sparseCheckout::
Enable "sparse checkout" feature. See section "Sparse checkout" in
linkgit:git-read-tree[1] for more information.
diff --git a/cache.h b/cache.h
index d8b975a571..e50e447a4f 100644
--- a/cache.h
+++ b/cache.h
@@ -825,6 +825,7 @@ extern char *git_replace_ref_base;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_commitgraph;
 extern int precomposed_unicode;
 extern int protect_hfs;
 extern int protect_ntfs;
diff --git a/config.c b/config.c
index e617c2018d..99153fcfdb 100644
--- a/config.c
+++ b/config.c
@@ -1223,6 +1223,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.commitgraph")) {
+   core_commitgraph = git_config_bool(var, value);
+   return 0;
+   }
+
if (!strcmp(var, "core.sparsecheckout")) {
core_apply_sparse_checkout = git_config_bool(var, value);
return 0;
diff --git a/environment.c b/environment.c
index 63ac38a46f..faa4323cc5 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ enum object_creation_mode object_creation_mode = 
OBJECT_CREATION_MODE;
 char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
+int core_commitgraph;
 int merge_log_config = -1;
 int precomposed_unicode = -1; /* see probe_utf8_pathname_composition() */
 unsigned long pack_size_limit_cfg;
-- 
2.16.0.15.g9c3cf44.dirty



[PATCH v2 02/14] graph: add commit graph design document

2018-01-30 Thread Derrick Stolee
Add Documentation/technical/commit-graph.txt with details of the planned
commit graph feature, including future plans.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 189 +++
 1 file changed, 189 insertions(+)
 create mode 100644 Documentation/technical/commit-graph.txt

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
new file mode 100644
index 00..cbf88f7264
--- /dev/null
+++ b/Documentation/technical/commit-graph.txt
@@ -0,0 +1,189 @@
+Git Commit Graph Design Notes
+=============================
+
+Git walks the commit graph for many reasons, including:
+
+1. Listing and filtering commit history.
+2. Computing merge bases.
+
+These operations can become slow as the commit count grows. The merge
+base calculation shows up in many user-facing commands, such as 'merge-base'
+or 'git show --remerge-diff' and can take minutes to compute depending on
+history shape.
+
+There are two main costs here:
+
+1. Decompressing and parsing commits.
+2. Walking the entire graph to avoid topological order mistakes.
+
+The commit graph file is a supplemental data structure that accelerates
+commit graph walks. If a user downgrades or disables the 'core.commitgraph'
+config setting, then the existing ODB is sufficient. The file is stored
+next to packfiles either in the .git/objects/pack directory or in the pack
+directory of an alternate.
+
+The commit graph file stores the commit graph structure along with some
+extra metadata to speed up graph walks. By listing commit OIDs in lexi-
+cographic order, we can identify an integer position for each commit and
+refer to the parents of a commit using those integer positions. We use
+binary search to find initial commits and then use the integer positions
+for fast lookups during the walk.
+
+A consumer may load the following info for a commit from the graph:
+
+1. The commit OID.
+2. The list of parents, along with their integer position.
+3. The commit date.
+4. The root tree OID.
+5. The generation number (see definition below).
+
+Values 1-4 satisfy the requirements of parse_commit_gently().
+
+Define the "generation number" of a commit recursively as follows:
+
+ * A commit with no parents (a root commit) has generation number one.
+
+ * A commit with at least one parent has generation number one more than
+   the largest generation number among its parents.
+
+Equivalently, the generation number of a commit A is one more than the
+length of a longest path from A to a root commit. The recursive definition
+is easier to use for computation and observing the following property:
+
+If A and B are commits with generation numbers N and M, respectively,
+and N <= M, then A cannot reach B. That is, we know without searching
+that B is not an ancestor of A because it is further from a root commit
+than A.
+
+Conversely, when checking if A is an ancestor of B, then we only need
+to walk commits until all commits on the walk boundary have generation
+number at most N. If we walk commits using a priority queue seeded by
+generation numbers, then we always expand the boundary commit with highest
+generation number and can easily detect the stopping condition.
+
+This property can be used to significantly reduce the time it takes to
+walk commits and determine topological relationships. Without generation
+numbers, the general heuristic is the following:
+
+If A and B are commits with commit time X and Y, respectively, and
+X < Y, then A _probably_ cannot reach B.
+
+This heuristic is currently used whenever the computation can make
+mistakes with topological orders (such as "git log" with default order),
+but is not used when the topological order is required (such as merge
+base calculations, "git log --graph").
+
+In practice, we expect some commits to be created recently and not stored
+in the commit graph. We can treat these commits as having "infinite"
+generation number and walk until reaching commits with known generation
+number.
+
+Design Details
+--------------
+
+- A graph file is stored in a file named 'graph-.graph' in the pack
+  directory. This could be stored in an alternate.
+
+- The most-recent graph file hash is stored in a 'graph-head' file for
+  immediate access and storing backup graphs. This could be stored in an
+  alternate, and refers to a 'graph-.graph' file in the same pack
+  directory.
+
+- The core.commitgraph config setting must be on to consume graph files.
+
+- The file format includes parameters for the object id length and hash
+  algorithm, so a future change of hash algorithm does not require a change
+  in format.
+
+Current Limitations
+-------------------
+
+- Only one graph file is used at one time. This allows the integer position
+  to seek into the single graph file. It is possible to extend the mode

[PATCH v2 12/14] commit-graph: read only from specific pack-indexes

2018-01-30 Thread Derrick Stolee
Teach git-commit-graph to inspect the objects only in a certain list
of pack-indexes within the given pack directory. This allows updating
the commit graph iteratively, since we add all commits stored in a
previous commit graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 13 +
 builtin/commit-graph.c | 25 ++---
 commit-graph.c | 25 +++--
 commit-graph.h |  4 +++-
 packfile.c |  4 ++--
 packfile.h |  2 ++
 t/t5318-commit-graph.sh|  6 --
 7 files changed, 69 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 7b376e9212..d0571cd896 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -43,6 +43,11 @@ OPTIONS
When used with --write and --update-head, delete the graph file
previously referenced by graph-head.
 
+--stdin-packs::
+   When used with --write, generate the new graph by walking objects
+   only in the specified packfiles and any commits in the
+   existing graph-head.
+
 EXAMPLES
 
 
@@ -65,6 +70,14 @@ $ git commit-graph --write
 $ git commit-graph --write --update-head --delete-expired
 
 
+* Write a graph file, extending the current graph file using commits
+* in , update graph-head, and delete the old graph-.graph
+* file.
++
+
+$ echo  | git commit-graph --write --update-head --delete-expired --stdin-packs
+
+
 * Read basic information from a graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 766f09e6fc..80a409e784 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
N_("git commit-graph --clear [--pack-dir ]"),
N_("git commit-graph --read [--graph-hash=]"),
-   N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired]"),
+   N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"),
NULL
 };
 
@@ -24,6 +24,7 @@ static struct opts_commit_graph {
int write;
int update_head;
int delete_expired;
+   int stdin_packs;
int has_existing;
struct object_id old_graph_hash;
 } opts;
@@ -114,7 +115,24 @@ static void update_head_file(const char *pack_dir, const struct object_id *graph
 
 static int graph_write(void)
 {
-   struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
+   struct object_id *graph_hash;
+   char **pack_indexes = NULL;
+   int num_packs = 0;
+   int size_packs = 0;
+
+   if (opts.stdin_packs) {
+   struct strbuf buf = STRBUF_INIT;
+   size_packs = 128;
+   ALLOC_ARRAY(pack_indexes, size_packs);
+
+   while (strbuf_getline(&buf, stdin) != EOF) {
+   ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
+   pack_indexes[num_packs++] = buf.buf;
+   strbuf_detach(&buf, NULL);
+   }
+   }
+
+   graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs);
 
if (opts.update_head)
update_head_file(opts.pack_dir, graph_hash);
@@ -122,7 +140,6 @@ static int graph_write(void)
if (graph_hash)
printf("%s\n", oid_to_hex(graph_hash));
 
-
if (opts.delete_expired && opts.update_head && opts.has_existing &&
oidcmp(graph_hash, &opts.old_graph_hash)) {
char *old_path = get_commit_graph_filename_hash(opts.pack_dir,
@@ -153,6 +170,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
N_("update graph-head to written graph file")),
OPT_BOOL('d', "delete-expired", &opts.delete_expired,
N_("delete expired head graph file")),
+   OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
+   N_("only scan packfiles listed by stdin")),
{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
N_("hash"),
N_("A hash for a specific graph file in the pack-dir."),
diff --git a/commit-graph.c b/commit-graph.c
index fc816533c6..e5a1d9ee8b 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -638,7 +638,9 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
return 0;
 }
 
-struct object_id 

[PATCH v2 09/14] commit-graph: teach git-commit-graph --delete-expired

2018-01-30 Thread Derrick Stolee
Teach git-commit-graph to delete the graph file previously referenced by
'graph-head' when writing a new graph file and updating 'graph-head'. This
keeps stale graph files from accumulating in the pack directory. Be careful
not to delete the graph file if its hash did not change.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 +++--
 builtin/commit-graph.c | 16 -
 t/t5318-commit-graph.sh| 66 +-
 3 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 33d6567f11..7b376e9212 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -39,6 +39,10 @@ OPTIONS
When used with --write, update the graph-head file to point to
the written graph file.
 
+--delete-expired::
+   When used with --write and --update-head, delete the graph file
+   previously referenced by graph-head.
+
 EXAMPLES
 
 
@@ -55,10 +59,10 @@ $ git commit-graph --write
 
 
 * Write a graph file for the packed commits in your local .git folder,
-* and update graph-head.
+* update graph-head, and delete the old graph-<hash>.graph file.
 +
 
-$ git commit-graph --write --update-head
+$ git commit-graph --write --update-head --delete-expired
 
 
 * Read basic information from a graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 4970dec133..766f09e6fc 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
N_("git commit-graph --clear [--pack-dir ]"),
N_("git commit-graph --read [--graph-hash=]"),
-   N_("git commit-graph --write [--pack-dir ] [--update-head]"),
+   N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired]"),
NULL
 };
 
@@ -23,6 +23,7 @@ static struct opts_commit_graph {
const char *graph_hash;
int write;
int update_head;
+   int delete_expired;
int has_existing;
struct object_id old_graph_hash;
 } opts;
@@ -121,6 +122,17 @@ static int graph_write(void)
if (graph_hash)
printf("%s\n", oid_to_hex(graph_hash));
 
+
+   if (opts.delete_expired && opts.update_head && opts.has_existing &&
+   oidcmp(graph_hash, &opts.old_graph_hash)) {
+   char *old_path = get_commit_graph_filename_hash(opts.pack_dir,
+   &opts.old_graph_hash);
+   if (remove_path(old_path))
+   die("failed to remove path %s", old_path);
+
+   free(old_path);
+   }
+
free(graph_hash);
return 0;
 }
@@ -139,6 +151,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
N_("write commit graph file")),
OPT_BOOL('u', "update-head", &opts.update_head,
N_("update graph-head to written graph file")),
+   OPT_BOOL('d', "delete-expired", &opts.delete_expired,
+   N_("delete expired head graph file")),
{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
N_("hash"),
N_("A hash for a specific graph file in the pack-dir."),
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 6e3b62b754..b56a6d4217 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -101,9 +101,73 @@ test_expect_success 'write graph with merges' \
  _graph_read_expect "18" "${packdir}" &&
  cmp expect output'
 
+test_expect_success 'Add more commits' \
+'for i in $(test_seq 16 20)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+# Current graph structure:
+#
+#  20
+#   |
+#  19
+#   |
+#  18
+#   |
+#  17
+#   |
+#  16
+#   |
+#  M3
+# / |\_
+#/ 10  15
+#   /   |  |
+#  /9 M2   14
+# | |/  \  |
+# | 8 M1 | 13
+# | |/ | \_|
+# 5 7  |   12
+# | |   \__|
+# 4 6  11
+# |/__/
+# 3
+# |
+# 2
+# |
+# 1
+
+test_expect_success 'write graph with merges' \
+'graph3=$(git commit-graph --write --update-head --delete-expired) &&
+ test_path_is_file ${packdir}/graph-${graph3}.graph &&
+ test_path_is_missing ${packdir}/graph-${graph2}.graph &&
+   

[PATCH v2 05/14] commit-graph: implement git-commit-graph --write

2018-01-30 Thread Derrick Stolee
Teach git-commit-graph to write graph files. Create a new test script to
verify that the command succeeds.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 18 +++
 builtin/commit-graph.c | 30 
 t/t5318-commit-graph.sh| 96 ++
 3 files changed, 144 insertions(+)
 create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index c8ea548dfb..3f3790d9a8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,3 +5,21 @@ NAME
 
 git-commit-graph - Write and verify Git commit graphs (.graph files)
 
+
+SYNOPSIS
+
+[verse]
+'git commit-graph' --write  [--pack-dir ]
+
+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph --write
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 2104550d25..7affd512f1 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -6,22 +6,38 @@
 #include "lockfile.h"
 #include "packfile.h"
 #include "parse-options.h"
+#include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
+   N_("git commit-graph --write [--pack-dir ]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *pack_dir;
+   int write;
 } opts;
 
+static int graph_write(void)
+{
+   struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
+
+   if (graph_hash)
+   printf("%s\n", oid_to_hex(graph_hash));
+
+   free(graph_hash);
+   return 0;
+}
+
 int cmd_commit_graph(int argc, const char **argv, const char *prefix)
 {
static struct option builtin_commit_graph_options[] = {
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('w', "write", &opts.write,
+   N_("write commit graph file")),
OPT_END(),
};
 
@@ -29,5 +45,19 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
usage_with_options(builtin_commit_graph_usage,
   builtin_commit_graph_options);
 
+   argc = parse_options(argc, argv, prefix,
+builtin_commit_graph_options,
+builtin_commit_graph_usage, 0);
+
+   if (!opts.pack_dir) {
+   struct strbuf path = STRBUF_INIT;
+   strbuf_addstr(&path, get_object_directory());
+   strbuf_addstr(&path, "/pack");
+   opts.pack_dir = strbuf_detach(&path, NULL);
+   }
+
+   if (opts.write)
+   return graph_write();
+
return 0;
 }
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 00..6bcd1cc264
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,96 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' \
+'rm -rf .git &&
+ mkdir full &&
+ cd full &&
+ git init &&
+ git config core.commitgraph true &&
+ git config pack.threads 1 &&
+ packdir=".git/objects/pack"'
+
+test_expect_success 'write graph with no packs' \
+'git commit-graph --write --pack-dir .'
+
+test_expect_success 'create commits and repack' \
+'for i in $(test_seq 5)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success 'write graph' \
+'graph1=$(git commit-graph --write) &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph'
+
+test_expect_success 'Add more commits' \
+'git reset --hard commits/3 &&
+ for i in $(test_seq 6 10)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git reset --hard commits/3 &&
+ for i in $(test_seq 11 15)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git reset --hard commits/7 &&
+ git merge commits/11 &&
+ git branch merge/1 &&
+ git reset --hard commits

[PATCH v2 07/14] commit-graph: implement git-commit-graph --update-head

2018-01-30 Thread Derrick Stolee
It is possible to have multiple commit graph files in a pack directory,
but only one is used at a time. Use a 'graph-head' file to point to the
file in use. Teach git-commit-graph to write 'graph-head' upon writing a
new commit graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 34 ++
 builtin/commit-graph.c | 38 +++---
 commit-graph.c | 25 +
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh| 12 ++--
 5 files changed, 106 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index 09aeaf6c82..99ced16ddc 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -12,15 +12,49 @@ SYNOPSIS
 'git commit-graph' --write  [--pack-dir ]
 'git commit-graph' --read  [--pack-dir ]
 
+OPTIONS
+---
+--pack-dir::
+   Use given directory for the location of packfiles, graph-head,
+   and graph files.
+
+--read::
+   Read a graph file given by the graph-head file and output basic
+   details about the graph file. (Cannot be combined with --write.)
+
+--graph-id::
+   When used with --read, consider the graph file graph-.graph.
+
+--write::
+   Write a new graph file to the pack directory. (Cannot be combined
+   with --read.)
+
+--update-head::
+   When used with --write, update the graph-head file to point to
+   the written graph file.
+
 EXAMPLES
 
 
+* Output the hash of the graph file pointed to by /graph-head.
++
+
+$ git commit-graph --pack-dir=
+
+
 * Write a commit graph file for the packed commits in your local .git folder.
 +
 
 $ git commit-graph --write
 
 
+* Write a graph file for the packed commits in your local .git folder,
+* and update graph-head.
++
+
+$ git commit-graph --write --update-head
+
+
 * Read basic information from a graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 218740b1f8..d73cbc907d 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -11,7 +11,7 @@
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
N_("git commit-graph --read [--graph-hash=]"),
-   N_("git commit-graph --write [--pack-dir ]"),
+   N_("git commit-graph --write [--pack-dir ] [--update-head]"),
NULL
 };
 
@@ -20,6 +20,9 @@ static struct opts_commit_graph {
int read;
const char *graph_hash;
int write;
+   int update_head;
+   int has_existing;
+   struct object_id old_graph_hash;
 } opts;
 
 static int graph_read(void)
@@ -30,8 +33,8 @@ static int graph_read(void)
 
if (opts.graph_hash && strlen(opts.graph_hash) == GIT_MAX_HEXSZ)
get_oid_hex(opts.graph_hash, &graph_hash);
-   else
-   die("no graph hash specified");
+   else if (!get_graph_head_hash(opts.pack_dir, &graph_hash))
+   die("no graph-head exists");
 
graph_file = get_commit_graph_filename_hash(opts.pack_dir, &graph_hash);
graph = load_commit_graph_one(graph_file, opts.pack_dir);
@@ -62,10 +65,33 @@ static int graph_read(void)
return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id *graph_hash)
+{
+   struct strbuf head_path = STRBUF_INIT;
+   int fd;
+   struct lock_file lk = LOCK_INIT;
+
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/");
+   strbuf_addstr(&head_path, "graph-head");
+
+   fd = hold_lock_file_for_update(&lk, head_path.buf, LOCK_DIE_ON_ERROR);
+   strbuf_release(&head_path);
+
+   if (fd < 0)
+   die_errno("unable to open graph-head");
+
+   write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
+   commit_lock_file(&lk);
+}
+
 static int graph_write(void)
 {
struct object_id *graph_hash = construct_commit_graph(opts.pack_dir);
 
+   if (opts.update_head)
+   update_head_file(opts.pack_dir, graph_hash);
+
if (graph_hash)
printf("%s\n", oid_to_hex(graph_hash));
 
@@ -83,6 +109,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
N_("read graph file")),
OPT_BOOL('w', "write", &opts.write,
N_("write commit graph file")),
+   OPT_BOOL('u', &quo

[PATCH v2 13/14] commit-graph: close under reachability

2018-01-30 Thread Derrick Stolee
Teach construct_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c  | 26 ++
 t/t5318-commit-graph.sh | 14 ++
 2 files changed, 40 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index e5a1d9ee8b..cfa0415a21 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -5,6 +5,7 @@
 #include "packfile.h"
 #include "commit.h"
 #include "object.h"
+#include "revision.h"
 #include "commit-graph.h"
 
 #define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
@@ -638,6 +639,29 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct rev_info revs;
+   struct commit *commit;
+   init_revisions(&revs, NULL);
+
+   for (i = 0; i < oids->num; i++) {
+   commit = lookup_commit(oids->list[i]);
+   if (commit && !parse_commit(commit))
+   revs.commits = commit_list_insert(commit, &revs.commits);
+   }
+
+   if (prepare_revision_walk(&revs))
+   die(_("revision walk setup failed"));
+
+   while ((commit = get_revision(&revs)) != NULL) {
+   ALLOC_GROW(oids->list, oids->num + 1, oids->size);
+   oids->list[oids->num] = &(commit->object.oid);
+   (oids->num)++;
+   }
+}
+
 struct object_id *construct_commit_graph(const char *pack_dir,
 char **pack_indexes,
 int nr_packs)
@@ -696,6 +720,8 @@ struct object_id *construct_commit_graph(const char *pack_dir,
} else {
for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
}
+
+   close_reachable(&oids);
QSORT(oids.list, oids.num, commit_compare);
 
count_distinct = 1;
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index b9a73f398c..2001b0b5b5 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -213,6 +213,20 @@ test_expect_success 'clear graph' \
 _graph_git_behavior commits/20 merge/1
 _graph_git_behavior commits/20 merge/2
 
+test_expect_success 'build graph from latest pack with closure' \
+'graph5=$(cat new-idx | git commit-graph --write --update-head --stdin-packs) &&
+ test_path_is_file ${packdir}/graph-${graph5}.graph &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph &&
+ test_path_is_file ${packdir}/graph-head &&
+ echo ${graph5} >expect &&
+ cmp -n 40 expect ${packdir}/graph-head &&
+ git commit-graph --read --graph-hash=${graph5} >output &&
+ _graph_read_expect "21" "${packdir}" &&
+ cmp expect output'
+
+_graph_git_behavior commits/20 merge/1
+_graph_git_behavior commits/20 merge/2
+
 test_expect_success 'setup bare repo' \
 'cd .. &&
  git clone --bare full bare &&
-- 
2.16.0.15.g9c3cf44.dirty
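
To make the "close under reachability" step in this patch concrete, here
is a minimal standalone sketch. It does not use Git's revision-walk
machinery (init_revisions(), get_revision()); instead it assumes a toy
commit array where parents are stored as indexes. Every name in it
(sketch_commit, sketch_close_reachable, SKETCH_NO_PARENT) is hypothetical
and exists only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NO_PARENT (-1)

/* Toy commit: up to two parents, stored as indexes into an array. */
struct sketch_commit {
	int parents[2];
};

/*
 * Mark every ancestor of an already-marked commit. in_set[i] is nonzero
 * if commit i is in the set; stack must have room for nr entries, since
 * each commit is pushed at most once. Returns the size of the closed set.
 */
static size_t sketch_close_reachable(const struct sketch_commit *commits,
				     size_t nr, unsigned char *in_set,
				     size_t *stack)
{
	size_t top = 0, count = 0, i;

	/* Seed the walk with the commits already in the set. */
	for (i = 0; i < nr; i++) {
		if (in_set[i]) {
			stack[top++] = i;
			count++;
		}
	}

	/* Depth-first walk over parents, adding each one only once. */
	while (top) {
		const struct sketch_commit *c = &commits[stack[--top]];
		int p;

		for (p = 0; p < 2; p++) {
			int parent = c->parents[p];

			if (parent == SKETCH_NO_PARENT || in_set[parent])
				continue;
			in_set[parent] = 1;
			stack[top++] = (size_t)parent;
			count++;
		}
	}

	return count;
}
```

Seeding the set with a single merge commit pulls in both of its parents
and their shared root, while an unrelated root commit stays out. That is
the gap-filling behavior the commit message describes for commits that
live in loose objects or in packfiles that were not scanned.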



[PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-01-30 Thread Derrick Stolee
Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Makefile   |   1 +
 commit-graph.c | 376 +
 commit-graph.h |  20 +++
 3 files changed, 397 insertions(+)
 create mode 100644 commit-graph.c
 create mode 100644 commit-graph.h

diff --git a/Makefile b/Makefile
index aee5d3f7b9..894432b35b 100644
--- a/Makefile
+++ b/Makefile
@@ -773,6 +773,7 @@ LIB_OBJS += color.o
 LIB_OBJS += column.o
 LIB_OBJS += combine-diff.o
 LIB_OBJS += commit.o
+LIB_OBJS += commit-graph.o
 LIB_OBJS += compat/obstack.o
 LIB_OBJS += compat/terminal.o
 LIB_OBJS += config.o
diff --git a/commit-graph.c b/commit-graph.c
new file mode 100644
index 00..db2b7390c7
--- /dev/null
+++ b/commit-graph.c
@@ -0,0 +1,376 @@
+#include "cache.h"
+#include "config.h"
+#include "git-compat-util.h"
+#include "pack.h"
+#include "packfile.h"
+#include "commit.h"
+#include "object.h"
+#include "commit-graph.h"
+
+#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
+#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
+#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
+#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
+#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */
+
+#define GRAPH_DATA_WIDTH 36
+
+#define GRAPH_VERSION_1 0x1
+#define GRAPH_VERSION GRAPH_VERSION_1
+
+#define GRAPH_OID_VERSION_SHA1 1
+#define GRAPH_OID_LEN_SHA1 20
+#define GRAPH_OID_VERSION GRAPH_OID_VERSION_SHA1
+#define GRAPH_OID_LEN GRAPH_OID_LEN_SHA1
+
+#define GRAPH_LARGE_EDGES_NEEDED 0x80000000
+#define GRAPH_PARENT_MISSING 0x7fffffff
+#define GRAPH_EDGE_LAST_MASK 0x7fffffff
+#define GRAPH_PARENT_NONE 0x70000000
+
+#define GRAPH_LAST_EDGE 0x80000000
+
+#define GRAPH_FANOUT_SIZE (4*256)
+#define GRAPH_CHUNKLOOKUP_SIZE (5 * 12)
+#define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
+   GRAPH_OID_LEN + sizeof(struct commit_graph_header))
+
+char* get_commit_graph_filename_hash(const char *pack_dir,
+struct object_id *hash)
+{
+   size_t len;
+   struct strbuf head_path = STRBUF_INIT;
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/graph-");
+   strbuf_addstr(&head_path, oid_to_hex(hash));
+   strbuf_addstr(&head_path, ".graph");
+
+   return strbuf_detach(&head_path, &len);
+}
+
+static void write_graph_chunk_fanout(struct sha1file *f,
+struct commit **commits,
+int nr_commits)
+{
+   uint32_t i, count = 0;
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+
+   /*
+* Write the first-level table (the list is sorted,
+* but we use a 256-entry lookup to be able to avoid
+* having to do eight extra binary search iterations).
+*/
+   for (i = 0; i < 256; i++) {
+   uint32_t swap_count;
+
+   while (list < last) {
+   if ((*list)->object.oid.hash[0] != i)
+   break;
+   count++;
+   list++;
+   }
+
+   swap_count = htonl(count);
+   sha1write(f, &swap_count, 4);
+   }
+}
+
+static void write_graph_chunk_oids(struct sha1file *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list, **last = commits + nr_commits;
+   for (list = commits; list < last; list++)
+   sha1write(f, (*list)->object.oid.hash, (int)hash_len);
+}
+
+static int commit_pos(struct commit **commits, int nr_commits,
+ const struct object_id *oid, uint32_t *pos)
+{
+   uint32_t first = 0, last = nr_commits;
+
+   while (first < last) {
+   uint32_t mid = first + (last - first) / 2;
+   struct object_id *current;
+   int cmp;
+
+   current = &(commits[mid]->object.oid);
+   cmp = oidcmp(oid, current);
+   if (!cmp) {
+   *pos = mid;
+   return 1;
+   }
+   if (cmp > 0) {
+   first = mid + 1;
+   continue;
+   }
+   last = mid;
+   }
+
+   *pos = first;
+   return 0;
+}
+
+static void write_graph_chunk_data(struct sha1file *f, int hash_len,
+  struct commit **commits, int nr_commits)
+{
+   struct commit **list = commits;
+   struct commit **last = commits + nr_commits;
+   uint32_t num_large_edges = 0;
+
+   while (list < last) {
+   s

[PATCH v2 14/14] commit-graph: build graph from starting commits

2018-01-30 Thread Derrick Stolee
Teach git-commit-graph to read commits from stdin when the
--stdin-commits flag is specified. Commits reachable from these
commits are added to the graph. This is a much faster way to construct
the graph than inspecting all packed objects, but is restricted to
known tips.

For the Linux repository, 700,000+ commits were added to the graph
file starting from 'master' in 7-9 seconds, depending on the number
of packfiles in the repo (1, 24, or 120).

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  7 ++-
 builtin/commit-graph.c | 34 +-
 commit-graph.c | 26 +++---
 commit-graph.h |  4 +++-
 t/t5318-commit-graph.sh| 18 ++
 5 files changed, 75 insertions(+), 14 deletions(-)

diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
index d0571cd896..3357c0cf8f 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -46,7 +46,12 @@ OPTIONS
 --stdin-packs::
When used with --write, generate the new graph by walking objects
only in the specified packfiles and any commits in the
-   existing graph-head.
+   existing graph-head. (Cannot be combined with --stdin-commits.)
+
+--stdin-commits::
+   When used with --write, generate the new graph by walking commits
+   starting at the commits specified in stdin as a list of OIDs in
+   hex, one OID per line. (Cannot be combined with --stdin-packs.)
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 80a409e784..adc05f0582 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -12,7 +12,7 @@ static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
N_("git commit-graph --clear [--pack-dir ]"),
N_("git commit-graph --read [--graph-hash=]"),
-   N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs]"),
+   N_("git commit-graph --write [--pack-dir ] [--update-head] [--delete-expired] [--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -25,6 +25,7 @@ static struct opts_commit_graph {
int update_head;
int delete_expired;
int stdin_packs;
+   int stdin_commits;
int has_existing;
struct object_id old_graph_hash;
 } opts;
@@ -117,23 +118,36 @@ static int graph_write(void)
 {
struct object_id *graph_hash;
char **pack_indexes = NULL;
+   char **commits = NULL;
int num_packs = 0;
-   int size_packs = 0;
+   int num_commits = 0;
+   char **lines = NULL;
+   int num_lines = 0;
+   int size_lines = 0;
 
-   if (opts.stdin_packs) {
+   if (opts.stdin_packs || opts.stdin_commits) {
struct strbuf buf = STRBUF_INIT;
-   size_packs = 128;
-   ALLOC_ARRAY(pack_indexes, size_packs);
+   size_lines = 128;
+   ALLOC_ARRAY(lines, size_lines);
 
while (strbuf_getline(&buf, stdin) != EOF) {
-   ALLOC_GROW(pack_indexes, num_packs + 1, size_packs);
-   pack_indexes[num_packs++] = buf.buf;
+   ALLOC_GROW(lines, num_lines + 1, size_lines);
+   lines[num_lines++] = buf.buf;
strbuf_detach(&buf, NULL);
}
-   }
 
-   graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs);
+   if (opts.stdin_packs) {
+   pack_indexes = lines;
+   num_packs = num_lines;
+   }
+   if (opts.stdin_commits) {
+   commits = lines;
+   num_commits = num_lines;
+   }
+   }
 
+   graph_hash = construct_commit_graph(opts.pack_dir, pack_indexes, num_packs,
+   commits, num_commits);
if (opts.update_head)
update_head_file(opts.pack_dir, graph_hash);
 
@@ -172,6 +186,8 @@ int cmd_commit_graph(int argc, const char **argv, const char *prefix)
N_("delete expired head graph file")),
OPT_BOOL('s', "stdin-packs", &opts.stdin_packs,
N_("only scan packfiles listed by stdin")),
+   OPT_BOOL('C', "stdin-commits", &opts.stdin_commits,
+   N_("start walk at commits listed by stdin")),
{ OPTION_STRING, 'H', "graph-hash", &opts.graph_hash,
N_("hash"),
N_("A hash for a specific graph file in the pack-dir."),
diff --git a/commit-graph.c b/commit-graph.c
index cfa0415a21..7f31a6c795 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -66

Re: [PATCH v2 10/27] protocol: introduce enum protocol_version value protocol_v2

2018-01-31 Thread Derrick Stolee

On 1/25/2018 6:58 PM, Brandon Williams wrote:

Introduce protocol_v2, a new value for 'enum protocol_version'.
Subsequent patches will fill in the implementation of protocol_v2.

Signed-off-by: Brandon Williams 
---
  builtin/fetch-pack.c   | 3 +++
  builtin/receive-pack.c | 6 ++
  builtin/send-pack.c| 3 +++
  builtin/upload-pack.c  | 7 +++
  connect.c  | 3 +++
  protocol.c | 2 ++
  protocol.h | 1 +
  remote-curl.c  | 3 +++
  transport.c| 9 +
  9 files changed, 37 insertions(+)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 85d4faf76..f492e8abd 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -201,6 +201,9 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
   PACKET_READ_GENTLE_ON_EOF);
  
  	switch (discover_version(&reader)) {

+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
case protocol_v0:
get_remote_heads(, , 0, NULL, );
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index b7ce7c7f5..3656e94fd 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1963,6 +1963,12 @@ int cmd_receive_pack(int argc, const char **argv, const char *prefix)
unpack_limit = receive_unpack_limit;
  
  	switch (determine_protocol_version_server()) {

+   case protocol_v2:
+   /*
+* push support for protocol v2 has not been implemented yet,
+* so ignore the request to use v2 and fallback to using v0.
+*/
+   break;
case protocol_v1:
/*
 * v1 is just the original protocol with a version string,
diff --git a/builtin/send-pack.c b/builtin/send-pack.c
index 83cb125a6..b5427f75e 100644
--- a/builtin/send-pack.c
+++ b/builtin/send-pack.c
@@ -263,6 +263,9 @@ int cmd_send_pack(int argc, const char **argv, const char *prefix)
   PACKET_READ_GENTLE_ON_EOF);
  
  	switch (discover_version(&reader)) {

+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
case protocol_v0:
get_remote_heads(, _refs, REF_NORMAL,
diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c
index 2cb5cb35b..8d53e9794 100644
--- a/builtin/upload-pack.c
+++ b/builtin/upload-pack.c
@@ -47,6 +47,13 @@ int cmd_upload_pack(int argc, const char **argv, const char *prefix)
die("'%s' does not appear to be a git repository", dir);
  
  	switch (determine_protocol_version_server()) {

+   case protocol_v2:
+   /*
+* fetch support for protocol v2 has not been implemented yet,
+* so ignore the request to use v2 and fallback to using v0.
+*/
+   upload_pack();
+   break;
case protocol_v1:
/*
 * v1 is just the original protocol with a version string,
diff --git a/connect.c b/connect.c
index db3c9d24c..f2157a821 100644
--- a/connect.c
+++ b/connect.c
@@ -84,6 +84,9 @@ enum protocol_version discover_version(struct packet_reader *reader)
  
  	/* Maybe process capabilities here, at least for v2 */

switch (version) {
+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
/* Read the peeked version line */
packet_reader_read(reader);
diff --git a/protocol.c b/protocol.c
index 43012b7eb..5e636785d 100644
--- a/protocol.c
+++ b/protocol.c
@@ -8,6 +8,8 @@ static enum protocol_version parse_protocol_version(const char *value)
return protocol_v0;
else if (!strcmp(value, "1"))
return protocol_v1;
+   else if (!strcmp(value, "2"))
+   return protocol_v2;
else
return protocol_unknown_version;
  }
diff --git a/protocol.h b/protocol.h
index 1b2bc94a8..2ad35e433 100644
--- a/protocol.h
+++ b/protocol.h
@@ -5,6 +5,7 @@ enum protocol_version {
protocol_unknown_version = -1,
protocol_v0 = 0,
protocol_v1 = 1,
+   protocol_v2 = 2,
  };
  
  /*

diff --git a/remote-curl.c b/remote-curl.c
index 9f6d07683..dae8a4a48 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -185,6 +185,9 @@ static struct ref *parse_git_refs(struct discovery *heads, int for_push)
   PACKET_READ_GENTLE_ON_EOF);
  
  	switch (discover_version(&reader)) {

+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
case protocol_v0:
get_remote_heads(, , for_push ? REF_NORMAL : 0,
diff --git a/transport.c b/transport.c
index 2378dcb38..83d9dd1df 100644
--- 

Re: [PATCH v2 09/27] transport: store protocol version

2018-01-31 Thread Derrick Stolee

On 1/25/2018 6:58 PM, Brandon Williams wrote:

+   switch (data->version) {
+   case protocol_v1:
+   case protocol_v0:
+   refs = fetch_pack(&args, data->fd, data->conn,
+ refs_tmp ? refs_tmp : transport->remote_refs,
+ dest, to_fetch, nr_heads, &data->shallow,
+ &transport->pack_lockfile);
+   break;
+   case protocol_unknown_version:
+   BUG("unknown protocol version");
+   }


After seeing this pattern a few times, I think it would be good to 
convert it to a macro that calls a statement for protocol_v1/v0 (and 
later calls a different one for protocol_v2). It would at minimum reduce 
the code clones surrounding this handling of unknown_version, and we 
could have one place that is clear this BUG() is due to an unexpected 
response from discover_version().
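
A rough sketch of what such a macro could look like follows. It is only
an illustration of the idea, not something taken from the series: the
names SKETCH_SWITCH_VERSION, SKETCH_BUG and sketch_transfer_kind are
hypothetical, and SKETCH_BUG is a local stand-in for Git's BUG() so the
snippet compiles on its own. The enum mirrors the one in protocol.h.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Mirrors enum protocol_version from protocol.h in the quoted series. */
enum protocol_version {
	protocol_unknown_version = -1,
	protocol_v0 = 0,
	protocol_v1 = 1,
	protocol_v2 = 2,
};

/* Local stand-in for Git's BUG(); hypothetical, for this sketch only. */
#define SKETCH_BUG(msg) do { \
	fprintf(stderr, "BUG: %s\n", (msg)); \
	abort(); \
} while (0)

/*
 * Hypothetical helper: run v0_v1_stmt for protocol v0/v1 and v2_stmt
 * for protocol v2. The unknown-version case lives in exactly one place.
 */
#define SKETCH_SWITCH_VERSION(version, v0_v1_stmt, v2_stmt) do { \
	switch (version) { \
	case protocol_v2: \
		v2_stmt; \
		break; \
	case protocol_v1: \
	case protocol_v0: \
		v0_v1_stmt; \
		break; \
	case protocol_unknown_version: \
		SKETCH_BUG("unknown protocol version"); \
	} \
} while (0)

/* Example caller: returns 1 for the v0/v1 code path and 2 for v2. */
static int sketch_transfer_kind(enum protocol_version version)
{
	int kind = 0;

	SKETCH_SWITCH_VERSION(version, kind = 1, kind = 2);
	assert(kind != 0);
	return kind;
}
```

One drawback of this shape is that each statement has to fit in a single
macro argument, so a call with several arguments works but a multi-line
body would need to be moved into a helper function first.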




Re: [PATCH v2 05/27] upload-pack: factor out processing lines

2018-01-31 Thread Derrick Stolee

On 1/26/2018 4:33 PM, Brandon Williams wrote:

On 01/26, Stefan Beller wrote:

On Thu, Jan 25, 2018 at 3:58 PM, Brandon Williams  wrote:

Factor out the logic for processing shallow, deepen, deepen_since, and
deepen_not lines into their own functions to simplify the
'receive_needs()' function in addition to making it easier to reuse some
of this logic when implementing protocol_v2.

Signed-off-by: Brandon Williams 
---
  upload-pack.c | 113 ++
  1 file changed, 74 insertions(+), 39 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 2ad73a98b..42d83d5b1 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -724,6 +724,75 @@ static void deepen_by_rev_list(int ac, const char **av,
 packet_flush(1);
  }

+static int process_shallow(const char *line, struct object_array *shallows)
+{
+   const char *arg;
+   if (skip_prefix(line, "shallow ", &arg)) {

stylistic nit:

 You could invert the condition in each of the process_* functions
 to just have

 if (!skip_prefix(...))
 return 0;

 /* less indented code goes here */

 return 1;

 That way we have less indentation as well as easier code.
 (The reader doesn't need to keep in mind what the else
 part is about; it is a rather local decision to bail out instead
 of having the return at the end of the function.)

I was trying to move the existing code into helper functions so
rewriting them in transit may make it less reviewable?


I think the way you kept to the existing code as much as possible is 
good and easier to review. Perhaps a style pass after the patch lands is 
good for #leftoverbits.
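
For reference, a rough sketch of what that later style pass could look like for process_deepen(), using the early-return shape Stefan describes. Assumptions: skip_prefix_sketch() is a local stand-in for Git's skip_prefix(), and a return value of -1 replaces the die() call so the example is self-contained.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Local stand-in for Git's skip_prefix(): on a match, *out points past the prefix. */
static int skip_prefix_sketch(const char *str, const char *prefix, const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

/*
 * Early-return variant of process_deepen().
 * Returns 1 if the line was consumed, 0 if it is not a "deepen" line,
 * and -1 if it is malformed (the patch calls die() at that point).
 */
static int process_deepen_sketch(const char *line, int *depth)
{
	const char *arg;
	char *end = NULL;
	long value;

	assert(line && depth);
	if (!skip_prefix_sketch(line, "deepen ", &arg))
		return 0;

	/* Less indented code goes here. */
	value = strtol(arg, &end, 0);
	if (end == arg || *end || value <= 0 || value > INT_MAX)
		return -1;
	*depth = (int)value;
	return 1;
}
```

The behaviour is meant to match the quoted helper apart from the error reporting; only the control flow is flattened.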





+   struct object_id oid;
+   struct object *object;
+   if (get_oid_hex(arg, &oid))
+   die("invalid shallow line: %s", line);
+   object = parse_object(&oid);
+   if (!object)
+   return 1;
+   if (object->type != OBJ_COMMIT)
+   die("invalid shallow object %s", oid_to_hex(&oid));
+   if (!(object->flags & CLIENT_SHALLOW)) {
+   object->flags |= CLIENT_SHALLOW;
+   add_object_array(object, NULL, shallows);
+   }
+   return 1;
+   }
+
+   return 0;
+}
+
+static int process_deepen(const char *line, int *depth)
+{
+   const char *arg;
+   if (skip_prefix(line, "deepen ", &arg)) {
+   char *end = NULL;
+   *depth = (int) strtol(arg, &end, 0);


nit: space between (int) and strtol?


+   if (!end || *end || *depth <= 0)
+   die("Invalid deepen: %s", line);
+   return 1;
+   }
+
+   return 0;
+}
+
+static int process_deepen_since(const char *line, timestamp_t *deepen_since, 
int *deepen_rev_list)
+{
+   const char *arg;
+   if (skip_prefix(line, "deepen-since ", &arg)) {
+   char *end = NULL;
+   *deepen_since = parse_timestamp(arg, &end, 0);
+   if (!end || *end || !deepen_since ||
+   /* revisions.c's max_age -1 is special */
+   *deepen_since == -1)
+   die("Invalid deepen-since: %s", line);
+   *deepen_rev_list = 1;
+   return 1;
+   }
+   return 0;
+}
+
+static int process_deepen_not(const char *line, struct string_list 
*deepen_not, int *deepen_rev_list)
+{
+   const char *arg;
+   if (skip_prefix(line, "deepen-not ", &arg)) {
+   char *ref = NULL;
+   struct object_id oid;
+   if (expand_ref(arg, strlen(arg), &oid, &ref) != 1)
+   die("git upload-pack: ambiguous deepen-not: %s", line);
+   string_list_append(deepen_not, ref);
+   free(ref);
+   *deepen_rev_list = 1;
+   return 1;
+   }
+   return 0;
+}
+
  static void receive_needs(void)
  {
 struct object_array shallows = OBJECT_ARRAY_INIT;
@@ -745,49 +814,15 @@ static void receive_needs(void)
 if (!line)
 break;

-   if (skip_prefix(line, "shallow ", &arg)) {
-   struct object_id oid;
-   struct object *object;
-   if (get_oid_hex(arg, &oid))
-   die("invalid shallow line: %s", line);
-   object = parse_object(&oid);
-   if (!object)
-   continue;
-   if (object->type != OBJ_COMMIT)
-   die("invalid shallow object %s", oid_to_hex(&oid));
-   if (!(object->flags & CLIENT_SHALLOW)) {
-   object->flags |= CLIENT_SHALLOW;
-   add_object_array(object, NULL, &shallows);
-   }
+   if (process_shallow(line, &shallows))
 

Re: [PATCH v2 08/27] connect: discover protocol version outside of get_remote_heads

2018-01-31 Thread Derrick Stolee

On 1/25/2018 6:58 PM, Brandon Williams wrote:

In order to prepare for the addition of protocol_v2 push the protocol
version discovery outside of 'get_remote_heads()'.  This will allow for
keeping the logic for processing the reference advertisement for
protocol_v1 and protocol_v0 separate from the logic for protocol_v2.

Signed-off-by: Brandon Williams 
---
  builtin/fetch-pack.c | 16 +++-
  builtin/send-pack.c  | 17 +++--
  connect.c| 27 ++-
  connect.h|  3 +++
  remote-curl.c| 20 ++--
  remote.h |  5 +++--
  transport.c  | 24 +++-
  7 files changed, 83 insertions(+), 29 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 366b9d13f..85d4faf76 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -4,6 +4,7 @@
  #include "remote.h"
  #include "connect.h"
  #include "sha1-array.h"
+#include "protocol.h"
  
  static const char fetch_pack_usage[] =

  "git fetch-pack [--all] [--stdin] [--quiet | -q] [--keep | -k] [--thin] "
@@ -52,6 +53,7 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
struct fetch_pack_args args;
struct oid_array shallow = OID_ARRAY_INIT;
struct string_list deepen_not = STRING_LIST_INIT_DUP;
+   struct packet_reader reader;
  
  	packet_trace_identity("fetch-pack");
  
@@ -193,7 +195,19 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)

if (!conn)
return args.diag_url ? 0 : 1;
}
-   get_remote_heads(fd[0], NULL, 0, , 0, NULL, );
+
+   packet_reader_init(, fd[0], NULL, 0,
+  PACKET_READ_CHOMP_NEWLINE |
+  PACKET_READ_GENTLE_ON_EOF);
+
+   switch (discover_version()) {
+   case protocol_v1:
+   case protocol_v0:
+   get_remote_heads(, , 0, NULL, );
+   break;
+   case protocol_unknown_version:
+   BUG("unknown protocol version");


Is this really a BUG in the client, or a bug/incompatibility in the server?

Perhaps I'm misunderstanding, but it looks like discover_version() will 
die() on an unknown version (the die() is in 
protocol.c:determine_protocol_version_client()). So maybe that's why 
this is a BUG()?


If there is something to change here, this BUG() appears three more times.


+   }
  
  	ref = fetch_pack(, fd, conn, ref, dest, sought, nr_sought,

 , pack_lockfile_ptr);
diff --git a/builtin/send-pack.c b/builtin/send-pack.c
index fc4f0bb5f..83cb125a6 100644
--- a/builtin/send-pack.c
+++ b/builtin/send-pack.c
@@ -14,6 +14,7 @@
  #include "sha1-array.h"
  #include "gpg-interface.h"
  #include "gettext.h"
+#include "protocol.h"
  
  static const char * const send_pack_usage[] = {

N_("git send-pack [--all | --mirror] [--dry-run] [--force] "
@@ -154,6 +155,7 @@ int cmd_send_pack(int argc, const char **argv, const char 
*prefix)
int progress = -1;
int from_stdin = 0;
struct push_cas_option cas = {0};
+   struct packet_reader reader;
  
  	struct option options[] = {

OPT__VERBOSITY(),
@@ -256,8 +258,19 @@ int cmd_send_pack(int argc, const char **argv, const char 
*prefix)
args.verbose ? CONNECT_VERBOSE : 0);
}
  
-	get_remote_heads(fd[0], NULL, 0, _refs, REF_NORMAL,

-_have, );
+   packet_reader_init(, fd[0], NULL, 0,
+  PACKET_READ_CHOMP_NEWLINE |
+  PACKET_READ_GENTLE_ON_EOF);
+
+   switch (discover_version()) {
+   case protocol_v1:
+   case protocol_v0:
+   get_remote_heads(, _refs, REF_NORMAL,
+_have, );
+   break;
+   case protocol_unknown_version:
+   BUG("unknown protocol version");
+   }
  
  	transport_verify_remote_names(nr_refspecs, refspecs);
  
diff --git a/connect.c b/connect.c

index 00e90075c..db3c9d24c 100644
--- a/connect.c
+++ b/connect.c
@@ -62,7 +62,7 @@ static void die_initial_contact(int unexpected)
  "and the repository exists."));
  }
  
-static enum protocol_version discover_version(struct packet_reader *reader)

+enum protocol_version discover_version(struct packet_reader *reader)
  {
enum protocol_version version = protocol_unknown_version;
  
@@ -234,7 +234,7 @@ enum get_remote_heads_state {

  /*
   * Read all the refs from the other end
   */
-struct ref **get_remote_heads(int in, char *src_buf, size_t src_len,
+struct ref **get_remote_heads(struct packet_reader *reader,
  struct ref **list, unsigned int flags,
  struct oid_array *extra_have,
  struct oid_array *shallow_points)
@@ -242,24 +242,17 @@ struct ref **get_remote_heads(int in, char *src_buf, 
size_t 

Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write

2018-02-02 Thread Derrick Stolee

On 2/2/2018 5:48 PM, Junio C Hamano wrote:

Stefan Beller  writes:


It is true for git-submodule and a few others (the minority of commands IIRC)
git-tag for example takes subcommands such as --list or --verify.
https://public-inbox.org/git/xmqqiomodkt9@gitster.dls.corp.google.com/


Thanks.  It refers to an article at gmane, which is not readily
accessible unless you use newsreader.  The original discussion it
refers to appears at:

 https://public-inbox.org/git/7vbo5itjfl@alter.siamese.dyndns.org/

for those who are interested.


Thanks for the links.


I am still not sure if it is a good design to add a new command like
this series does, though.  I would naively have expected that this
would be a new pack index format that is produced by pack-objects
and index-pack, for example, in which case its maintenance would
almost be invisible to end users (i.e. just like how the pack bitmap
feature was added to the system).


I agree that the medium-term goal is to have this happen without user 
intervention. Something like a "core.autoCommitGraph" setting to trigger 
commit-graph writes during other cleanup activities, such as a repack or 
a gc.


I don't think pairing this with pack-objects or index-pack is a good 
direction, because the commit graph is not locked into a packfile the 
way the bitmap is. In fact, the entire ODB could be replaced 
independently and the graph is still valid (the commits in the graph may 
no longer have their paired commits in the ODB due to a GC; you should 
never navigate to those commits without having a ref pointing to them, 
so this is not immediately a problem).


This sort of interaction with GC is one reason why I did not include the 
automatic updates in this patch. The integration with existing 
maintenance tasks will be worth discussion in its own right. I'd rather 
demonstrate the value of having a graph (even if it is currently 
maintained manually) and then follow up with a focus to integrate with 
repack, gc, etc.


I plan to clean up this patch on Monday given the feedback I received 
the last two days (Thanks Jonathan and Szeder!). However, if the current 
builtin design will block merging, then I'll wait until we can find one 
that works.


Thanks,
-Stolee


Re: [PATCH 0/2] Refactor hash search with fanout table

2018-02-02 Thread Derrick Stolee

On 2/2/2018 6:30 PM, Junio C Hamano wrote:

Jonathan Tan <jonathanta...@google.com> writes:


After reviewing Derrick's Serialized Git Commit Graph patches [1], I
noticed that "[PATCH v2 11/14] commit: integrate commit graph with
commit parsing" contains (in bsearch_graph) a repeat of some packfile
functionality. Here is a pack that refactors that functionality out.


Yay.  I had exactly the same reaction to that part of the series.



Thanks for doing this refactor. I'm a fan of reducing code clones, but 
also don't want to break well-worn code paths.


Jonathan: While you are doing this, I'm guessing you could use your new 
method to replace (and maybe speed up) the binary search in 
sha1_name.c:find_abbrev_len_for_pack(). Otherwise, I can take a stab at 
it next week.
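
For readers following along, a minimal sketch of the fanout idea under discussion: a 256-entry table, indexed by the first hash byte, narrows the binary search to a small slice of the sorted hash list. This is not Jonathan's patch. The function names are hypothetical, the fanout values are kept in host byte order for simplicity, and a 20-byte hash is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20

/*
 * Build a fanout table for nr sorted hashes stored back to back in table:
 * fanout[b] is the number of hashes whose first byte is <= b.
 */
static void build_fanout_sketch(const unsigned char *table, uint32_t nr,
				uint32_t *fanout)
{
	uint32_t i = 0;
	int b;

	for (b = 0; b < 256; b++) {
		while (i < nr && table[(size_t)i * HASH_LEN] <= b)
			i++;
		fanout[b] = i;
	}
}

/*
 * Binary search restricted to the slice selected by the first byte of hash.
 * Returns 1 and stores the position in *result when found, 0 otherwise.
 */
static int bsearch_hash_sketch(const unsigned char *hash,
			       const uint32_t *fanout,
			       const unsigned char *table,
			       uint32_t *result)
{
	uint32_t lo, hi;

	assert(hash && fanout && table && result);
	lo = hash[0] ? fanout[hash[0] - 1] : 0;
	hi = fanout[hash[0]];

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(table + (size_t)mid * HASH_LEN, hash, HASH_LEN);

		if (!cmp) {
			*result = mid;
			return 1;
		}
		if (cmp > 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}
```

With evenly distributed hashes the slice holds roughly 1/256 of the entries, which saves about eight comparison steps per lookup.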


Please add
Reviewed-by: Derrick Stolee <dsto...@microsoft.com>

Thanks,
-Stolee


Re: [PATCH v2 12/27] serve: introduce git-serve

2018-01-31 Thread Derrick Stolee

On 1/25/2018 6:58 PM, Brandon Williams wrote:

Introduce git-serve, the base server for protocol version 2.

Protocol version 2 is intended to be a replacement for Git's current
wire protocol.  The intention is that it will be a simpler, less
wasteful protocol which can evolve over time.

Protocol version 2 improves upon version 1 by eliminating the initial
ref advertisement.  In its place a server will export a list of
capabilities and commands which it supports in a capability
advertisement.  A client can then request that a particular command be
executed by providing a number of capabilities and command specific
parameters.  At the completion of a command, a client can request that
another command be executed or can terminate the connection by sending a
flush packet.
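
As an aside for readers new to the wire format, a small sketch of how a request line and the terminating flush packet are framed as pkt-lines: four hex digits give the total length including the length field itself, and '0000' is the flush packet. The function name and the example payload are illustrative only; the size limit is the pkt-line maximum as I understand it from the existing protocol documentation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PKT_FLUSH "0000"	/* flush-pkt: end of a message */
#define PKT_DELIM "0001"	/* delim-pkt: separates sections (v2 only) */
#define PKT_MAX_TOTAL 65520	/* assumed maximum total pkt-line length */

/*
 * Frame payload as a pkt-line into out. The 4-digit hex length counts
 * the payload plus the 4 length bytes.
 * Returns the number of bytes written (excluding the NUL), or -1 if
 * the payload does not fit.
 */
static int format_pkt_line_sketch(char *out, size_t out_size, const char *payload)
{
	size_t len;

	assert(out && payload);
	len = strlen(payload) + 4;
	if (len > PKT_MAX_TOTAL || out_size < len + 1)
		return -1;
	snprintf(out, out_size, "%04x%s", (unsigned int)len, payload);
	return (int)len;
}
```

A client following the description above would send one or more such lines and then write PKT_FLUSH to end the request.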

Signed-off-by: Brandon Williams 
---
  .gitignore  |   1 +
  Documentation/technical/protocol-v2.txt | 117 +++
  Makefile|   2 +
  builtin.h   |   1 +
  builtin/serve.c |  30 
  git.c   |   1 +
  serve.c | 249 
  serve.h |  15 ++
  t/t5701-git-serve.sh|  56 +++
  9 files changed, 472 insertions(+)
  create mode 100644 Documentation/technical/protocol-v2.txt
  create mode 100644 builtin/serve.c
  create mode 100644 serve.c
  create mode 100644 serve.h
  create mode 100755 t/t5701-git-serve.sh

diff --git a/.gitignore b/.gitignore
index 833ef3b0b..2d0450c26 100644
--- a/.gitignore
+++ b/.gitignore
@@ -140,6 +140,7 @@
  /git-rm
  /git-send-email
  /git-send-pack
+/git-serve
  /git-sh-i18n
  /git-sh-i18n--envsubst
  /git-sh-setup
diff --git a/Documentation/technical/protocol-v2.txt 
b/Documentation/technical/protocol-v2.txt
new file mode 100644
index 0..7f619a76c
--- /dev/null
+++ b/Documentation/technical/protocol-v2.txt
@@ -0,0 +1,117 @@
+ Git Wire Protocol, Version 2
+==
+
+This document presents a specification for a version 2 of Git's wire
+protocol.  Protocol v2 will improve upon v1 in the following ways:
+
+  * Instead of multiple service names, multiple commands will be
+supported by a single service.


As someone unfamiliar with the old protocol code, this statement is 
underselling the architectural significance of your change. The new 
model allows a single service to handle all different wire protocols 
(git://, ssh://, https://) while being agnostic to the command-specific 
logic. It also hides the protocol negotiation away from these consumers.


The ease with which you are adding new commands in later commits really 
demonstrates the value of this patch. To make that point here, you would 
almost need to document the old model to show how it was difficult to 
use and extend. Perhaps this document will not need expanding since the 
code speaks for itself.


I just wanted to state for the record that the new architecture is a big 
improvement and will make more commands much easier to implement.



+  * Easily extendable as capabilities are moved into their own section
+of the protocol, no longer being hidden behind a NUL byte and
+limited by the size of a pkt-line (as there will be a single
+capability per pkt-line).
+  * Separate out other information hidden behind NUL bytes (e.g. agent
+string as a capability and symrefs can be requested using 'ls-refs')
+  * Reference advertisement will be omitted unless explicitly requested
+  * ls-refs command to explicitly request some refs
+


nit: some bullets have full stops (.) and others do not.


+ Detailed Design
+=
+
+A client can request to speak protocol v2 by sending `version=2` in the
+side-channel `GIT_PROTOCOL` in the initial request to the server.
+
+In protocol v2 communication is command oriented.  When first contacting a
+server a list of capabilities will be advertised.  Some of these capabilities
+will be commands which a client can request be executed.  Once a command
+has completed, a client can reuse the connection and request that other
+commands be executed.
+
+ Special Packets
+-
+
+In protocol v2 these special packets will have the following semantics:
+
+  * '0000' Flush Packet (flush-pkt) - indicates the end of a message
+  * '0001' Delimiter Packet (delim-pkt) - separates sections of a message
+
+ Capability Advertisement
+--
+
+A server which decides to communicate (based on a request from a client)
+using protocol version 2, notifies the client by sending a version string
+in its initial response followed by an advertisement of its capabilities.
+Each capability is a key with an optional value.  Clients must ignore all
+unknown keys.  Semantics of unknown values are left to the definition of
+each key.  Some capabilities will describe commands which can be 
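
(The quoted document is cut off at this point in the archive.) The "key with an optional value" rule above can be sketched as follows, in the same spirit as server_supports_v2() from patch 14. The function name is hypothetical; it simply tests whether one advertised line is the given key, either bare or followed by '=' and a value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if advertised is exactly key or has the form key=value.
 * On a match, *value points at the value, or is NULL for a bare key.
 * Lines that do not match any key the caller knows about are simply
 * skipped by the caller ("clients must ignore all unknown keys").
 */
static int capability_matches_sketch(const char *advertised, const char *key,
				     const char **value)
{
	size_t key_len;

	assert(advertised && key && value);
	key_len = strlen(key);
	*value = NULL;

	if (strncmp(advertised, key, key_len))
		return 0;
	if (advertised[key_len] == '\0')
		return 1;
	if (advertised[key_len] == '=') {
		*value = advertised + key_len + 1;
		return 1;
	}
	return 0;	/* e.g. "ls-refs-extra" is not "ls-refs" */
}
```

The check on the character after the key is what keeps a longer capability name from being mistaken for a shorter one that happens to be its prefix.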

Re: [PATCH v2 00/27] protocol version 2

2018-01-31 Thread Derrick Stolee
Sorry for chiming in with mostly nitpicks so late since sending this 
version. Mostly, I tried to read it to see if I could understand the 
scope of the patch and how this code worked before. It looks very 
polished, so the nits were the best I could do.


On 1/25/2018 6:58 PM, Brandon Williams wrote:

Changes in v2:
  * Added documentation for fetch
  * changed #defines for state variables to be enums
  * couple code changes to pkt-line functions and documentation
  * Added unit tests for the git-serve binary as well as for ls-refs


I'm a fan of more unit-level testing, and I think that will be more 
important as we go on with these multiple configuration options.



Areas for improvement
  * Push isn't implemented, right now this is ok because if v2 is requested the
server can just default to v0.  Before this can be merged we may want to
change how the client requests a new protocol, and not allow for sending
"version=2" when pushing even though the user has it configured.  Or maybe
it's fine to just have an older client who doesn't understand how to push
(and request v2) to die if the server tries to speak v2 at it.

Fixing this essentially would just require piping through a bit more
information to the function which ultimately runs connect (for both builtins
and remote-curl)


Definitely save push for a later patch. Getting 'fetch' online did 
require 'ls-refs' at the same time. Future reviews will be easier when 
adding one command at a time.




  * I want to make sure that the docs are well written before this gets merged
so I'm hoping that someone can do a thorough review of the docs themselves to
make sure they are clear.


I made a comment in the docs about the architectural changes. While I 
think a discussion on that topic would be valuable, I'm not sure that's 
the point of the document (i.e. documenting what v2 does versus selling 
the value of the patch). I thought the docs were clear for how the 
commands work.



  * Right now there is a capability 'stateless-rpc' which essentially makes sure
that a server command completes after a single round (this is to make sure
http works cleanly).  After talking with some folks it may make more sense
to just have v2 be stateless in nature so that all commands terminate after
a single round trip.  This makes things a bit easier if a server wants to
have ssh just be a proxy for http.

One potential thing would be to flip this so that by default the protocol is
stateless and if a server/command has a stateful mode that can be
implemented as a capability at a later point.  Thoughts?


At minimum, all commands should be designed with a "stateless first" 
philosophy since a large number of users communicate via HTTP[S] and any 
decisions that make stateless communication painful should be rejected.



  * Shallow repositories and shallow clones aren't supported yet.  I'm working
on it and it can be either added to v2 by default if people think it needs
to be in there from the start, or we can add it as a capability at a later
point.


I'm happy to say the following:

1. Shallow repositories should not be used for servers, since they 
cannot service all requests.


2. Since v2 has easy capability features, I'm happy to leave shallow for 
later. We will want to verify that a shallow clone command reverts to v1.



I fetched bw/protocol-v2 with tip 13c70148, built, set 
'protocol.version=2' in the config, and tested fetches against GitHub 
and VSTS just as a compatibility test. Everything worked just fine.


Is there an easy way to test the existing test suite for clone and fetch 
using protocol v2 to make sure there are no regressions with 
protocol.version=2 in the config?


Thanks,
-Stolee


Re: [PATCH v2 14/27] connect: request remote refs using v2

2018-01-31 Thread Derrick Stolee

On 1/25/2018 6:58 PM, Brandon Williams wrote:

Teach the client to be able to request a remote's refs using protocol
v2.  This is done by having a client issue a 'ls-refs' request to a v2
server.

Signed-off-by: Brandon Williams 
---
  builtin/upload-pack.c  |  10 ++--
  connect.c  | 123 -
  remote.h   |   4 ++
  t/t5702-protocol-v2.sh |  28 +++
  transport.c|   2 +-
  5 files changed, 160 insertions(+), 7 deletions(-)
  create mode 100755 t/t5702-protocol-v2.sh

diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c
index 8d53e9794..a757df8da 100644
--- a/builtin/upload-pack.c
+++ b/builtin/upload-pack.c
@@ -5,6 +5,7 @@
  #include "parse-options.h"
  #include "protocol.h"
  #include "upload-pack.h"
+#include "serve.h"
  
  static const char * const upload_pack_usage[] = {

N_("git upload-pack [] "),
@@ -16,6 +17,7 @@ int cmd_upload_pack(int argc, const char **argv, const char 
*prefix)
const char *dir;
int strict = 0;
struct upload_pack_options opts = { 0 };
+   struct serve_options serve_opts = SERVE_OPTIONS_INIT;
struct option options[] = {
OPT_BOOL(0, "stateless-rpc", _rpc,
 N_("quit after a single request/response exchange")),
@@ -48,11 +50,9 @@ int cmd_upload_pack(int argc, const char **argv, const char 
*prefix)
  
  	switch (determine_protocol_version_server()) {

case protocol_v2:
-   /*
-* fetch support for protocol v2 has not been implemented yet,
-* so ignore the request to use v2 and fallback to using v0.
-*/
-   upload_pack();
+   serve_opts.advertise_capabilities = opts.advertise_refs;
+   serve_opts.stateless_rpc = opts.stateless_rpc;
+   serve(_opts);
break;
case protocol_v1:
/*
diff --git a/connect.c b/connect.c
index f2157a821..3c653b65b 100644
--- a/connect.c
+++ b/connect.c
@@ -12,9 +12,11 @@
  #include "sha1-array.h"
  #include "transport.h"
  #include "strbuf.h"
+#include "version.h"
  #include "protocol.h"
  
  static char *server_capabilities;

+static struct argv_array server_capabilities_v2 = ARGV_ARRAY_INIT;
  static const char *parse_feature_value(const char *, const char *, int *);
  
  static int check_ref(const char *name, unsigned int flags)

@@ -62,6 +64,33 @@ static void die_initial_contact(int unexpected)
  "and the repository exists."));
  }
  
+/* Checks if the server supports the capability 'c' */

+static int server_supports_v2(const char *c, int die_on_error)
+{
+   int i;
+
+   for (i = 0; i < server_capabilities_v2.argc; i++) {
+   const char *out;
+   if (skip_prefix(server_capabilities_v2.argv[i], c, &out) &&
+   (!*out || *out == '='))
+   return 1;
+   }
+
+   if (die_on_error)
+   die("server doesn't support '%s'", c);
+
+   return 0;
+}
+
+static void process_capabilities_v2(struct packet_reader *reader)
+{
+   while (packet_reader_read(reader) == PACKET_READ_NORMAL)
+   argv_array_push(&server_capabilities_v2, reader->line);
+
+   if (reader->status != PACKET_READ_FLUSH)
+   die("protocol error");
+}
+
  enum protocol_version discover_version(struct packet_reader *reader)
  {
enum protocol_version version = protocol_unknown_version;
@@ -85,7 +114,7 @@ enum protocol_version discover_version(struct packet_reader 
*reader)
/* Maybe process capabilities here, at least for v2 */
switch (version) {
case protocol_v2:
-   die("support for protocol v2 not implemented yet");
+   process_capabilities_v2(reader);
break;
case protocol_v1:
/* Read the peeked version line */
@@ -293,6 +322,98 @@ struct ref **get_remote_heads(struct packet_reader *reader,
return list;
  }
  
+static int process_ref_v2(const char *line, struct ref ***list)

+{
+   int ret = 1;
+   int i = 0;


nit: you set 'i' here, but first use it in a for loop with blank 
initializer. Perhaps keep the first assignment closer to the first use?



+   struct object_id old_oid;
+   struct ref *ref;
+   struct string_list line_sections = STRING_LIST_INIT_DUP;
+
+   if (string_list_split(&line_sections, line, ' ', -1) < 2) {
+   ret = 0;
+   goto out;
+   }
+
+   if (get_oid_hex(line_sections.items[i++].string, &old_oid)) {
+   ret = 0;
+   goto out;
+   }
+
+   ref = alloc_ref(line_sections.items[i++].string);
+
+   oidcpy(&ref->old_oid, &old_oid);
+   **list = ref;
+   *list = &ref->next;
+
+   for (; i < line_sections.nr; i++) {
+   const char *arg = line_sections.items[i].string;
+   if (skip_prefix(arg, 

Re: [PATCH v2 14/27] connect: request remote refs using v2

2018-01-31 Thread Derrick Stolee



On 1/31/2018 3:10 PM, Eric Sunshine wrote:

On Wed, Jan 31, 2018 at 10:22 AM, Derrick Stolee <sto...@gmail.com> wrote:

On 1/25/2018 6:58 PM, Brandon Williams wrote:

  +static int process_ref_v2(const char *line, struct ref ***list)
+{
+   int ret = 1;
+   int i = 0;

nit: you set 'i' here, but first use it in a for loop with blank
initializer. Perhaps keep the first assignment closer to the first use?

Hmm, I see 'i' being incremented a couple times before the loop...


+   if (string_list_split(&line_sections, line, ' ', -1) < 2) {
+   ret = 0;
+   goto out;
+   }
+
+   if (get_oid_hex(line_sections.items[i++].string, &old_oid)) {

here...


+   ret = 0;
+   goto out;
+   }
+
+   ref = alloc_ref(line_sections.items[i++].string);

and here...


+
+   oidcpy(&ref->old_oid, &old_oid);
+   **list = ref;
+   *list = &ref->next;
+
+   for (; i < line_sections.nr; i++) {

then it is used in the loop.


+   const char *arg = line_sections.items[i].string;
+   if (skip_prefix(arg, "symref-target:", &arg))
+   ref->symref = xstrdup(arg);


Thanks! Sorry I missed this.
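
For anyone else who trips over the same thing, here is a stripped-down sketch of the indexing pattern Eric points out: 'i' starts at 0, is advanced once for the object id and once for the ref name, and only then drives the loop over the remaining attributes. The struct and function names are hypothetical. strtok() stands in for string_list_split(), so runs of spaces are collapsed here, which the real code may not do, and a 40-character lowercase hex id is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ref_line_sketch {
	char oid_hex[41];
	char refname[128];
	char symref_target[128];	/* empty string when absent */
};

/* Parses "<oid> <refname> [<attr>...]". Returns 1 on success, 0 on a malformed line. */
static int parse_ref_line_sketch(const char *line, struct ref_line_sketch *out)
{
	static const char symref_prefix[] = "symref-target:";
	char copy[512];
	char *fields[8];
	char *tok;
	size_t nr = 0;
	size_t i = 0;

	assert(line && out);
	if (strlen(line) >= sizeof(copy))
		return 0;
	strcpy(copy, line);
	for (tok = strtok(copy, " "); tok && nr < 8; tok = strtok(NULL, " "))
		fields[nr++] = tok;
	if (nr < 2)
		return 0;

	/* First field: object id (i becomes 1). */
	if (strlen(fields[i]) != 40 ||
	    strspn(fields[i], "0123456789abcdef") != 40)
		return 0;
	strcpy(out->oid_hex, fields[i++]);

	/* Second field: ref name (i becomes 2). */
	if (strlen(fields[i]) >= sizeof(out->refname))
		return 0;
	strcpy(out->refname, fields[i++]);

	/* Remaining fields: optional attributes, starting at the current i. */
	out->symref_target[0] = '\0';
	for (; i < nr; i++) {
		size_t plen = strlen(symref_prefix);

		if (!strncmp(fields[i], symref_prefix, plen) &&
		    strlen(fields[i] + plen) < sizeof(out->symref_target))
			strcpy(out->symref_target, fields[i] + plen);
	}
	return 1;
}
```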

-Stolee


Re: [PATCH v2 10/27] protocol: introduce enum protocol_version value protocol_v2

2018-02-05 Thread Derrick Stolee

On 2/2/2018 5:44 PM, Brandon Williams wrote:

On 01/31, Derrick Stolee wrote:

On 1/25/2018 6:58 PM, Brandon Williams wrote:

Introduce protocol_v2, a new value for 'enum protocol_version'.
Subsequent patches will fill in the implementation of protocol_v2.

Signed-off-by: Brandon Williams <bmw...@google.com>
---
   builtin/fetch-pack.c   | 3 +++
   builtin/receive-pack.c | 6 ++
   builtin/send-pack.c| 3 +++
   builtin/upload-pack.c  | 7 +++
   connect.c  | 3 +++
   protocol.c | 2 ++
   protocol.h | 1 +
   remote-curl.c  | 3 +++
   transport.c| 9 +
   9 files changed, 37 insertions(+)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 85d4faf76..f492e8abd 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -201,6 +201,9 @@ int cmd_fetch_pack(int argc, const char **argv, const char 
*prefix)
   PACKET_READ_GENTLE_ON_EOF);
switch (discover_version()) {
+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
case protocol_v0:
get_remote_heads(, , 0, NULL, );
diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index b7ce7c7f5..3656e94fd 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -1963,6 +1963,12 @@ int cmd_receive_pack(int argc, const char **argv, const 
char *prefix)
unpack_limit = receive_unpack_limit;
switch (determine_protocol_version_server()) {
+   case protocol_v2:
+   /*
+* push support for protocol v2 has not been implemented yet,
+* so ignore the request to use v2 and fallback to using v0.
+*/
+   break;
case protocol_v1:
/*
 * v1 is just the original protocol with a version string,
diff --git a/builtin/send-pack.c b/builtin/send-pack.c
index 83cb125a6..b5427f75e 100644
--- a/builtin/send-pack.c
+++ b/builtin/send-pack.c
@@ -263,6 +263,9 @@ int cmd_send_pack(int argc, const char **argv, const char 
*prefix)
   PACKET_READ_GENTLE_ON_EOF);
switch (discover_version()) {
+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
case protocol_v0:
get_remote_heads(, _refs, REF_NORMAL,
diff --git a/builtin/upload-pack.c b/builtin/upload-pack.c
index 2cb5cb35b..8d53e9794 100644
--- a/builtin/upload-pack.c
+++ b/builtin/upload-pack.c
@@ -47,6 +47,13 @@ int cmd_upload_pack(int argc, const char **argv, const char 
*prefix)
die("'%s' does not appear to be a git repository", dir);
switch (determine_protocol_version_server()) {
+   case protocol_v2:
+   /*
+* fetch support for protocol v2 has not been implemented yet,
+* so ignore the request to use v2 and fallback to using v0.
+*/
+   upload_pack();
+   break;
case protocol_v1:
/*
 * v1 is just the original protocol with a version string,
diff --git a/connect.c b/connect.c
index db3c9d24c..f2157a821 100644
--- a/connect.c
+++ b/connect.c
@@ -84,6 +84,9 @@ enum protocol_version discover_version(struct packet_reader 
*reader)
/* Maybe process capabilities here, at least for v2 */
switch (version) {
+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol_v1:
/* Read the peeked version line */
packet_reader_read(reader);
diff --git a/protocol.c b/protocol.c
index 43012b7eb..5e636785d 100644
--- a/protocol.c
+++ b/protocol.c
@@ -8,6 +8,8 @@ static enum protocol_version parse_protocol_version(const char 
*value)
return protocol_v0;
else if (!strcmp(value, "1"))
return protocol_v1;
+   else if (!strcmp(value, "2"))
+   return protocol_v2;
else
return protocol_unknown_version;
   }
diff --git a/protocol.h b/protocol.h
index 1b2bc94a8..2ad35e433 100644
--- a/protocol.h
+++ b/protocol.h
@@ -5,6 +5,7 @@ enum protocol_version {
protocol_unknown_version = -1,
protocol_v0 = 0,
protocol_v1 = 1,
+   protocol_v2 = 2,
   };
   /*
diff --git a/remote-curl.c b/remote-curl.c
index 9f6d07683..dae8a4a48 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -185,6 +185,9 @@ static struct ref *parse_git_refs(struct discovery *heads, 
int for_push)
   PACKET_READ_GENTLE_ON_EOF);
switch (discover_version()) {
+   case protocol_v2:
+   die("support for protocol v2 not implemented yet");
+   break;
case protocol

Re: [PATCH v2 04/14] commit-graph: implement construct_commit_graph()

2018-02-05 Thread Derrick Stolee

On 2/2/2018 10:32 AM, SZEDER Gábor wrote:

Teach Git to write a commit graph file by checking all packed objects
to see if they are commits, then store the file in the given pack
directory.

I'm afraid that scanning all packed objects is a bit of a roundabout
way to approach this.

In my git repo, with 9 pack files at the moment, i.e. not that big a
repo and not that many pack files:

   $ time ./git commit-graph --write --update-head
   4df41a3d1cc408b7ad34bea87b51ec4ccbf4b803

   real0m27.550s
   user0m27.113s
   sys 0m0.376s

In comparison, performing a good old revision walk to gather all the
info that is written into the graph file:

   $ time git log --all --topo-order --format='%H %T %P %cd' |wc -l
   52954

   real0m0.903s
   user0m0.972s
   sys 0m0.058s


Two reasons this is in here:

(1) It's easier to get the write implemented this way and add the 
reachable closure later (which I do).


(2) For GVFS, we want to add all commits that arrived in a "prefetch 
pack" to the graph even if we do not have a ref that points to the 
commit yet. We expect many commits to become reachable soon and having 
them in the graph saves a lot of time in merge-base calculations.


So, (1) is for patch simplicity, and (2) is why I want it to be an 
option in the final version. See the --stdin-packs argument later for a 
way to do this incrementally.


I expect almost all users to use the reachable closure method with 
--stdin-commits (and that's how I will integrate automatic updates with 
'fetch', 'repack', and 'gc' in a later patch).
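
To illustrate what I mean by the reachable closure: starting from a set of commits, follow parent links until nothing new is found, and write every visited commit to the graph. A toy sketch, assuming commits are identified by array index and have at most two parents; the names are hypothetical and this is not the code from the series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy commit: parents are indexes into the same array. */
struct commit_sketch {
	int nr_parents;
	int parents[2];
};

/*
 * Marks every commit reachable from starts[] in seen[] (one byte per
 * commit) and returns how many were marked, or -1 on allocation failure.
 */
static int reachable_closure_sketch(const struct commit_sketch *commits, int nr,
				    const int *starts, int nr_starts,
				    unsigned char *seen)
{
	int *stack;
	int top = 0, count = 0, i;

	assert(commits && starts && seen && nr > 0);
	stack = malloc(sizeof(*stack) * (size_t)nr);
	if (!stack)
		return -1;
	memset(seen, 0, (size_t)nr);

	for (i = 0; i < nr_starts; i++) {
		assert(starts[i] >= 0 && starts[i] < nr);
		if (!seen[starts[i]]) {
			seen[starts[i]] = 1;
			stack[top++] = starts[i];
		}
	}

	/* Each commit is pushed at most once, so the stack never exceeds nr. */
	while (top > 0) {
		const struct commit_sketch *c = &commits[stack[--top]];

		count++;
		for (i = 0; i < c->nr_parents; i++) {
			int p = c->parents[i];

			if (!seen[p]) {
				seen[p] = 1;
				stack[top++] = p;
			}
		}
	}

	free(stack);
	return count;
}
```

The pack-scanning mode differs in that it needs no starting points at all, which is why it also picks up commits that no ref points to yet.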





+char* get_commit_graph_filename_hash(const char *pack_dir,
+struct object_id *hash)
+{
+   size_t len;
+   struct strbuf head_path = STRBUF_INIT;
+   strbuf_addstr(&head_path, pack_dir);
+   strbuf_addstr(&head_path, "/graph-");
+   strbuf_addstr(&head_path, oid_to_hex(hash));
+   strbuf_addstr(&head_path, ".graph");

Nit: this is assembling the path of a graph file, not that of a
graph-head, so the strbuf should be renamed accordingly.


+
+   return strbuf_detach(&head_path, &len);
+}




Re: [PATCH v2 05/14] commit-graph: implement git-commit-graph --write

2018-02-05 Thread Derrick Stolee

On 2/1/2018 6:48 PM, SZEDER Gábor wrote:

Teach git-commit-graph to write graph files. Create new test script to verify
this command succeeds without failure.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  Documentation/git-commit-graph.txt | 18 +++
  builtin/commit-graph.c | 30 
  t/t5318-commit-graph.sh| 96 ++
  3 files changed, 144 insertions(+)
  create mode 100755 t/t5318-commit-graph.sh

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index c8ea548dfb..3f3790d9a8 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -5,3 +5,21 @@ NAME
  
  git-commit-graph - Write and verify Git commit graphs (.graph files)
  
+

+SYNOPSIS
+
+[verse]
+'git commit-graph' --write  [--pack-dir ]
+

What do these options do and what is the command's output?  IOW, an
'OPTIONS' section would be nice.


+EXAMPLES
+
+
+* Write a commit graph file for the packed commits in your local .git folder.
++
+
+$ git commit-graph --write
+
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
new file mode 100755
index 00..6bcd1cc264
--- /dev/null
+++ b/t/t5318-commit-graph.sh
@@ -0,0 +1,96 @@
+#!/bin/sh
+
+test_description='commit graph'
+. ./test-lib.sh
+
+test_expect_success 'setup full repo' \
+'rm -rf .git &&
+ mkdir full &&
+ cd full &&
+ git init &&
+ git config core.commitgraph true &&
+ git config pack.threads 1 &&

Does this pack.threads=1 make a difference?


+ packdir=".git/objects/pack"'

We tend to put single quotes around tests like this:

   test_expect_success 'setup full repo' '
 do-this &&
 check-that
   '

This is not a mere style nit: those newlines before and after the test
block make the test's output with '--verbose-log' slightly more
readable.

Furthermore, we prefer tabs for indentation.


Oops! My bad for using t5302-pack-index.sh as my model for creating test 
scripts. It's pretty old, but I do see some of the newer tests using 
this newer style.



Finally, 'cd'-ing around such that it affects subsequent tests is
usually frowned upon.  However, in this particular case (going into
one repo, doing a bunch of tests there, then going into another repo,
and doing another bunch of tests) I think it's better than changing
directory in a subshell in every single test.


+
+test_expect_success 'write graph with no packs' \
+'git commit-graph --write --pack-dir .'
+
+test_expect_success 'create commits and repack' \
+'for i in $(test_seq 5)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git repack'
+
+test_expect_success 'write graph' \
+'graph1=$(git commit-graph --write) &&
+ test_path_is_file ${packdir}/graph-${graph1}.graph'

Style nit:  those {} around the variable names are unnecessary, but I
see you use them a lot.


+
+t_expect_success 'Add more commits' \

This must be test_expect_success.


+'git reset --hard commits/3 &&
+ for i in $(test_seq 6 10)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git reset --hard commits/3 &&
+ for i in $(test_seq 11 15)
+ do
+echo $i >$i.txt &&
+git add $i.txt &&
+git commit -m "commit $i" &&
+git branch commits/$i
+ done &&
+ git reset --hard commits/7 &&
+ git merge commits/11 &&
+ git branch merge/1 &&
+ git reset --hard commits/8 &&
+ git merge commits/12 &&
+ git branch merge/2 &&
+ git reset --hard commits/5 &&
+ git merge commits/10 commits/15 &&
+ git branch merge/3 &&
+ git repack'
+
+# Current graph structure:
+#
+#  M3
+# / |\_
+#/ 10  15
+#   /   |  |
+#  /9 M2   14
+# | |/  \  |
+# | 8 M1 | 13
+# | |/ | \_|
+# 5 7  |   12
+# | |   \__|
+# 4 6  11
+# |/__/
+# 3
+# |
+# 2
+# |
+# 1
+
+test_expect_success 'write graph with merges' \
+'graph2=$(git commit-graph --write) &&
+ test_path_is_file ${packdir}/graph-${graph2}.graph'
+
+test_expect_success 'setup bare repo' \
+'cd .. &&
+ git clone --bare full bare &&
+ cd bare &&
+ git config core.graph true &&
+ git config pack.threads 1 &&
+ baredir="objects/pack"'
+
+test_expect_success 'write graph in bare repo' \
+'graphbare=$(git commit-graph --write) &&
+ test_path_is_file ${baredir}/graph-${graphbare}.graph'
+
+test_done
--
2.16.0.15.g9c3cf44.dirty






[PATCH v3 01/14] commit-graph: add format document

2018-02-08 Thread Derrick Stolee
Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.
* Optional extensions.

Basic header information is followed by a binary table of contents
into "chunks" that include:

* An ordered list of commit object IDs.
* A 256-entry fanout into that list of OIDs.
* A list of metadata for the commits.
* A list of "large edges" to enable octopus merges.

The format automatically includes two parent positions for every
commit. This favors speed over space, since using only one position
per commit would cause an extra level of indirection for every merge
commit. (Octopus merges suffer from this indirection, but they are
very rare.)

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph-format.txt | 91 +
 1 file changed, 91 insertions(+)
 create mode 100644 Documentation/technical/commit-graph-format.txt

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt
new file mode 100644
index 00..349fa0c14c
--- /dev/null
+++ b/Documentation/technical/commit-graph-format.txt
@@ -0,0 +1,91 @@
+Git commit graph format
+=======================
+
+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".
+
+- The root tree OID.
+
+- The commit date.
+
+- The parents of the commit, stored using positional references within
+  the graph file.
+
+== graph-*.graph files have the following format:
+
+In order to allow extensions that add extra data to the graph, we organize
+the body into "chunks" and provide a binary lookup table at the beginning
+of the body. The header includes certain values, such as number of chunks,
+hash lengths and types.
+
+All 4-byte numbers are in network order.
+
+HEADER:
+
+  4-byte signature:
+  The signature is: {'C', 'G', 'P', 'H'}
+
+  1-byte version number:
+  Currently, the only valid version is 1.
+
+  1-byte Object Id Version (1 = SHA-1)
+
+  1-byte Object Id Length (H)
+
+  1-byte number (C) of "chunks"
+
+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide offset in current file for chunk to start.
+  (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.)
+
+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds since EPOCH. The generation number
+  uses the higher 30 bits of the first 4 bytes, while the commit
+  time uses the 32 bits of the second 4 bytes, along with the lowest
+  2 bits of the lowest byte, storing the 33rd and 34th bit of the
+  commit time.
+
+  Large Edge List (ID: {'E', 'D', 'G', 'E'})
+  This list of 4-byte values store the second through nth parents for
+  all octopus merges. The second parent value in the commit data stores
+  an array position within this list along with the most-significant bit
+  on. Starting at that array position, iterate through this list of int-ids
+  for the parents until reaching a value with the most-significant bit on.
+  The other bits correspond to the int-id of the last parent. This chunk
+  should always be present, but may be empty.
+
+TRAILER:
+
+   H-byte HASH-checksum of all of the above.
+
-- 
2.15.1.45.g9b7079f
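
To make the Commit Data and Large Edge List layouts above concrete, here is a hedged decoding sketch, not code from the series. It assumes the 8-byte generation/date field is read as two network-order 4-byte words, and that the caller has already cleared the most-significant bit of the second-parent value to obtain the starting position in the edge list. All names (read_be32, decode_generation, decode_commit_time, read_large_edges, EDGE_LAST_MASK) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 4-byte value stored in network (big-endian) order. */
uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* The generation number occupies the upper 30 bits of the first word. */
uint32_t decode_generation(const unsigned char *gen_and_date)
{
	return read_be32(gen_and_date) >> 2;
}

/*
 * The commit time is 34 bits wide: the lowest 2 bits of the first word
 * are bits 33 and 34, and the second word holds the low 32 bits.
 */
uint64_t decode_commit_time(const unsigned char *gen_and_date)
{
	uint64_t high = read_be32(gen_and_date) & 0x3;
	uint64_t low = read_be32(gen_and_date + 4);

	return (high << 32) | low;
}

/* Hypothetical name for the "last parent" marker bit. */
#define EDGE_LAST_MASK 0x80000000u

/*
 * Collect parent int-ids from the Large Edge List, starting at array
 * position 'pos' and stopping after the entry whose most-significant
 * bit is set. Returns the number of ids written to 'out'.
 */
size_t read_large_edges(const unsigned char *edges, size_t nr_entries,
			uint32_t pos, uint32_t *out, size_t out_cap)
{
	size_t n = 0;

	assert(edges && out);
	while (pos < nr_entries && n < out_cap) {
		uint32_t value = read_be32(edges + 4 * (size_t)pos);

		out[n++] = value & ~EDGE_LAST_MASK;
		if (value & EDGE_LAST_MASK)
			break;
		pos++;
	}
	return n;
}
```

The bounds checks on nr_entries and out_cap are a defensive choice for the sketch; a file with a missing terminator bit would otherwise run off the end of the chunk.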



Re: [PATCH v3 03/14] commit-graph: create git-commit-graph builtin

2018-02-08 Thread Derrick Stolee

On 2/8/2018 4:27 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a '--pack-dir' option.

Why do we want to use "pack" dir, when this is specifically designed
not tied to packfile?  .git/objects/pack/ certainly is a possibility
in the sense that anywhere inside .git/objects/ would make sense,
but using the "pack" dir smells like signalling a wrong message to
users.



I wanted to have the smallest possible footprint in the objects 
directory, and the .git/objects directory currently only holds folders.


I suppose this feature, along with the multi-pack-index (MIDX), extends 
the concept of the pack directory to be a "compressed data" directory, 
but keeps the "pack" name to be compatible with earlier versions.


Another option is to create a .git/objects/graph directory instead, but 
then we need to worry about that directory being present.


Thanks,
-Stolee


[PATCH v3 09/14] commit-graph: implement --delete-expired

2018-02-08 Thread Derrick Stolee
Teach git-commit-graph to delete the graph files in the pack directory
that were not referenced by 'graph_head' during this process. This cleans
up space for the user while not causing race conditions with other running
Git processes that may be referencing the previous graph file.

To delete old graph files, a user (or managing process) would call

git commit-graph write --update-head --delete-expired

but there is some responsibility that the caller must consider. Specifically,
ensure that processes that started before a previous 'commit-graph write'
command have completed. Otherwise, they may have an open handle on a graph file
that will be deleted by the new call.
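
The selection rule described here (keep the old and the new graph file, delete every other graph file) can be sketched as a pure predicate. This is an illustration only: the patch compares full paths while scanning the pack directory, whereas this sketch compares base names, and the function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if 'name' has the form "graph-<something>.graph". */
int is_graph_file_name(const char *name)
{
	const char *prefix = "graph-", *suffix = ".graph";
	size_t len = strlen(name);
	size_t plen = strlen(prefix), slen = strlen(suffix);

	/* Require at least one character between prefix and suffix. */
	if (len <= plen + slen)
		return 0;
	return !strncmp(name, prefix, plen) &&
	       !strcmp(name + len - slen, suffix);
}

/*
 * Return 1 if 'name' is a graph file that matches neither the
 * previous graph file (old_name, may be NULL) nor the newly written
 * one (new_name), and is therefore a deletion candidate.
 */
int is_expired_graph(const char *name, const char *old_name,
		     const char *new_name)
{
	assert(name && new_name);
	if (!is_graph_file_name(name))
		return 0;
	if (old_name && !strcmp(name, old_name))
		return 0;
	return strcmp(name, new_name) != 0;
}
```

Keeping the previous file around is what protects processes that opened it before graph-head was updated; the caller is still responsible for the older-process caveat stated above.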

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 --
 builtin/commit-graph.c | 73 --
 t/t5318-commit-graph.sh|  7 ++--
 3 files changed, 84 insertions(+), 7 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 8c2cbbc923..7ae8f7484d 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,6 +37,11 @@ checksum hash of the written file.
 +
 With `--update-head` option, update the graph-head file to point
 to the written graph file.
++
+With the `--delete-expired` option, delete the graph files in the pack
+directory that are not referred to by the graph-head file. To avoid race
+conditions, do not delete the file previously referred to by the
+graph-head file if it is updated by the `--update-head` option.
 
 'read'::
 
@@ -60,11 +65,11 @@ EXAMPLES
 $ git commit-graph write
 
 
-* Write a graph file for the packed commits in your local .git folder
-* and update graph-head.
+* Write a graph file for the packed commits in your local .git folder,
+* update graph-head, and delete stale graph files.
 +
 
-$ git commit-graph write --update-head
+$ git commit-graph write --update-head --delete-expired
 
 
 * Read basic information from a graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 529cb80de6..15f647fd81 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -9,7 +9,7 @@ static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
N_("git commit-graph clear [--pack-dir ]"),
N_("git commit-graph read [--graph-hash=]"),
-   N_("git commit-graph write [--pack-dir ] [--update-head]"),
+   N_("git commit-graph write [--pack-dir ] [--update-head] 
[--delete-expired]"),
NULL
 };
 
@@ -24,7 +24,7 @@ static const char * const builtin_commit_graph_read_usage[] = 
{
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--pack-dir ] [--update-head]"),
+   N_("git commit-graph write [--pack-dir ] [--update-head] 
[--delete-expired]"),
NULL
 };
 
@@ -32,6 +32,7 @@ static struct opts_commit_graph {
const char *pack_dir;
const char *graph_hash;
int update_head;
+   int delete_expired;
 } opts;
 
 static int graph_clear(int argc, const char **argv)
@@ -153,9 +154,68 @@ static void update_head_file(const char *pack_dir, const 
struct object_id *graph
commit_lock_file(&lk);
 }
 
+/*
+ * To avoid race conditions and deleting graph files that are being
+ * used by other processes, look inside a pack directory for all files
+ * of the form "graph-.graph" that do not match the old or new
+ * graph hashes and delete them.
+ */
+static void do_delete_expired(const char *pack_dir,
+ struct object_id *old_graph_hash,
+ struct object_id *new_graph_hash)
+{
+   DIR *dir;
+   struct dirent *de;
+   int dirnamelen;
+   struct strbuf path = STRBUF_INIT;
+   char *old_graph_path, *new_graph_path;
+
+   if (old_graph_hash)
+   old_graph_path = get_commit_graph_filename_hash(pack_dir, 
old_graph_hash);
+   else
+   old_graph_path = NULL;
+   new_graph_path = get_commit_graph_filename_hash(pack_dir, 
new_graph_hash);
+
+   dir = opendir(pack_dir);
+   if (!dir) {
+   if (errno != ENOENT)
+   error_errno("unable to open object pack directory: %s",
+   pack_dir);
+   return;
+   }
+
+   strbuf_addstr(&path, pack_dir);
+   strbuf_addch(&path, '/');
+
+   dirnamelen = path.len;
+   while ((de = readdir(dir)) != NULL) {
+   size_t base_len;
+
+   if (is_dot_or_dotdot(de->d_name))
+   continue;
+
+   strbuf_setlen(&path, dirnamelen);
+   strbuf_addstr(&path, de->

Re: [PATCH v3 01/14] commit-graph: add format document

2018-02-08 Thread Derrick Stolee

On 2/8/2018 4:21 PM, Junio C Hamano wrote:

Derrick Stolee <sto...@gmail.com> writes:


Add document specifying the binary format for commit graphs. This
format allows for:

* New versions.
* New hash functions and hash lengths.

It still is unclear, at least to me, why OID and OID length are
stored as if they can be independent.  If a reader does not
understand a new Object Id hash, is there anything the reader can
still do by knowing how long the hash (which it cannot recompute to
validate) is?  And if a reader does know what OID hashing scheme is
used to refer to the objects, it certainly would know how long the
OIDs are.

Giving length may make sense only when a reader can treat these OIDs
as completely opaque identifiers, without having to (re)hash from
the contents, but if that is the case, then there is no point saying
what exact hash function is used to compute OID.

So I'd understand storing only either one or the other, but not
both.  Am I missing something?


You're right that this data is redundant. It is easy to describe the 
width of the tables using the OID length, so it is convenient to have 
that part of the format. Also, it is good to have 4-byte alignment here, 
so we are not wasting space.


There isn't a strong reason to put that here, but I don't have a great 
reason to remove it, either.


Perhaps leave a byte blank for possible future use?




+The Git commit graph stores a list of commit OIDs and some associated
+metadata, including:
+
+- The generation number of the commit. Commits with no parents have
+  generation number 1; commits with parents have generation number
+  one more than the maximum generation number of its parents. We
+  reserve zero as special, and can be used to mark a generation
+  number invalid or as "not computed".

This "most natural" definition of generation number is stricter than
absolutely necessary (a looser definition that is sufficient is
"gennum of a child is larger than all of its parents'").  While I
personally think that is OK, some people who floated different ideas
in previous discussions on generation numbers may want to articulate
their ideas again.  One idea that I found clever was to use the
total number of commits that are ancestors of a commit instead (it
is far more expensive to compute than the most natural gennum, but
doing so may help other topology math, like "describe").


It is more difficult to compute the number of reachable commits, since 
you cannot learn that only by looking at the parents (you need to know 
how many commits are in the intersection of their reachable sets for a 
two-parent merge, or just walk all of the commits). This leads to a 
quadratic computation to discover the value for N commits.


I define it this rigidly now because I will submit a patch soon after 
this one lands that computes generation numbers and consumes them in 
paint_down_to_common(). I've got it sitting in my local repo ready for a 
rebase.
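
To illustrate why the rigid definition is cheap to compute, here is a small sketch on a toy graph. It is not the forthcoming patch. It assumes commits are stored so that every parent appears at a lower index than its child, and it handles at most two parents; struct toy_commit, NO_PARENT and compute_generations() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARENTS 2			/* sketch only: no octopus merges */
#define NO_PARENT ((size_t)-1)		/* hypothetical "no parent" marker */

struct toy_commit {
	size_t parents[MAX_PARENTS];	/* indexes into the same array */
	uint32_t generation;		/* 0 means "not computed" */
};

/*
 * Assign generation numbers in one pass: 1 for commits without
 * parents, otherwise one more than the largest parent generation.
 * Relies on the assumption that parents precede children in the array.
 */
void compute_generations(struct toy_commit *commits, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		uint32_t max_parent = 0;

		for (j = 0; j < MAX_PARENTS; j++) {
			size_t p = commits[i].parents[j];

			if (p == NO_PARENT)
				continue;
			assert(p < i);
			if (commits[p].generation > max_parent)
				max_parent = commits[p].generation;
		}
		commits[i].generation = max_parent + 1;
	}
}
```

Each commit only inspects its own parents, so the work is linear in the number of parent edges. Counting reachable commits cannot be done this way because the parents' reachable sets overlap.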





+CHUNK LOOKUP:
+
+  (C + 1) * 12 bytes listing the table of contents for the chunks:
+  First 4 bytes describe chunk id. Value 0 is a terminating label.
+  Other 8 bytes provide offset in current file for chunk to start.
+  (Chunks are ordered contiguously in the file, so you can infer
+  the length using the next chunk position if necessary.)

Aren't chunks numbered contiguously, starting from #1, thereby
making it unnecessary to store the 4-byte?

How does a reader obtain the length of the last chunk?  Ahh, that is
why there are C+1 entries in this table, not just C, so that the
reader knows where to stop while reading the last one.  Does that
mean that this table looks like this?

 { 1, offset_1 },

 { 2, offset_2 },
 ...
 { C, offset_C },
 { 0, offset_C+1 },

where (offset_N+1 - offset_N) gives the length of chunk #N?


This is correct.
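
A reader for that table could look roughly like the sketch below. It is a sketch under assumptions rather than the code in the series: the 8-byte offsets are assumed to be stored in network order like the 4-byte values, chunks are assumed to be listed in file order, and struct chunk_info, get_be32(), get_be64() and parse_chunk_table() are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct chunk_info {
	uint32_t id;		/* 4-byte chunk id, e.g. 'O','I','D','F' */
	uint64_t offset;	/* start of the chunk within the file */
	uint64_t length;	/* inferred from the next table entry */
};

uint32_t get_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Assumption: 8-byte offsets use network order as well. */
uint64_t get_be64(const unsigned char *p)
{
	return ((uint64_t)get_be32(p) << 32) | get_be32(p + 4);
}

/*
 * Parse a lookup table of (nr_chunks + 1) entries of 12 bytes each.
 * Fills out[0..nr_chunks-1] and returns nr_chunks, or -1 if an offset
 * decreases or the final entry does not carry the terminating id 0.
 */
int parse_chunk_table(const unsigned char *table, size_t nr_chunks,
		      struct chunk_info *out)
{
	size_t i;

	assert(table && out);
	for (i = 0; i < nr_chunks; i++) {
		const unsigned char *entry = table + 12 * i;
		uint64_t start = get_be64(entry + 4);
		uint64_t next = get_be64(entry + 12 + 4);

		if (next < start)
			return -1;
		out[i].id = get_be32(entry);
		out[i].offset = start;
		out[i].length = next - start;
	}
	if (get_be32(table + 12 * nr_chunks) != 0)
		return -1;
	return (int)nr_chunks;
}
```

The extra terminating entry is what lets the last chunk's length be computed the same way as every other chunk's.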




+  The remaining data in the body is described one chunk at a time, and
+  these chunks may be given in any order. Chunks are required unless
+  otherwise specified.
+
+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+* The first H bytes are for the OID of the root tree.
+* The next 8 bytes are for the int-ids of the first two parents
+  of the ith commit. Stores value 0x if no parent in that
+  position. If there are more than two parents, the second value
+  has its most-significant bit on and the other bits store an array
+  position into the Large Edge List chunk.
+* The next 8 bytes store the generation number of the commit and
+  the commit time in seconds 

[PATCH v3 03/14] commit-graph: create git-commit-graph builtin

2018-02-08 Thread Derrick Stolee
Teach git the 'commit-graph' builtin that will be used for writing and
reading packed graph files. The current implementation is mostly
empty, except for a '--pack-dir' option.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 .gitignore |  1 +
 Documentation/git-commit-graph.txt | 11 +++
 Makefile   |  1 +
 builtin.h  |  1 +
 builtin/commit-graph.c | 37 +
 command-list.txt   |  1 +
 git.c  |  1 +
 7 files changed, 53 insertions(+)
 create mode 100644 Documentation/git-commit-graph.txt
 create mode 100644 builtin/commit-graph.c

diff --git a/.gitignore b/.gitignore
index 833ef3b0b7..e82f90184d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -34,6 +34,7 @@
 /git-clone
 /git-column
 /git-commit
+/git-commit-graph
 /git-commit-tree
 /git-config
 /git-count-objects
diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
new file mode 100644
index 00..e1c3078ca1
--- /dev/null
+++ b/Documentation/git-commit-graph.txt
@@ -0,0 +1,11 @@
+git-commit-graph(1)
+===================
+
+NAME
+
+git-commit-graph - Write and verify Git commit graphs (.graph files)
+
+GIT
+---
+Part of the linkgit:git[1] suite
+
diff --git a/Makefile b/Makefile
index ee9d5eb11e..fc40b816dc 100644
--- a/Makefile
+++ b/Makefile
@@ -932,6 +932,7 @@ BUILTIN_OBJS += builtin/clone.o
 BUILTIN_OBJS += builtin/column.o
 BUILTIN_OBJS += builtin/commit-tree.o
 BUILTIN_OBJS += builtin/commit.o
+BUILTIN_OBJS += builtin/commit-graph.o
 BUILTIN_OBJS += builtin/config.o
 BUILTIN_OBJS += builtin/count-objects.o
 BUILTIN_OBJS += builtin/credential.o
diff --git a/builtin.h b/builtin.h
index 42378f3aa4..079855b6d4 100644
--- a/builtin.h
+++ b/builtin.h
@@ -149,6 +149,7 @@ extern int cmd_clone(int argc, const char **argv, const 
char *prefix);
 extern int cmd_clean(int argc, const char **argv, const char *prefix);
 extern int cmd_column(int argc, const char **argv, const char *prefix);
 extern int cmd_commit(int argc, const char **argv, const char *prefix);
+extern int cmd_commit_graph(int argc, const char **argv, const char *prefix);
 extern int cmd_commit_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_config(int argc, const char **argv, const char *prefix);
 extern int cmd_count_objects(int argc, const char **argv, const char *prefix);
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
new file mode 100644
index 00..a01c5d9981
--- /dev/null
+++ b/builtin/commit-graph.c
@@ -0,0 +1,37 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+
+static char const * const builtin_commit_graph_usage[] = {
+   N_("git commit-graph [--pack-dir ]"),
+   NULL
+};
+
+static struct opts_commit_graph {
+   const char *pack_dir;
+} opts;
+
+
+int cmd_commit_graph(int argc, const char **argv, const char *prefix)
+{
+   static struct option builtin_commit_graph_options[] = {
+   { OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
+   N_("dir"),
+   N_("The pack directory to store the graph") },
+   OPT_END(),
+   };
+
+   if (argc == 2 && !strcmp(argv[1], "-h"))
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+
+   git_config(git_default_config, NULL);
+   argc = parse_options(argc, argv, prefix,
+builtin_commit_graph_options,
+builtin_commit_graph_usage,
+PARSE_OPT_STOP_AT_NON_OPTION);
+
+   usage_with_options(builtin_commit_graph_usage,
+  builtin_commit_graph_options);
+}
+
diff --git a/command-list.txt b/command-list.txt
index a1fad28fd8..835c5890be 100644
--- a/command-list.txt
+++ b/command-list.txt
@@ -34,6 +34,7 @@ git-clean   mainporcelain
 git-clone   mainporcelain   init
 git-column  purehelpers
 git-commit  mainporcelain   history
+git-commit-graphplumbingmanipulators
 git-commit-tree plumbingmanipulators
 git-config  ancillarymanipulators
 git-count-objects   ancillaryinterrogators
diff --git a/git.c b/git.c
index 9e96dd4090..d4832c1e0d 100644
--- a/git.c
+++ b/git.c
@@ -388,6 +388,7 @@ static struct cmd_struct commands[] = {
{ "clone", cmd_clone },
{ "column", cmd_column, RUN_SETUP_GENTLY },
{ "commit", cmd_commit, RUN_SETUP | NEED_WORK_TREE },
+   { "commit-graph", cmd_commit_graph, RUN_SETUP },
{ "commit-tree&

[PATCH v3 00/14] Serialized Git Commit Graph

2018-02-08 Thread Derrick Stolee
Thanks to everyone who gave comments on v1 and v2.

Hopefully the following points have been addressed:

* Fixed inter-commit problems where certain fixes did not arrive until
  later commits.

* Converted from submode flags ("git commit-graph --write") to
  subcommands ("git commit-graph write").

* Fixed a bug where a non-commit OID would cause a segfault when using
  --stdin-commits. Added a test for an annotated tag.

* Numerous style issues, especially in the test script.

I also based my patches on the branch jt/binsearch-with-fanout to make
use of the bsearch_hash() method.
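
For readers unfamiliar with fanout-assisted lookup, the general idea can be sketched as follows. This is not the implementation of bsearch_hash(); it is a simplified illustration that assumes the fanout table has already been converted to host byte order, and find_oid() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Look up 'oid' in a sorted table of fixed-width OIDs. fanout[i] holds
 * the number of OIDs whose first byte is at most i, so the candidates
 * for first byte b live in positions [fanout[b - 1], fanout[b]).
 * Returns the position of the OID, or -1 if it is not present.
 */
long find_oid(const unsigned char *oid, const uint32_t fanout[256],
	      const unsigned char *table, size_t oid_len)
{
	uint32_t lo = oid[0] ? fanout[oid[0] - 1] : 0;
	uint32_t hi = fanout[oid[0]];

	assert(lo <= hi);
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, table + (size_t)mid * oid_len, oid_len);

		if (!cmp)
			return (long)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}
```

The fanout narrows the binary search to OIDs sharing the first byte, which for evenly distributed hashes saves roughly eight comparison steps per lookup.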

I look forward to your feedback.

Thanks,
-Stolee

-- >8 --

As promised [1], this patch contains a way to serialize the commit graph.
The current implementation defines a new file format to store the graph
structure (parent relationships) and basic commit metadata (commit date,
root tree OID) in order to prevent parsing raw commits while performing
basic graph walks. For example, we do not need to parse the full commit
when performing these walks:

* 'git log --topo-order -1000' walks all reachable commits to avoid
  incorrect topological orders, but only needs the commit message for
  the top 1000 commits.

* 'git merge-base <A> <B>' may walk many commits to find the correct
  boundary between the commits reachable from A and those reachable
  from B. No commit messages are needed.

* 'git branch -vv' checks ahead/behind status for all local branches
  compared to their upstream remote branches. This is essentially as
  hard as computing merge bases for each.

The current patch speeds up these calculations by injecting a check in
parse_commit_gently() to check if there is a graph file and using that
to provide the required metadata to the struct commit.

The file format has room to store generation numbers, which will be
provided as a patch after this framework is merged. Generation numbers
are referenced by the design document but not implemented in order to
make the current patch focus on the graph construction process. Once
that is stable, it will be easier to add generation numbers and make
graph walks aware of generation numbers one-by-one.

Here are some performance results for a copy of the Linux repository
where 'master' has 704,766 reachable commits and is behind 'origin/master'
by 19,610 commits.

| Command                          | Before | After  | Rel % |
|----------------------------------|--------|--------|-------|
| log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
| branch -vv                       |  0.42s |  0.27s | -35%  |
| rev-list --all                   |  6.4s  |  1.0s  | -84%  |
| rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

To test this yourself, run the following on your repo:

  git config core.commitGraph true
  git show-ref -s | git commit-graph write --update-head --stdin-commits

The second command writes a commit graph file containing every commit
reachable from your refs. Now, all git commands that walk commits will
check your graph first before consulting the ODB. You can run your own
performance comparisons by toggling the 'core.commitgraph' setting.

[1] 
https://public-inbox.org/git/d154319e-bb9e-b300-7c37-27b1dcd2a...@jeffhostetler.com/
Re: What's cooking in git.git (Jan 2018, #03; Tue, 23)

[2] https://github.com/derrickstolee/git/pull/2
A GitHub pull request containing the latest version of this patch.

Derrick Stolee (14):
  commit-graph: add format document
  graph: add commit graph design document
  commit-graph: create git-commit-graph builtin
  commit-graph: implement write_commit_graph()
  commit-graph: implement 'git-commit-graph write'
  commit-graph: implement 'git-commit-graph read'
  commit-graph: update graph-head during write
  commit-graph: implement 'git-commit-graph clear'
  commit-graph: implement --delete-expired
  commit-graph: add core.commitGraph setting
  commit: integrate commit graph with commit parsing
  commit-graph: close under reachability
  commit-graph: read only from specific pack-indexes
  commit-graph: build graph from starting commits

 .gitignore  |   1 +
 Documentation/config.txt|   3 +
 Documentation/git-commit-graph.txt  | 115 
 Documentation/technical/commit-graph-format.txt |  91 +++
 Documentation/technical/commit-graph.txt| 189 ++
 Makefile|   2 +
 alloc.c |   1 +
 builtin.h   |   1 +
 builtin/commit-graph.c  | 335 ++
 cache.h |   1 +
 command-list.txt|   1 +
 commit-graph.c  | 828 
 commit-graph.h  |  60 ++
 commit.c|   3 +
 commit.h  

[PATCH v3 07/14] commit-graph: update graph-head during write

2018-02-08 Thread Derrick Stolee
It is possible to have multiple commit graph files in a pack directory,
but only one is important at a time. Use a 'graph_head' file to point
to the important file. Teach git-commit-graph to write 'graph_head' upon
writing a new commit graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt | 11 ++-
 builtin/commit-graph.c | 27 +--
 commit-graph.c |  8 
 commit-graph.h |  1 +
 t/t5318-commit-graph.sh| 25 +++--
 5 files changed, 63 insertions(+), 9 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 67e107f06a..5e32c43b27 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -33,7 +33,9 @@ COMMANDS
 Write a commit graph file based on the commits found in packfiles.
 Includes all commits from the existing commit graph file. Outputs the
 checksum hash of the written file.
-
++
+With `--update-head` option, update the graph-head file to point
+to the written graph file.
 
 'read'::
 
@@ -53,6 +55,13 @@ EXAMPLES
 $ git commit-graph write
 
 
+* Write a graph file for the packed commits in your local .git folder
+* and update graph-head.
++
+
+$ git commit-graph write --update-head
+
+
 * Read basic information from a graph file.
 +
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 3ffa7ec433..776ca087e8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -1,12 +1,13 @@
 #include "builtin.h"
 #include "config.h"
+#include "lockfile.h"
 #include "parse-options.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--pack-dir ]"),
N_("git commit-graph read [--graph-hash=]"),
-   N_("git commit-graph write [--pack-dir ]"),
+   N_("git commit-graph write [--pack-dir ] [--update-head]"),
NULL
 };
 
@@ -16,13 +17,14 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--pack-dir ]"),
+   N_("git commit-graph write [--pack-dir ] [--update-head]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *pack_dir;
const char *graph_hash;
+   int update_head;
 } opts;
 
 static int graph_read(int argc, const char **argv)
@@ -87,6 +89,22 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+static void update_head_file(const char *pack_dir, const struct object_id 
*graph_hash)
+{
+   int fd;
+   struct lock_file lk = LOCK_INIT;
+   char *head_fname = get_graph_head_filename(pack_dir);
+
+   fd = hold_lock_file_for_update(&lk, head_fname, LOCK_DIE_ON_ERROR);
+   FREE_AND_NULL(head_fname);
+
+   if (fd < 0)
+   die_errno("unable to open graph-head");
+
+   write_in_full(fd, oid_to_hex(graph_hash), GIT_MAX_HEXSZ);
+   commit_lock_file(&lk);
+}
+
 static int graph_write(int argc, const char **argv)
 {
struct object_id *graph_hash;
@@ -95,6 +113,8 @@ static int graph_write(int argc, const char **argv)
{ OPTION_STRING, 'p', "pack-dir", &opts.pack_dir,
N_("dir"),
N_("The pack directory to store the graph") },
+   OPT_BOOL('u', "update-head", &opts.update_head,
+   N_("update graph-head to written graph file")),
OPT_END(),
};
 
@@ -111,6 +131,9 @@ static int graph_write(int argc, const char **argv)
 
graph_hash = write_commit_graph(opts.pack_dir);
 
+   if (opts.update_head)
+   update_head_file(opts.pack_dir, graph_hash);
+
if (graph_hash) {
printf("%s\n", oid_to_hex(graph_hash));
FREE_AND_NULL(graph_hash);
diff --git a/commit-graph.c b/commit-graph.c
index 9a337cea4d..9789fe37f9 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -38,6 +38,14 @@
 #define GRAPH_MIN_SIZE (GRAPH_CHUNKLOOKUP_SIZE + GRAPH_FANOUT_SIZE + \
GRAPH_OID_LEN + 8)
 
+char *get_graph_head_filename(const char *pack_dir)
+{
+   struct strbuf fname = STRBUF_INIT;
+   strbuf_addstr(&fname, pack_dir);
+   strbuf_addstr(&fname, "/graph-head");
+   return strbuf_detach(&fname, 0);
+}
+
 char* get_commit_graph_filename_hash(const char *pack_dir,
 struct object_id *hash)
 {
diff --git a/commit-graph.h b/commit-graph.h
index c1608976b3..726

[PATCH v3 12/14] commit-graph: close under reachability

2018-02-08 Thread Derrick Stolee
Teach write_commit_graph() to walk all parents from the commits
discovered in packfiles. This prevents gaps given by loose objects or
previously-missed packfiles.

Also automatically add commits from the existing graph file, if it
exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 37 +
 1 file changed, 37 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index aff67c458e..d711a2cd81 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -633,6 +633,28 @@ static int if_packed_commit_add_to_list(const struct object_id *oid,
return 0;
 }
 
+static void close_reachable(struct packed_oid_list *oids)
+{
+   int i;
+   struct rev_info revs;
+   struct commit *commit;
+   init_revisions(&revs, NULL);
+   for (i = 0; i < oids->nr; i++) {
+   commit = lookup_commit(oids->list[i]);
+   if (commit && !parse_commit(commit))
+   revs.commits = commit_list_insert(commit, &revs.commits);
+   }
+
+   if (prepare_revision_walk(&revs))
+   die(_("revision walk setup failed"));
+
+   while ((commit = get_revision(&revs)) != NULL) {
+   ALLOC_GROW(oids->list, oids->nr + 1, oids->alloc);
+   oids->list[oids->nr] = &(commit->object.oid);
+   (oids->nr)++;
+   }
+}
+
 struct object_id *write_commit_graph(const char *pack_dir)
 {
struct packed_oid_list oids;
@@ -650,12 +672,27 @@ struct object_id *write_commit_graph(const char *pack_dir)
char *fname;
struct commit_list *parent;
 
+   prepare_commit_graph();
+
oids.nr = 0;
oids.alloc = 1024;
+
+   if (commit_graph && oids.alloc < commit_graph->num_commits)
+   oids.alloc = commit_graph->num_commits;
+
ALLOC_ARRAY(oids.list, oids.alloc);
 
+   if (commit_graph) {
+   for (i = 0; i < commit_graph->num_commits; i++) {
+   oids.list[i] = malloc(sizeof(struct object_id));
+   get_nth_commit_oid(commit_graph, i, oids.list[i]);
+   }
+   oids.nr = commit_graph->num_commits;
+   }
+
for_each_packed_object(if_packed_commit_add_to_list, &oids, 0);
 
+   close_reachable(&oids);
QSORT(oids.list, oids.nr, commit_compare);
 
count_distinct = 1;
-- 
2.15.1.45.g9b7079f
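
For readers following the "close under reachability" idea outside of git's
revision-walk machinery, here is a minimal standalone sketch. It is not the
code from the patch: `toy_commit`, `id_list`, `id_list_append()` and
`toy_close_reachable()` are hypothetical names invented for illustration, and
commits are plain integer indexes into a table rather than object ids. It
only shows the general technique of extending a list until every parent of a
listed commit is also listed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_PARENTS 2

/* Hypothetical toy commit: parents are indexes into a commit table, -1 if unused. */
struct toy_commit {
	int parents[MAX_PARENTS];
};

/* Growable list of commit indexes, loosely modelled on a nr/alloc pair. */
struct id_list {
	int *list;
	size_t nr, alloc;
};

static void id_list_append(struct id_list *ids, int id)
{
	if (ids->nr == ids->alloc) {
		ids->alloc = ids->alloc ? 2 * ids->alloc : 16;
		ids->list = realloc(ids->list, ids->alloc * sizeof(*ids->list));
		assert(ids->list);
	}
	ids->list[ids->nr++] = id;
}

/*
 * Extend ids so that every ancestor of a listed commit is also listed.
 * The list doubles as a work queue: entries appended during the loop
 * are visited later by the same loop.
 */
static void toy_close_reachable(const struct toy_commit *table,
				size_t table_nr, struct id_list *ids)
{
	char *seen;
	size_t i;
	int p;

	if (!table_nr)
		return;

	seen = calloc(table_nr, 1);
	assert(seen);

	/* Mark the commits we start with so they are not appended again. */
	for (i = 0; i < ids->nr; i++)
		seen[ids->list[i]] = 1;

	for (i = 0; i < ids->nr; i++) {
		const struct toy_commit *c = &table[ids->list[i]];

		for (p = 0; p < MAX_PARENTS; p++) {
			int parent = c->parents[p];

			if (parent >= 0 && !seen[parent]) {
				seen[parent] = 1;
				id_list_append(ids, parent);
			}
		}
	}

	free(seen);
}
```

Starting from a list holding only a merge commit, the call appends both of
its parents and then their ancestors, so the list ends up covering the whole
history reachable from that merge. The sketch uses a `seen` array to avoid
appending duplicates; the patch above instead appears to tolerate duplicates,
since the list is sorted and distinct entries are counted afterwards.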


